
Understanding the Packet Forwarding Capability of General-Purpose Processors

Katerina Argyraki, Kevin Fall, Gianluca Iannaccone, Allan Knies, Maziar Manesh, Sylvia Ratnasamy

EPFL, Intel Research

Abstract: Compared to traditional high-end network equipment built on specialized hardware, software routers running on commodity servers offer significant advantages: lower costs due to large-volume manufacturing, a widespread supply/support chain, and, most importantly, programmability and extensibility. The challenge is scaling software-router performance to carrier-level speeds. As a first step, in this paper, we study the packet-processing capability of modern commodity servers; we identify the packet-processing bottlenecks, examine to what extent these can be alleviated through upcoming technology advances, and discuss what further changes are needed to take software routers beyond the small enterprise.

    1 Introduction

    To what extent are general-purpose processors capable of

    high-speed packet processing? The answer to this ques-

    tion could have significant implications for how future

    network infrastructure is built. To date, the development

    of network equipment (switches, routers, various middle-boxes) has focussed primarily on achieving high perfor-

    mance for relatively limited forms of packet processing.

    However, as networks take on increasingly sophisticated

    functionality (e.g., data loss protection, application ac-

    celeration, intrusion detection), and as major ISPs com-

    pete in offering new services (e.g., video, mobility sup-

    port services), there is an increasing need for network

    equipment that is programmable and extensible. And in-

    deed, both industry and research have already taken ini-

    tial steps to tackle the issue [4, 6, 7, 9, 21].

    In current networking equipment, high performance

and programmability are competing goals, if not mutually exclusive. On the one hand, we have high-end switches and routers that rely on specialized hardware

    and software and offer high performance, but are no-

    toriously difficult to extend, program, or otherwise ex-

    periment with. On the other hand, we have software

    routers, where all significant packet-processing steps are

    performed in software running on commodity PC/server

    platforms; these are, of course, easily programmable, but

    only suitable for low-packet-rate environments such as

    small enterprises [6].

    The challenge of building network infrastructure that

    is both programmable and capable of high performance

    can be approached from one of two extreme starting

    points. One approach would be to start with existing

    high-end, specialized devices and retro-fit programma-

    bility into them. For example, some router vendors have

    announced plans to support limited APIs that will allow

    third-party developers to change/extend the software part

    of their products (which does not typically involve core

    packet processing) [7,9]. A larger degree of programma-

bility is possible with network-processor chips, which offer a semi-specialized option, i.e., implement only

    the most expensive packet-processing operations in spe-

    cialized hardware and run the rest on programmable pro-

    cessors. While certainly an improvement we note that,

    in practice, network processors have proven hard to pro-

    gram: in the best case, the programmer needs to learn a

    new language; in the worst, she must be aware of (and

    program to avoid) low-level issues like resource con-

    tention during parallel execution or expensive memory

    accesses [14, 16].

    From the opposite end of the spectrum, a different ap-

    proach would be to start with software routers and op-

    timize their packet-processing performance. The allure

    of this approach is that it would allow network infras-

    tructure to tap into the many desirable properties of the

    PC-based ecosystem, including lower costs due to large-

    volume manufacturing, rapid advances in power manage-

    ment, familiar programming environment and operating

    systems, and a widespread supply/support chain. In other

    words, if feasible, this approach could enable a network

    infrastructure that is programmable in much the same

    way as end-systems are today. The challenge is taking

    this approach beyond the small enterprise, i.e., scaling

    PC/server packet-processing performance to carrier-level

speeds.

It is perhaps too early to tell which approach dom-

inates; in fact, it's more likely that each approach re-

    sults in different tradeoffs between programmability and

    performance, and these tradeoffs will cause each to be

    adopted where appropriate. As yet, however, there has

    been little research exposing what tradeoffs are achiev-

    able. As a first step in this direction, in this paper, we

    explore the performance limitations for packet process-

    ing on commodity servers.


    A legitimate question at this point is whether the per-

    formance requirements for network equipment are just

too high and our exploration is a fool's errand. The bar

    is indeed high: in terms of individual link/port speeds,

    10Gbps is already widespread and 40Gbps is being de-

    ployed at major ISPs; in terms of aggregate switching

speeds, carrier-grade routers range from 40Gbps to a high of 92Tbps! Two developments, however, lend us

    hope. The first is a recent research proposal [11] that

    presents a solution whereby a cluster of N servers can

    be interconnected to achieve aggregate switching speeds

    of NR bps, provided each server can process packets at

    a rate on the order of R bps. This result implies that, in

    order to scale software routers, it is sufficient to scale a

    single server to individual line speeds (10-40Gbps) rather

    than aggregate speeds (40Gbps-92Tbps). This reduction

    makes for a much more plausible target.

    Secondly, we expect that the current trajectory in

    server technology trends will work in favor of packet-

processing workloads. For example, packet processing appears naturally suited to exploiting the tremendous

    computational power that multicore processors offer par-

    allel applications. Similarly, I/O bandwidth has gained

    tremendously by the transition from PCI-X to PCIe al-

    lowing 10Gbps Ethernet NICs to enter the PC market [1].

    And finally, as we discuss in Section 4, the impending ar-

    rival of multiprocessor architectures with multiple inde-

    pendent memory controllers should offer a similar boost

    in available memory bandwidth.

    While there is widespread awareness of these ad-

    vances in server technology, we find little comprehensive

    evaluation of how these advances can/do translate into

    performance improvements for packet-processing work-

    loads. Hence, in this paper, we undertake a measurement

    study aimed at exploring these issues. Specifically, we

    focus on the following questions:

    what are the packet-processing bottlenecks in mod-

    ern general-purpose platforms;

    what (hardware or software) architectural changes

    can help remove these bottlenecks;

    do the current technology trends for general-

    purpose platforms favor packet processing?

    As we shall see, answering these seemingly straight-

forward questions requires a surprising amount of

    sleuthing. Modern processors and operating systems are

    both beasts of great complexity. And while current hard-

    ware/software offer extensive hooks for measurement

and system profiling, these can be equally overwhelm-

    ing. For example, current x86 processors have over 400

    performance counters that can be programmed for de-

    tailed tracing of everything from branch mispredictions

to I/O data transactions. It's thus easy (as we discovered)

    to sink in a morass of performance monitoring data. Part

    of our contribution is thus a methodology by which to go

    about such an evaluation. Our study adopts a top-down

    approach in which we start with black-box testing and

    then recursively identify and drill down into only those

aspects of the overall system that merit further scrutiny.

Finally, it is important to note that even though our

    study stemmed from an interest in programmable net-

    work infrastructure, our findings are relevant to more

    than just the network context. Packet processing is just

    one instance of a more general class of stream based ap-

    plications (such as real time video delivery, stock trading,

    etc.). Our findings apply equally to these too.

    The remainder of this paper is organized as follows.

    We start the paper in Section 2 with some high-level anal-

    ysis estimating upper bounds on the packet processing

    performance for different server architectures. Section 3

    follows this with a measurement study aimed at iden-

tifying the bottlenecks and overheads on these servers. We present the inferences from our measurement study

    in Section 4 and discuss potential improvements in Sec-

    tion 5. We discuss related work in Section 6 and finally

    conclude.

    2 Optimistic Back-of-the-Envelope Analysis

    Before delving into experimentation, we would like to

    calibrate our expectations. We thus start with a simple

    thought experiment aimed at estimating absolute upper

bounds on the packet forwarding performance of both existing and next-generation servers. Since our goal is

    quick calibration, our reasoning here is deliberately both

    coarse-grained and optimistic; the experimental results

    that follow will show where reality lies.

    Figures 1 and 2 present a high-level view of two server

    architectures: Fig.1 depicts a traditional shared-bus ar-

    chitecture used in current x86 servers [3], while Fig.2

    represents a point-to-point architecture as will be sup-

    ported by the next-generation of x86 servers [8].

    In the shared-bus architecture, communication be-

    tween the CPUs, memory, and I/O is routed through the

    chipset that includes the memory and I/O bus con-

trollers. There are three main system buses in this architecture. The front side bus (FSB) is used for communi-

    cation both between different CPUs and between a CPU1

    and the chipset. The PCIe bus connects I/O devices, in-

    cluding network interfaces, to the chipset via one or more

    high-speed serial channels known as lanes and, finally,

    the memory bus connects the memory and chipset.

    1In this paper we will use the terms CPU, socket and processor in-

    terchangeably to refer to a multi-core processor.


Figure 1: Traditional shared bus architecture.

Figure 2: Point-to-point architecture.

    The point-to-point server (Fig.2) represents two sig-

    nificant architectural changes relative to the above: first,

    the FSB is replaced by a mesh of dedicated point-to-point

    links thus removing a potential bottleneck for inter-CPU

    communication. Second, the point-to-point architecture

    replaces the single external memory controller shared

    across CPUs with a memory controller integrated within

each CPU; this leads to a dramatic increase in aggregate memory bandwidth, since each CPU now has a dedicated

    link to a portion of the overall memory space. Servers

based on such point-to-point architectures and with up to

    32 cores (4 sockets and 8 cores/socket) are due to emerge

    in the near future [10].

To estimate a server's packet-forwarding capability,

    we consider the following minimal set of operations typ-

    ically required to forward an incoming packet and the

    corresponding load they impose on each of the primary-

    system components:

    1. The incoming packet is DMA-ed from the network

card (NIC) to main memory (incurring one transaction on the PCIe and memory bus).

    2. The CPU reads the packet header (one transaction

    on the FSB and memory bus).

    3. The CPU performs any necessary packet processing

    (CPU-only, assuming no bus transactions).

    4. The CPU writes the modified packet header to

    memory (one transaction on the memory bus and

    FSB).

5. The packet is DMA-ed from memory to NIC (one transaction on the memory and PCIe bus).

    Figures 1 and 2 also show the manner in which each of

    these operations maps onto the various system buses for

    the architecture in question. As we see, for the shared-

    bus architecture, a single packet results in 4 transactions

    on the memory bus and 2 on each of the FSB and PCIe

    buses; thus, a line rate of R bps leads to (roughly) a load

    of 4R, 2R, and 2R on each of the memory, FSB, and

    PCIe buses.2 Currently available technology advertises

    memory, FSB, and PCIe bandwidths of approximately

    100Gbps, 85Gbps, and 64Gbps respectively (assuming

    DDR2 SDRAM at 800MHz, a 64-bit wide 1.33GHz

    FSB, and 32-lane PCIe1.1); these numbers suggest that

    a current shared-bus architecture could sustain line rates

    up to R = 25 Gb/s.

    For the point-to-point architecture, each packet con-

    tributes 4 memory-bus transactions, 4 transactions on the

    inter-socket point-to-point links, and 2 PCIe transactions;

    since we have 4 memory buses, 6 inter-socket links and

    4 PCIe links, assuming uniform load distribution across

the system, a line rate of R bps yields loads of R, 2R/3, and R/2 on each of the memory, inter-socket, and PCIe buses respectively. If we (conservatively) assume simi-

    lar technology constants as before (memory, inter-socket,

    and PCIe bandwidths at 100Gbps, 85Gbps, and 64Gbps

    respectively) this suggests a point-to-point architecture

    could scale to line rates of 40Gb/s and even higher.
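To make this back-of-the-envelope arithmetic concrete, the following minimal C sketch (our illustration, not code from the paper) computes the sustainable line rate as the minimum over buses of bandwidth divided by per-packet load factor, using the nominal bandwidths and transaction counts quoted above; it reproduces the 25Gb/s shared-bus bound and shows the point-to-point bound comfortably clearing 40Gb/s.

/* Back-of-envelope bound: R <= min_i(bandwidth_i / load_factor_i), where
 * load_factor_i is how many multiples of R each bus carries per packet.
 * Bandwidths are the nominal figures quoted in the text. */
#include <stdio.h>

struct bus { const char *name; double bandwidth_gbps; double load_factor; };

static double max_line_rate(const struct bus *buses, int n) {
    double r = 1e12;
    for (int i = 0; i < n; i++) {
        double bound = buses[i].bandwidth_gbps / buses[i].load_factor;
        if (bound < r) r = bound;
    }
    return r;
}

int main(void) {
    /* Shared-bus: 4R on memory, 2R on FSB, 2R on PCIe. */
    struct bus shared[] = {
        { "memory", 100.0, 4.0 }, { "FSB", 85.0, 2.0 }, { "PCIe", 64.0, 2.0 },
    };
    /* Point-to-point: R on memory, 2R/3 on inter-socket links, R/2 on PCIe. */
    struct bus p2p[] = {
        { "memory", 100.0, 1.0 }, { "inter-socket", 85.0, 2.0 / 3.0 }, { "PCIe", 64.0, 0.5 },
    };
    printf("shared-bus bound: %.1f Gb/s\n", max_line_rate(shared, 3));   /* 25.0 */
    printf("point-to-point bound: %.1f Gb/s\n", max_line_rate(p2p, 3));  /* 100.0, i.e. > 40 */
    return 0;
}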

In terms of CPU resources: If we assume min-sized packets of 40 bytes, then the packet interarrival time is
32ns for speeds of R = 10Gb/s and 8ns for R = 40Gb/s. For the shared-bus server with 8 CPUs, each with a speed

    of 3 GHz (available today), this implies a budget of

    3072 and 768 cycles/pkt for line rates of 10Gbps and

    40Gbps respectively. Assuming a cycles-per-instruction

    (CPI) ratio of 1, this suggests a budget of 3072 (768) in-

    structions per packet for line rates of 10Gb/s (40Gb/s).

    With 32 cores at similar speeds, the point-to-point server

    would see a budget of 12288 and 3072 instructions/pkt

    for 10Gb/s and 40Gb/s respectively.

    In summary, based on the above, one might conclude

that current shared-bus architectures may scale to 25Gb/s but not 40Gb/s, while emerging servers may scale even

    to 40Gb/s.

    2This estimate assumes that the entire packet (rather than the

    header) is read to/from the memory and CPU for packet processing.

    A more accurate estimate would account for packet header sizes (or

    cache line sizes if smaller than header lengths). We ignore this here

    since our tests in the following section consider only min-sized packets

of 64 bytes, equal to a cache line length, because of which the inaccuracy is of little relevance.


    3 Measurement-based Analysis

    We now turn to experimentation. We first de-

    scribe our experimental setup and then present the

    packet-forwarding rates achieved by unmodified soft-

    ware/hardware.

    Experimental Setup For our experiments, we use a

    mid-level server machine running SMP Click [18]. Our

server is a dual-socket machine with 1.6GHz quad-core CPUs, an

    L2 cache of 4MB, two 1.066GHz FSBs (one to each

    socket) and 8 GBytes of DDR2-667 SDRAM. With the

    exception of the CPU speeds, these ratings are similar to

    the shared-bus architecture from Figure 1 and, hence, our

    results should be comparable. The machine has a total of

    16 1GigE NICs. To source/sink traffic, we use two addi-

    tional servers each of which is connected to 8 of the 16

    NICs on our test machine. We generate (and terminate)

    traffic using similar servers with 8 GigE NICs. We instru-

    ment our servers with Intel EMON, a performance mon-

    itoring tool similar to Intel VTune, as well as a chipset-

    specific tool that allows us to monitor memory-bus us-

    age.3

    The forwarding rates achieved will depend on the na-

    ture of the traffic workload. To a first approximation,

    this workload can be characterized by: (1) the incom-

    ing packet arrival rate r, measured in packets/sec, (2) the

    size of packets, P, measured in bytes (hence the incom-

ing rate R = rP) and (3) the processing per packet. We focus on evaluating the fundamental capability of the

    system to move packets through and, hence, start by con-

    sidering only the first two factors (packet rate and size)

    without considering any sophisticated packet processing.

    Hence, we remove the IP routing components from our

    Click configuration and only implement simple forward-

    ing that enforces a route between source and destination

    NICs; i.e., packets arriving on NIC #0 are sent to NIC

    #1, NIC #2 to NIC #3 and so on. We have 16 NICs and,

    hence, use 8 kernel threads, each pinned to one core and

    each in charge of one input/output NIC pair. In the re-

    sults that follow, where the input rate to the system is un-

    der 8Gbps, we use one of our traffic generation servers

    as the source and the other as sink; for tests that require

    higher traffic rates each server acts as both source and

    sink allowing us to generate input traffic up to 16Gbps.

    Measured performance We start by looking at the

    loss-free forwarding rate the server can sustain (i.e.,

without dropping packets) under increasing input packet rates and for various packet sizes.

3 Although our tools are proprietary, many of the measures they report are derived from public performance counters and, in these cases, our tests are reproducible. In an extended technical report, we will present in detail how our measures, when possible, can be derived from the public performance counters available on x86 processors.

Figure 3: Forwarding rate under increasing load for different packet sizes.

We plot this sustained

    rate in terms of both bits-per-second (bps) and packets-

    per-second (pps) in Figures 3 and 4 respectively. We see

    that, in the case of larger packet sizes (1024 bytes and

    higher), the server scales to 14.9 Gbps and can keep up

    with the offered load up to the maximum traffic we can

    generate given the number of slots on the server; i.e.,

packet forwarding isn't limited by any bottleneck inside

    the server. However, in the case of 64 byte packets, we

    see that performance saturates at around 3.4 Gbps, or 6.4

    million pps. As Figure 4 suggests, the server is troubled

by the high input packet rate (pps) rather than bit rate (bps). Note that the case of 64 byte packets is the worst-

    case traffic scenario. Though unlikely in reality, it covers

    an important role as it is considered the reference bench-

mark by network equipment manufacturers.

    Relative to the back-of-the-envelope estimates we ar-

    rived at in the previous section, we can conclude that,

    while our server approaches the estimated rates for larger

    packet sizes, for small packets, the achievable rates are

    well below our estimates. At a high-level, our reasoning

    could have been wildly off-target for two reasons: (1)

    in assuming that the nominal/advertised rates for each

system component (PCIe, memory, FSB) are attainable in practice and/or (2) in our estimates of the overhead

    per packet (4x, 2x, etc.). In what follows, we look into

    each of these possibilities. In Section 3.1 we attempt to

    track down the bottleneck(s) that limit(s) the forwarding

    rate for small packets and, in so doing, estimate attain-

    able performance limits for the different system compo-

    nents. In Section 3.2 we take a closer look at the per-

    packet overheads and attempt to deconstruct these into

    their component causes.

Figure 4: Forwarding rate under increasing load for different packet sizes in pps.

    3.1 Bottleneck Analysis

    We look for the bottleneck through a process of elimi-

    nation, starting with the four major system components

discussed earlier (the CPUs and the three system buses),

    and drilling deeper as and when it appears warranted.

    CPU The CPUs are plausible candidates, since CPU

    processing depends on the incoming packet rate, and

    performance saturates as soon as we reach a specific

    packet rate (the same for 64-byte and 128-byte packets,

    as shown in Figure 4). Note that the traditional metric

    of CPU utilization reveals little here, because Click op-

    erates in a pure polling mode, where the CPUs are al-

    ways 100% utilized. Instead, we look at the number of

empty polls, i.e., the number of times the CPU polls

    for packets to process but none are available in memory.

    Our measurements reveal that, even at the saturation rate

    (3.4Gbps for 64-byte packets), we still see a non-trivial

number of empty polls: approximately 62,000 per sec-

    ond for each core. Hence, we eliminate CPU processing

    as a candidate bottleneck.
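As an illustration of the empty-poll metric, here is a small sketch (ours, not Click code): in a pure polling design the CPU never idles, so headroom shows up as polls that find no packet waiting. The RX ring is simulated here by a random arrival process; in the real system the poll would inspect the NIC's descriptor ring in memory.

#include <stdio.h>
#include <stdlib.h>

/* Stand-in for "is a packet waiting in memory?"; here ~70% of polls succeed. */
static int poll_rx_ring(void) {
    return rand() % 100 < 70;
}

int main(void) {
    long long polls, empty_polls = 0;
    for (polls = 0; polls < 10000000LL; polls++) {
        if (poll_rx_ring()) {
            /* process_packet() would run here */
        } else {
            empty_polls++;   /* CPU had nothing to do on this poll */
        }
    }
    printf("empty polls: %lld of %lld (%.1f%%)\n",
           empty_polls, polls, 100.0 * empty_polls / polls);
    return 0;
}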

    System buses Our tools allow us to directly measure

    the load in bits/sec on the FSB and memory bus; the

    load difference between these two buses gives us an es-

timate for the PCIe load. Note that this is not always a good estimate, as FSB bandwidth can be consumed by

    inter-socket communication which does not figure on the

    memory bus; however, it does make sense in our par-

    ticular setup (with each input/output port pair consis-

    tently served by the same socket) which yields little inter-

socket communication.

Figure 5: Bus bandwidths for 64-byte packets.

Figure 6: Bus bandwidths for 1024-byte packets.

Figures 5 and 6 plot the load on each of the FSB, memory, and PCIe buses for 64-byte and 1024-byte packets under increasing input rates. We see that, for any particular line rate, the load on all three buses is always higher with 64-byte packets than with

    1024-byte ones. Hence, any of the buses could be the

    bottleneck, and we proceed to examine each one more

    closely.

    FSB Under the covers, the FSB consists of separate

    data and address buses, and our tools allow us to sep-

    arately measure the utilization of each. The results are

    shown in Figures 7 and 8: while it is clear that the

    data bus is under-utilized, it is not immediately obvious

whether this is the case for the address bus as well. To gauge the maximum attainable utilization on each bus, we

    wrote a simple benchmark program (we will refer to it as

    the stream benchmark from now on) that creates and

    writes to a very large array. This benchmark consumes

    50 Gbps of FSB bandwidth that translate into 37% data-

    bus utilization and 74% address-bus utilization. These

    numbers are well above the utilization levels from our

    packet-forwarding workload, which means that the latter

does not saturate the FSB. Hence, we conclude that the FSB is not the bottleneck.

Figure 7: FSB data and address bus utilization for 64-byte packets.

Figure 8: FSB data and address bus utilization for 1024-byte packets.
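For concreteness, here is a minimal sketch of a stream-style write benchmark of the kind described above (our reconstruction, not the authors' tool; the array size and element type are illustrative): it allocates a large array and writes it end to end, forever, so the FSB and memory bus see a steady stream of sequential stores whose utilization can be read off with the monitoring tools.

#include <stdint.h>
#include <stdlib.h>

#define ARRAY_BYTES (1UL << 30)   /* 1 GB: large enough to defeat the caches */

int main(void) {
    size_t n = ARRAY_BYTES / sizeof(uint64_t);
    volatile uint64_t *a = malloc(ARRAY_BYTES);
    if (!a) return 1;
    for (uint64_t iter = 0;; iter++)      /* run until killed; bus utilization */
        for (size_t i = 0; i < n; i++)    /* is measured externally (EMON)     */
            a[i] = iter;
    return 0;                             /* not reached */
}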

    PCIe Unlike the FSB where all operations are in fixed-

size units (64 bytes, a cache line), the PCIe bus sup-

    ports variable-length transfers; hence, if the PCIe bus is

    the bottleneck, this could be due either to the incoming

    bit rate or to the requested operation rate (which depends

    on the incoming packet rate). To test the former, we sim-

ply look at the maximum bit rate that the PCIe bus has sustained; from Figures 5 and 6, we see that the maxi-

    mum PCIe load for 1024-byte packets exceeds the PCIe

    load recorded at saturation for 64-byte packets. Hence,

    the PCIe bit rate is not the problem.

Figure 9: Forwarding rate (top) and PCIe load (bottom) for 64-byte packets as a function of the number of cores.

To test whether the PCIe operation rate is the bottleneck, we measure the maximum packet rate that can be sustained by each individual PCIe lane. Our rationale is the following: given that PCIe lanes are independent from each other, if we can successfully drive the packet rate on a single lane beyond the per-lane rate recorded at saturation, then we will have shown that our packet-forwarding workload does not saturate the PCIe lanes

    and, hence, packet rate is not the problem. To this end, we

    start with a single pair of input/output ports pinned to a

    single core and gradually add ports and cores. The results

    are shown in Figure 9, where we plot both the sustained

    forwarding rate and the PCIe load: we already know that

    for 64-byte packets, at saturation, each input/output port

    pair sustains approximately 0.4Gbps (Figure 3); from

    Figure 9, we see that each individual port pair (and,

    hence, the corresponding PCIe lanes) can go well beyond

    that rate (approx. 0.75Gbps). Hence, we conclude that

    the PCIe bus is not the bottleneck either.

    Memory This leaves us with the memory bus as the

    only potential culprit. To estimate the maximum attain-

    able memory bandwidth, we use the stream benchmark

    described above, which consumes 51Gbps of memory-

    bus bandwidth. This is about 35% higher than the

    33Gbps maximum consumed by our 64-byte packet-

    forwarding workload, surprisingly suggesting that aggre-

    gate memory bandwidth is not the bottleneck either.4

    This would seem to return us to square one. However,

    memory-system performance is notoriously sensitive to

    details like access patterns and load balancing; hence, we

look further into these details.

We consider two potential reasons why our packet-

forwarding workload might reduce memory-system efficiency relative to the stream benchmark.

4 Note that even 51Gbps is fairly low relative to the nominal rating of 100Gbps we used in estimating upper bounds. It turns out this limit is due to saturation of the address bus; recall that the address-bus utilization is 74% for the stream test; prior work [24] and discussions with architects reveal that an address bus is regarded as saturated at approximately 75% utilization. This is in keeping with the general perception that, in a shared-bus architecture, the vast majority of applications are bottlenecked on the FSB.

Figure 10: Memory load distribution across banks and ranks. Left: 64-byte packets. Middle: 1024-byte packets. Right: the stream benchmark.

The first is the

    fact that the sequence of memory locations accessed due

to our workload is highly irregular, as opposed to the

    nicely in-sequence access pattern of the stream bench-

    mark. To assess the impact of irregular accesses, we re-

    run the stream benchmark but, instead of writing to each

    array entry in sequence, we write to random locations.

    This modification does cause a drop in memory band-

    width, but the drop is modest (from 51Gbps to about

    46Gbps), indicating that irregular accesses are not the

    problem.

    The second reason is sub-optimal use of the physical

    memory space: The memory system is internally orga-

    nized as multiple memory channels (or branches) each

    of which is organized as a grid of ranks and banks. In

    particular, the 8GB memory on our machine consists of

two memory channels, each of which comprises a grid of 4 banks × 4 ranks; our tools report the memory traf-

    fic to different rank/bank pairs aggregated across mem-

    ory channels; i.e., the memory traffic we report for (say)

    a pair (bank 1, rank3) is the sum of the traffic seen on

    (bank 1, rank 3) for each of the two memory channels.

    Figure 10 shows the distribution of memory traffic over

    the various ranks and banks for three workloads: (1) 64-

    byte packets at the saturation rate of 3.4Gbps, (2) 1024-

    byte packets at 15.2Gbps, and (3) the stream benchmark.

    Notice that, while memory traffic is perfectly balanced

    for the stream benchmark (and reasonably balanced for

    the 1024-byte packet workload), for the 64-byte packet

    workload, it is all concentrated on two rank/bank ele-

    ments (in reality, we see one overloaded rank-bank pair

on each channel, since the figure shows the aggregate

    load over the two channels).

    This result suggests that the bottleneck is not the ag-

    gregate memory bandwidth, but the bandwidth to the in-

    dividual rank/bank elements that, for some reason, end

    up carrying most of the 64-byte packet workload. To ver-

    ify this, we measure the maximum attainable bandwidth

    to a single rank/bank pair; we do this through a sim-

    ple test that creates multiple threads, all of which con-

tinuously read and write a single location in memory.

    The result is 7.2Gbps of memory traffic (all on a sin-

    gle rank/bank pair), which is almost equal to the max-

    imum per-rank/bank load recorded at saturation for the

    64-byte packet workload. We should note that both the

    CPUs and the FSB are under-utilized during this mem-

    ory test. Hence, we conclude that the bottleneck is the

    memory system, not because it lacks the necessary ca-

    pacity, but because of the imbalance in accessed memory

    locations.
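A minimal sketch of such a single-location test (ours; this is presumably the "volatile-int" test referred to in Table 1, with the thread count an illustrative choice): several threads hammer one volatile word with reads and writes so that, unlike the stream benchmark, all traffic targets a single rank/bank pair. As before, the resulting bus load is measured externally.

#include <pthread.h>
#include <stdint.h>

#define NTHREADS 8

static volatile uint64_t shared_word;     /* the single contended location */

static void *hammer(void *arg) {
    (void)arg;
    for (;;) {                            /* run until killed */
        uint64_t v = shared_word;         /* read ...              */
        shared_word = v + 1;              /* ... and write it back */
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, hammer, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}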

    We now look into why this imbalance takes place.

    We see from Figure 10 that, for 1024-byte packets, the

    memory load is much better distributed than for 64-

    byte packets. This leads us to suspect that the imbal-

ance is related to the manner in which packets are laid out onto the rank/bank grid. We test this with an exper-

    iment where we maintain a fixed packet rate (400,000

    packets/sec) and measure the resulting memory-load dis-

    tribution for different packet sizes (64 to 1500 bytes).

    Figure 11 shows the outcome: Ignoring for the moment

    the load on {bank 2, rank 3}, we observe that, as the

    packet size increases, the additional memory load is dis-

    tributed over increasing numbers of rank/bank pairs and,

    within a single memory channel, this spilling over to

    additional ranks and banks happens at the granularity of

    64 bytes; for example, for 256-byte packets, we see in-

    creased load on 3 rank/bank pairs, for 512-byte packets

on 4 rank/bank pairs, and so forth. Moreover, we observe that this growth starts from the low-ordered banks, i.e.,

    bank 1 is loaded first, then bank 2 and so on.5

5 Regarding the high load on {bank 2, rank 3}: we suspect this is caused by the large number of empty polls that we see at the low packet rate for this test, and that the location corresponds to the memory-mapped I/O register being polled. We find the load on this rank/bank drops with increasing packet rates, further supporting this conjecture.

Figure 11: Memory load distribution across banks and ranks for different packet sizes and a fixed packet rate.

These observations lead us to the following theory: the default packet-buffer size in Linux is 2KB; each such buffer spans the entire rank/bank grid, which would al-

    low high memory throughput if we were using the entire

    2KB allocation. However, our 64-byte packet workload

    ends up using only one of the rank/bank pairs on each

    of the memory channels, leading to the two spikes we

    see in Figure 10. To test this theory, we repeat our ear-

    lier experiment with 64-byte packets from Figure 10, but

    now change the default buffer allocation size to 1KB.

    If our theory is right and a 2KB address space spans

    the entire grid, then 1KB should span half the grid and,

    hence, the two spikes in Figure 10 should split into 4

    spikes. Figure 12 shows that this is indeed the case. Un-

    fortunately (for some reason we do not fully understand

as yet), we have not been able to allocate yet smaller buffer sizes (e.g., 128B), due to the need for the device driver to accommodate additional data structures,

    and hence we do not experiment with even smaller al-

    locations. Nonetheless, our experiment with 1024-byte

buffers clearly shows the cause of (and a potential remedy for)
the problem of skewed memory load. As we discuss in

    Section 5, we believe this issue could be fixed in a gen-

    eral manner through the use of a modified memory allo-

    cator that allows for variable-size buffer allocations.

    Finally, if our conjecture that this imbalance was

the performance bottleneck was right, then reducing the
imbalance should translate to higher packet-forwarding
rates. Happily, using 1024B buffers we do see a 29.5% increase in forwarding rate from 3.4Gbps to 4.4Gbps;

    Figure 13 shows this improvement in terms of the packet

    rate (from 6.4 to 8.2 Mpps).

Figure 12: Memory load distribution across banks and ranks for 64B packets and two different sizes of packet buffers (original Click w/ 2048-byte buffers vs. modified Click w/ 1024-byte buffers).

Figure 13: Before-and-after forwarding rates for 64B packets and two different sizes of packet buffers.

system component     attainable limit (Gbps)      load w/ 64B router (Gbps)    percentage room-to-grow
1 rank-bank          7.2 (from volatile-int)      7.168                        0
FSB address bus      74 (from stream)             50                           48
aggregate memory     51 (from stream)             33                           54
PCIe                 36 (from 1KB pkt tests)      20                           80
FSB data bus         37 (from stream)             9                            311

Table 1: Room for growth on each of the system components, computed as the percentage increase in measured usage for the 64B packet forwarding workload that can be accommodated before we hit achievable performance limits as obtained from specially crafted benchmark tests.

Summary of bottleneck analysis The presented experiments showed what rates are achievable on each system component for hand-crafted workloads like our stream benchmark. We use these rates as re-calibrated

    upper bounds on the performance of each component

    and compare them to the corresponding rates measured

    for the 64-byte packet workload at saturation. To quan-

    tify our observations, we define, for each component,

    the room for growth as the percentage increase in

usage that could be accommodated on the component before we hit the upper bound. For example, for the

    stream benchmark, we measured 51Gbps of maximum

    aggregate memory bandwidth; for our 64-byte packet

    workload, at saturation, we measured 33Gbps of aggre-

    gate memory bandwidth; thus, ignoring other bottlenecks

    (such as the per rank/bank load), there is room to increase

memory-bus usage by about 54% ((51 - 33)/33) before hitting the 51Gbps upper bound. Table 1 summarizes our

    results. We see that, if we can eliminate the problem of

    poor memory allocation (we discuss potential solutions

    in Section 5), then there is room for a fairly substantial

improvement in the minimum-sized packet forwarding
rate: approximately 50%. The next section looks for additional sources of inefficiency, this time due to soft-
ware overheads.

    3.2 Overhead Analysis

    The previous section treated the system as a black box,

    measuring the load on each bus but making no attempt

    to justify it. We now try to deconstruct the measured

    load into its components, as a way to assess the packet-

    forwarding efficiency of our system.

    First, we adjust the back-of-the-envelope analysis of

Section 2 to our particular experimental setup and use it to estimate the expected load on each bus: In Section 2,

    we argued that an incoming traffic rate of R bps should

roughly lead to loads of 2R, 2R, and 4R on the FSB, PCIe,

    and memory bus respectively. These numbers were based

    on two assumptions: first, that bus loads are only due to

    moving packets around; second, that the CPU reads and

    updates each incoming packet, thus contributing to FSB

    and memory-bus load. The second assumption does not

    hold in our experiments, because we use static routing,

where the CPU does not even need to read packet headers to determine the output port.

Figure 14: Memory and FSB per-packet overhead.

Hence, with the optimistic

    reasoning of Section 2, in our experiments, an incoming

    traffic rate of R bps should roughly result in loads of 0,

    2R, and 2R on the FSB, PCIe, and memory bus, all of

    them due to moving packets from NIC to memory and

    back.

    Not surprisingly, these estimates are below the loads

    that we actually measure, indicating that, beyond moving

    packets around, all three buses incur an extra per-packet

    overhead. We quantify this overhead as the number of

    extra per-packet transactions (i.e., transactions that are

    not due to moving packets between NIC and memory)

    performed on each bus. We compute it as follows:

(measured load - estimated load) / (packet rate × transaction size)
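In code, this is a one-line computation; the sketch below (ours) uses purely illustrative inputs, chosen only to show the units involved (bits/s for loads, packets/s for the rate, bits per bus transaction).

#include <stdio.h>

/* Extra per-packet transactions = (measured - estimated load) / (pkt rate * txn size). */
static double extra_transactions_per_packet(double measured_load_bps,
                                            double estimated_load_bps,
                                            double packet_rate_pps,
                                            double transaction_size_bits) {
    return (measured_load_bps - estimated_load_bps) /
           (packet_rate_pps * transaction_size_bits);
}

int main(void) {
    /* Hypothetical memory-bus numbers: 24 Gbps measured vs. 6 Gbps expected
     * at 3 Mpps with 64-byte (512-bit) transactions. */
    printf("extra transactions per packet: %.1f\n",
           extra_transactions_per_packet(24e9, 6e9, 3e6, 512.0));   /* ~11.7 */
    return 0;
}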

    Figure 14 plots this number for the FSB and memory

    bus as a function of the packet rate and size; the PCIe

    overhead is simply the difference between the other two.

    So, the FSB and PCIe overheads start around 6, while

    the memory-bus overhead starts around 12; all overheads

    slightly drop as the packet rate increases.

    It turns out that these overheads make sense once we

    consider the transactions for book-keeping socket-buffer


    descriptors: For each packet transfer from NIC to mem-

    ory, there are three such transactions on each of the

    FSB and PCIe bus: the NIC updates the corresponding

    socket-buffer descriptor, as well as the descriptor ring

    (two PCIe and memory-bus transactions); the CPU reads

    the updated descriptor, writes a new (empty) descriptor

to memory, and updates the descriptor ring accordingly (three FSB and memory-bus transactions); finally, the

    NIC reads the new descriptor (one PCIe and memory-

    bus transaction). Each packet transfer from memory to

    NIC involves similar transactions and, hence, descriptor

    book-keeping accounts for the 6 extra per-packet transac-

tions we measure on the FSB and PCIe bus and, hence,

    the 12 extra transactions measured on the memory bus.

    The slight overhead drop as the packet rate increases is

    due to the cache that optimizes the transfer of multi-

    ple (up to four) descriptors with each 64-byte transac-

    tion (each descriptor is 16-bytes long); this optimization

    kicks in more often at higher packet rates.

We should note that these extra per-packet transactions translate into surprisingly high traffic overheads,

    especially for small packets: for 1024-byte packets, 12

    per-packet transactions on the memory bus translate into

    37.5% traffic overhead; for 64-byte packets, this num-

    ber becomes 600%. As we discuss in Section 5, these

    overheads can be reduced by amortizing descriptors over

    multiple packets whenever possible (similar techniques

    are already common in high-speed capture cards).

    4 Inferring Server Potential

We now apply our findings from the last two sections to answer the following questions:

    Given our analysis of current-server packet-

    forwarding performance, what can we improve and

    what levels of performance are attainable?

    What packet-forwarding performance should we

    expect from next-generation servers?

    We answer these through extrapolative analysis and leave

    validation to future work.

    4.1 Shared-bus Architectures

    In Section 3.1, we saw that the first packet-forwarding

    bottleneck arises from inefficient use of the memory

    system, in particular, the imbalanced layout of packets

    across memory ranks and banks. The question is, how

    much could we improve performance by fixing this im-

    balance?

    According to our overhead analysis (Section 3.2), per-

    packet overhead on the memory bus does not increase

    with packet rate; hence, if we eliminated the problem-

    atic packet layout, we could increase our forwarding rate

    until we hit the next bottleneck. According to our bottle-

    neck analysis (Section 3.1), that is the FSB address bus,

    and we could increase our forwarding rate by 50% be-

    fore hitting it. Hence, we argue that eliminating the prob-

lematic layout could increase our minimum-size-packet forwarding rate by 50%, i.e., from 3.4Gbps to approxi-

    mately 5.1Gbps.

    A second area for improvement, identified in Sec-

    tion 3.2, is the use of socket-buffer descriptors and the

    chatty manner in which these are maintained. We now

    estimate how much we could improve performance by

    simply amortizing descriptor transfer across multiple

    packet transfers.

    We start by considering the memory bus. From Sec-

    tion 3.2, we can approximate the load on the memory bus

as 2 × bit rate + 10 × packet rate × transaction size. Were we to transfer, say, 10 descriptors with a single transaction, that would immediately reduce memory-bus load to 2 × bit rate + packet rate × transaction size; for 64-byte packets and 64-byte transactions, this corresponds to a

    factor-of-4 reduction. Applying a similar line of reason-

    ing to the FSB and PCIe bus, we can show that, for 64-

    byte packets, descriptor amortization stands to reduce the

    load on each bus by factors of 10 and 2.5. Recall from Ta-

    ble 1 that we had 0%, 50% and 80% room for growth on

    each of the memory, FSB, and PCIe buses, and, hence,

    the load on each of these buses could grow by a factor

of 4 (4 × 1.0), 15 (10 × 1.5), and 4.5 (2.5 × 1.8) re-

    spectively. Since the maximum improvement that can be

    accommodated on all buses is by a factor of 4, we argue

    that reducing descriptor-related overheads could improve

    our minimum-size-packet forwarding rate from 3.4Gbps

    to 13.6Gbps.
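The following sketch (ours) simply replays this arithmetic: each bus can absorb its load-reduction factor times (1 + room-to-grow) more minimum-size packets, and the forwarding rate scales by the minimum over all buses.

#include <stdio.h>

struct bus { const char *name; double load_reduction; double room_to_grow; };

int main(void) {
    /* Load reductions from the descriptor-amortization argument above;
     * room-to-grow of 0%, 50%, 80% as used in the text (cf. Table 1). */
    struct bus buses[] = {
        { "memory", 4.0, 0.0 },
        { "FSB",    10.0, 0.5 },
        { "PCIe",   2.5, 0.8 },
    };
    double min_growth = 1e12;
    for (int i = 0; i < 3; i++) {
        double growth = buses[i].load_reduction * (1.0 + buses[i].room_to_grow);
        printf("%-6s can absorb %.1fx the current packet rate\n",
               buses[i].name, growth);                         /* 4.0, 15.0, 4.5 */
        if (growth < min_growth) min_growth = growth;
    }
    printf("forwarding rate: 3.4 Gbps -> %.1f Gbps\n", 3.4 * min_growth);  /* 13.6 */
    return 0;
}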

    Finally, combining both optimizations should allow us

    to climb still higher to approximately 20Gbps, though,

    of course, the limited number of network slots on our

    machine would limit us to 16Gbps.

    While the above is an extrapolation (albeit one de-

    rived from empirical observations), it nonetheless points

    to the tremendous untapped potential of current servers.

Even if our estimates are off by a factor of two, it still
seems possible that current servers can achieve forward-
ing rates of 10Gbps, a number currently considered the realm of specialized (and expensive) equipment. We

    close by noting that the suggested fixes involve only

    modest operating-system rearchitecting.

    4.2 Point-to-point Architectures

    We now apply the results of our measurement study to es-

    timate the packet forwarding rates that might be achiev-

    able with point-to-point (p2p) server architectures as in-


    troduced in Section 2 (Figure 2). At times, we make as-

    sumptions where the necessary details of p2p architec-

tures aren't yet known and, in such cases, we explicitly

    note our assumptions as such.

In a p2p architecture, the role of the FSB (that of
carrying traffic between sockets and between CPUs and
their non-local memory) is played by point-to-point links such as Intel's QuickPath [5]. In our analysis, we

    assume our findings from the FSB apply to these inter-

    socket buses. Specifically, we assume that the 50% room-

    to-grow that we measured on the FSB applies to these

    inter-socket buses. If anything, this seems like a wildly

    conservative assumption for two reasons: (1) the nomi-

    nal speed of these inter-socket links is 200Gbps [2], com-

    pared to 85Gbps for current FSBs and (2) the operations

    seen on the single FSB are now spread across six inter-

    socket links.

    We compute the expected performance for the p2p

    architecture by considering the different factors that

will offer a performance improvement relative to the shared-bus server we've studied so far. These factors are:

    (1) reduced per-bus overheads: This improvement

    results simply due to the transition from a shared-bus

    to a peer-to-peer architecture as discussed in Section 2.

    These overheads and the corresponding reduction are

    summarized in the first three columns of Table 2.6

    (2) room-to-grow: As before this records the capacity

    for growth on each bus. For this we use our findings

    from Section 3.1.

    (3) technology improvements: This accounts for the

    standard technology improvements expected in this next-

    generation of servers. We assume a 2x improvement on

    the FSB and PCIe buses by observing that (for exam-

    ple) the Intel QuickPath inter-socket links for use in the

Nehalem server line support speeds that are over 2x

    faster than current FSBs. Likewise, the PCIe-2.0 runs 2x

    faster than current PCIe-1.1 and the recently announced

    PCIe-3.0 is to run at 2x the speed of PCIe-2.0 [20] (our

    test server uses PCIe-1.1). We conservatively assume that

    memory technology will not improve.

    Table 2 summarizes these performance factors and

computes the combined performance improvement that we can expect on each system component. As we see,

the overall performance improvement is still limited by memory (both because we're assuming that the rank-bank imbalance problem remains and that memory technology improves more slowly). Despite this, we're left with a 4x improvement, suggesting that a next-generation p2p server running unmodified Linux+Click will scale to approximately 13.6Gbps. The additional use of the optimizations described above could further improve performance to potentially exceed 40Gbps.

6 Note that, while Section 3.2 revealed that the overheads we see in practice are far higher than those from our analysis, we're assuming that the relative reduction across architectures will still hold. This appears reasonable since this reduction is entirely due to the offered load being split across more system components: 6 vs. 1 inter-socket buses, 4 vs. 1 memory buses, and 4 vs. 1 PCIe buses.

Figure 15: Forwarding rates for shared-bus and p2p server architectures with and without different optimizations.

    Figure 15 summarizes the various forwarding rates for

    the different architectures and optimizations considered.

    In summary, current shared-bus servers scale to (min-

    sized) packet forwarding rates of 3.4Gbps and we esti-

mate future p2p servers will scale to 10Gbps. Moreover,

    our analysis suggests that modifications to eliminate key

    bottlenecks and overheads stand to improve these rates

    to over 10Gbps and 40Gbps respectively.

    5 Recommendations and Discussion

    5.1 Eliminating the Bottlenecks

    We believe the bottlenecks and overheads identified in

    the previous sections can be addressed through relatively

modest changes to operating systems and NIC firmware. Unfortunately, the need to modify NIC firmware makes it

    difficult to experiment with these changes. We describe

    these modifications at a high level and note that these

    modifications do not impact the programmability of the

    system.

system bus    shared-bus overheads (Section 2)    p2p overheads (Section 2)    gain from reduced overheads    room-to-grow (Table 1)    gain w/ tech trends    overall gain
memory        4R                                  R                            4x                             1.0x                      1.0x                   4x
FSB/CSI       2R                                  2R/3                         3x                             1.5x                      2x                     9x
PCIe          2R                                  R/2                          4x                             1.8x                      2x                     14.4x

Table 2: Computing the performance improvement with a p2p server architecture. R denotes the line rate in bits/second.

Improved memory allocators. Recall that our results in Section 3.1 suggest that the imbalance in memory accesses with regard to (skb) packet buffers in the kernel occurs because the kernel allocates all packet buffers to be a single size, with a default of 2KB. This problem can

    be addressed by simply creating packet buffers of various

    sizes (e.g., 64B, 256B, 1024B and 2048B) and allocating

    a packet to the buffer appropriate for its size. This can

    be implemented by simply creating multiple descriptor

    rings, one for each buffer size; on receiving an incom-

    ing packet, the NIC simply uses the descriptor ring ap-

    propriate to the size of the received packet. While more

wasteful of system memory, this isn't an issue since the

    memory requirements for a router workload are a small

    fraction of the available server memory. This approach is

    in fact inspired by similar approaches in hardware routers

    that pre-divide memory space into separate regions for

    use by packets of different sizes [13].
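A minimal sketch of the ring-selection logic (ours; the size classes and names are illustrative, not an actual driver or NIC API): the receive path picks the descriptor ring whose buffer class is the smallest that fits the incoming frame, so a 64-byte packet lands in a 64-byte buffer rather than a 2KB one.

#include <stddef.h>
#include <stdio.h>

static const size_t buf_class[] = { 64, 256, 1024, 2048 };   /* bytes */
#define NUM_CLASSES (sizeof(buf_class) / sizeof(buf_class[0]))

/* Index of the descriptor ring to use for a frame of len bytes, or -1 if it fits none. */
static int pick_ring(size_t len) {
    for (size_t i = 0; i < NUM_CLASSES; i++)
        if (len <= buf_class[i])
            return (int)i;
    return -1;
}

int main(void) {
    size_t samples[] = { 64, 100, 512, 1500 };
    for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
        int r = pick_ring(samples[i]);
        if (r >= 0)
            printf("frame of %4zu bytes -> ring %d (%zu-byte buffers)\n",
                   samples[i], r, buf_class[r]);
    }
    return 0;
}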

    The imbalance due to packet descriptors can be like-

    wise tackled by arranging for packet descriptors to con-

    sume a greater portion of the memory space by, for ex-

    ample, using larger descriptor rings and/or multiple de-

scriptor rings. Conveniently, however, the use of amor-
tized packet descriptors as described below would also
have the effect of greatly reducing the descriptor-related traffic to memory and hence implementing amortized de-

    scriptors might suffice to reduce this problem.

    Amortizing packet descriptors Section 3.2 reveals

    that handling packet descriptors imposes an inordinate

per-packet overhead, particularly for small packet sizes.

    As alluded to earlier, a simple strategy is to have a single

descriptor summarize multiple (up to a parameter k)

    packets. This amortization is similar to what is already

    implemented on capture cards designed for special-

    ized monitoring equipment. Such amortization is easily

accommodated for k smaller than the amount of packet-buffer memory already on the NIC. Since we imagine
that k can be a fairly small number (~10) and since cur-

    rent NICs already have buffer capacity for a fair num-

    ber of packets (e.g., our cards have room for 64 full-

    sized packets), such amortization should not increase the

storage requirements on NICs. Amortization can, how-
ever, impose increased delay. This can be controlled by
having a timeout that regulates the maximum time pe-
riod the NIC can wait to transfer packets. Setting this
timeout to a small multiple (e.g., 2k times) of the reception time for
small packets should keep the delay penalty acceptable.
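A rough sketch of the batching logic (ours; the NIC is simulated with stubs, and the batch size k and timeout are illustrative constants): received packets are accumulated and a single summary descriptor is written back either when k packets have arrived or when the oldest one has waited too long. A real implementation would also need a hardware timer to flush a partial batch when no further packets arrive.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define K_BATCH 10            /* packets summarized per descriptor   */
#define TIMEOUT_NS 1000       /* flush deadline (illustrative value) */

struct pkt_info { uint64_t addr; uint16_t len; };

struct batch {
    struct pkt_info pkts[K_BATCH];
    int count;
    uint64_t first_arrival_ns;
};

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Stub for the single write-back that replaces up to k per-packet descriptor updates. */
static void write_summary_descriptor(const struct pkt_info *pkts, int n) {
    printf("summary descriptor covering %d packets (first addr=0x%llx)\n",
           n, (unsigned long long)pkts[0].addr);
}

static void on_packet_received(struct batch *b, struct pkt_info p) {
    if (b->count == 0)
        b->first_arrival_ns = now_ns();
    b->pkts[b->count++] = p;
    bool full = (b->count == K_BATCH);
    bool timed_out = (now_ns() - b->first_arrival_ns >= TIMEOUT_NS);
    if (full || timed_out) {
        write_summary_descriptor(b->pkts, b->count);
        b->count = 0;
    }
}

int main(void) {
    struct batch b = { .count = 0 };
    for (int i = 0; i < 25; i++)          /* feed 25 fake 64-byte packets */
        on_packet_received(&b, (struct pkt_info){ .addr = 0x1000 + 64u * (unsigned)i, .len = 64 });
    return 0;
}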

    5.2 Discussion

    When we set out to study the forwarding performance

of commodity servers, we already expected the memory system to be the bottleneck; the fact, however, that

    the bottleneck was due to an unfortunate combination of

    packet layout and memory-chip organization came as a

    surprise. While trying to figure this out, we looked at how

    the kernel allocates memory for its structures; not sur-

    prisingly, it favors adjacent memory addresses to lever-

    age caching. However, given that the kernel uses physical

    addresses, nearby addresses often correspond to physi-

    cally nearby locations that fall on the same memory rank

    and bank. As a result, workloads that cannot benefit from

    caching may end up hitting the same memory rank/bank

    pairs and, hence, be unable to benefit from aggregate

memory bandwidth either. In short, when combined with an unfortunate data layout, locality can hurt rather than

    help.

    Another surprise was the lack of literature on the be-

    havior and performance of system components outside

    the CPUs. The increasing processor speeds and the rise

    of multi-processor systems mean that, from now on,

    processing data is less likely to be the bottleneck than

    moving it around between CPUs and other I/O devices.

    Hence, it is important to be able to measure and under-

    stand system performance beyond the CPUs.

    Finally, we were surprised by the lack of efficiency

in moving data between system components. In many cases, data is unnecessarily transferred to memory (con-

    tributing to memory-bus load) when it could be directly

    transferred from the NIC to the appropriate CPU cache.

    Packet forwarding and processing workloads would ben-

    efit significantly from techniques along the lines of Di-

    rect Cache Access (DCA), where the memory controller

    directly places incoming packets into the right CPU

    cache by snooping the DMA transfer from NIC to mem-

    ory [17].
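
    As a purely conceptual model of what DCA changes (and emphatically not hardware or driver code), the toy sketch below represents DRAM and a CPU's last-level cache as two byte arrays: every inbound DMA write lands in "memory", and when DCA is enabled it is additionally injected into the "cache", so the subsequent read by the forwarding core would not need to go to DRAM.

        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        static uint8_t dram[1 << 16];  /* stand-in for main memory            */
        static uint8_t llc[1 << 12];   /* stand-in for the target CPU's cache */

        /* Illustrative model of an inbound DMA write handled by a DCA-capable
         * memory controller: the data always reaches memory; with DCA it is
         * also pushed into the cache of the CPU that will process the packet. */
        static void inbound_dma_write(uint32_t addr, const void *data, size_t len, int dca_on)
        {
            memcpy(&dram[addr], data, len);
            if (dca_on)
                memcpy(&llc[addr & 0xFFF], data, len);
        }

        int main(void)
        {
            const char pkt[] = "example packet";
            inbound_dma_write(0x100, pkt, sizeof pkt, 1 /* DCA enabled */);
            printf("cache now holds: %s\n", (const char *)&llc[0x100]);
            return 0;
        }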


    6 Related and Future Work

    The idea of a software router based on general-purpose hardware and operating systems is not a new one. In fact, the 13 NSFNET NSS (nodal switching subsystems) included 9 systems running Berkeley UNIX interconnected with a 4Mb/s IBM token ring. Click [18] and Scout [22] explored the question of how to architect router software for improved programmability and extensibility; SMP Click [12] extends the early Click architecture to better exploit the performance potential of multiprocessor PCs. These efforts focused primarily on designing the software architecture for packet processing and, while they do report on the performance of their systems, they do so at a fairly high level, using purely black-box evaluation. By contrast, our work assumes Click's software architecture but delves under the covers (of both hardware and software) to understand why performance is limited and how these limitations carry over to future server architectures.

    As a slight digression: it is somewhat interesting to note the role of time in the performance of these (fairly similar) software routers. The early NSF nodes achieved forwarding rates of 1K packets/sec (circa 1986); Click (at SOSP'99) reported a maximum forwarding rate of 330Kpps, which SMP Click improves to 494Kpps (2001); we find that unmodified Click achieves about 6.5Mpps. This is of course somewhat anecdotal, since we are not necessarily comparing the same Click configurations, but it nonetheless suggests the general trajectory.

    There is an extensive body of work on benchmarking various application workloads on general-purpose processors. The vast majority of this work is in the context of computation-centric workloads and benchmarks such as TPC-C. Closer to our interest in packet processing are efforts similar to those of Veal et al. [24] that look for the bottlenecks in server-like workloads involving a fair load of TCP termination. Their analysis reveals that such workloads are bottlenecked on the FSB address bus. A similar conclusion has been reached for several more traditional workloads. (We refer the reader to [24] for additional references to the literature on such evaluations.) As our results indicate, the bottleneck to packet processing lies elsewhere.

    There is similarly a large body of work on packet processing using specialized hardware (e.g., see [19] and the references therein). Most recently, Turner et al. describe a Supercharged PlanetLab Platform [23] for high-performance overlays that combines general-purpose servers with network processors (for slow- and fast-path processing, respectively); they achieve forwarding rates of up to 5Gbps for 130B packets. We focus instead on general-purpose processors, and our results suggest that these offer competitive performance.

    Closest to our work is a recent independent effort by Egi et al. [15]. Motivated by the goal of building high-performance virtualized routers on commodity hardware, the authors undertake a measurement study to understand the performance limitations of modern PCs. They observe similar performance and, like us, arrive at the conclusion that something is amiss in the memory system. Through inference based on black-box testing, the authors suggest that non-contiguous memory writes initiated by the PCIe controller are the likely culprit. Our access to chipset tools allows us to probe the internals of the memory system, and our findings there lead us to a somewhat different conclusion.

    Finally, our work also builds on a recent position paper making the case for cluster-based software routers [11]; that paper identifies the need to scale servers to line rate but does not explore the issue of bottlenecks and performance in any detail.

    In terms of future work, we plan to extend our work along three main directions. First, we are exploring the possibility of implementing the modified descriptor and buffer-allocator schemes described above. Second, we hope to repeat our analysis on the Nehalem server platforms once they become available [8]. Finally, we are currently working to build a cluster-based router prototype as described in earlier work [11] and hope to leverage our findings here to both evaluate and improve that prototype.

    7 Conclusion

    A long-held and widespread perception has been that general-purpose processors are incapable of high-speed packet forwarding, motivating an entire industry around the development of specialized (and often expensive) network equipment. Likewise, the barrier to scalability has been variously attributed to limitations in I/O, memory throughput, and other factors. While these notions might each have been true at various points in time, modern PC technology evolves rapidly, and hence it is important that we calibrate our perceptions against the current state of technology. In this paper, we revisit old questions about the scalability of in-software packet processing in the context of current and emerging off-the-shelf server technology. Another, perhaps more important, contribution of our work is to offer concrete data on questions that have often been answered through anecdotal or indirect experience.

    Our results suggest that, particularly with a little care, modern server platforms do in fact hold the potential to scale to the high rates typically associated with specialized network equipment, and that emerging technology trends (multicore, NUMA-like memory architectures, etc.) should only further improve this scalability. We hope that our results, taken together with the growing need for more flexible network infrastructure, will


    spur further exploration into the role of commodity PC

    hardware/software in building future networks.

    References

    [1] Intel 10 Gigabit XF SR Server Adapters. http://www.intel.com/network/connectivity/products/10gbexfsrserveradapter.htm.
    [2] Intel QuickPath Interconnect. http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect.
    [3] Intel Xeon Processor 5000 Sequence. http://www.intel.com/products/processor/xeon5000.
    [4] NetFPGA. http://yuba.stanford.edu/NetFPGA/.
    [5] Next-Generation Intel Microarchitecture. http://www.intel.com/technology/architecture-silicon/next-gen.
    [6] Vyatta: Open Source Networking. http://www.vyatta.com/products/.
    [7] Cisco Opening Up IOS. http://www.networkworld.com/news/2007/121207-cisco-ios.html, Dec. 2007.
    [8] Intel Demonstrates Industry's First 32nm Chip and Next-Generation Nehalem Microprocessor Architecture. Intel News Release, Sept. 2007. http://www.intel.com/pressroom/archive/releases/20070918corp_a.htm.
    [9] Juniper Open IP Solution Development Program. http://www.juniper.net/company/presscenter/pr/2007/pr-071210.html, 2007.
    [10] Intel Corporation's Multicore Architecture Briefing, Mar. 2008. http://www.intel.com/pressroom/archive/releases/20080317fact.htm.
    [11] K. Argyraki et al. Can software routers scale? In ACM SIGCOMM Workshop on Programmable Routers for Extensible Services, Aug. 2008.
    [12] B. Chen and R. Morris. Flexible control of parallelism in a multiprocessor PC router. In Proc. of the USENIX Technical Conference, June 2001.
    [13] Cisco Systems, Inc. Introduction to Cisco IOS Software. http://www.ciscopress.com/articles/.
    [14] D. Comer. Network Processors. http://www.cisco.com/web/about/ac123/ac147/archived_issues/ipj_7-4/network_processors.html.
    [15] N. Egi, A. Greenhalgh, M. Handley, M. Hoerdt, and L. Mathy. Towards performant virtual routers on commodity hardware. Technical Report Research Note RN/08/XX, University College London, Lancaster University, May 2008.
    [16] R. Ennals, R. Sharp, and A. Mycroft. Task partitioning for multi-core network processors. In Proc. of the International Conference on Compiler Construction, 2005.
    [17] R. Huggahalli, R. Iyer, and S. Tetrick. Direct Cache Access for High Bandwidth Network I/O. In Proc. of ISCA, 2005.
    [18] E. Kohler, R. Morris, B. Chen, J. Jannotti, and F. Kaashoek. The Click modular router. ACM Transactions on Computer Systems, 18(3):263-297, Aug. 2000.
    [19] J. Mudigonda, H. Vin, and S. W. Keckler. Reconciling performance and programmability in networking systems. In Proc. of SIGCOMM, 2007.
    [20] PCI-SIG. PCI Express Base 2.0 Specification, 2007. http://www.pcisig.com/specifications/pciexpress/base2.
    [21] ACM SIGCOMM Workshop on Programmable Routers for Extensible Services. http://www.sigcomm.org/sigcomm2008/workshops/presto/, 2008.
    [22] T. Spalink, S. Karlin, L. Peterson, and Y. Gottlieb. Building a Robust Software-Based Router Using Network Processors. In Proc. of the 18th ACM SOSP, 2001.
    [23] J. Turner et al. Supercharging PlanetLab: a high performance, multi-application, overlay network platform. In Proc. of SIGCOMM, 2007.
    [24] B. Veal and A. Foong. Performance scalability of a multi-core web server. In Proc. of ACM ANCS, Dec. 2007.
