Performance Optimization of TCP/IP over 10 Gigabit Ethernet by Precise Instrumentation

Takeshi Yoshino∗, Yutaka Sugawara†, Katsushi Inagami†, Junji Tamatsukuri†, Mary Inaba† and Kei Hiraki†

∗Google Japan Inc.
[email protected]

†The University of Tokyo

{sugawara,inagami,junji,mary,hiraki}@is.s.u-tokyo.ac.jp

Abstract

End-to-end communications over 10 Gigabit Ethernet (10GbE) WANs have become popular. However, there are difficulties that need to be solved before Long Fat-pipe Networks (LFNs) can be utilized with TCP. We observed that the following caused performance depression: short-term bursty data transfer, mismatch between TCP and hardware support, and excess CPU load. In this research, we have established systematic methodologies to optimize TCP on LFNs. In order to pinpoint the causes of performance depression, we analyzed real networks precisely by using our hardware-based wire-rate analyzer with 100-ns time resolution. We took the following actions on the basis of the observations: (1) utilizing hardware-based pacing to avoid unnecessary packet losses due to collisions at bottlenecks, (2) modifying TCP to adapt to the packet coalescing mechanism, and (3) modifying programs to reduce memory copies. We have achieved a constant throughput of 9.08 Gbps on a 500-ms RTT network for 5 hours. Our approach has overcome the difficulties on single-end 10GbE LFNs.

1. Introduction

With the rapid progress in network technologies, the capacities of backbone networks are increasing. Technologies such as WAN PHY for 10 Gigabit Ethernet (10GbE) have brought us closer to realizing long-distance, large-bandwidth networks. Such networks are called long fat-pipe networks (LFNs). Currently, there are 10GbE WAN PHY paths across the oceans, connected with each other through 10GbE L2/L3 switches.

On the other hand, we can also use LFNs for end-to-end communications. 10GbE network interface cards (NICs) have become popular. They provide a capacity of approximately 9 Gbps between commodity PCs at distant locations, for example, Tokyo and Amsterdam.

TCP/IP is the widely used standard protocol for end-to-end reliable communication [7], [12], [11], [13], [3]. TCP ensures that packets are delivered to the destination and that the delivered packets are not broken. TCP detects bottlenecks along the path to the destination and tries to avoid them. Since TCP is a mature technology, we cannot easily replace it with other protocols.

However, it is well known that it is difficult to obtain high performance over LFNs by using TCP. The performance of TCP communications can be less than 1 Gbps when we simply connect hosts to 10GbE LFNs using 10GbE NICs.

One reason is the congestion control mechanism of TCP. The sender host has to keep transmitted data in a buffer until the corresponding ACKs are returned, in case it has to retransmit lost packets. The buffer is called the congestion window. The data that has already been sent and not yet been acknowledged is called in-flight data. The amount of in-flight data is limited by the congestion window size. TCP realizes congestion control by changing this size. When the congestion window size is small, the transmission rate will be low because the amount of in-flight data is limited. For efficient communication, the congestion window size should be greater than the bandwidth-delay product (BDP). The term delay here means the round-trip time (RTT). The RTT will be large on LFNs and, therefore, the BDP will be large. On the other hand, the congestion window size increases at a rate that is inversely proportional to the RTT. A larger RTT reduces the utilization at the start-up phase and the loss recovery phase, when TCP resets the congestion window size to a small value.
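As a short worked example (using the 10-Gbps bandwidth and the 100-ms RTT figure quoted later in this paper), the window must be at least:

    % Bandwidth-delay product for 10 Gbps and a 100-ms RTT (illustrative)
    \mathrm{BDP} = B \times \mathrm{RTT}
                 = 10^{10}\,\mathrm{bit/s} \times 0.1\,\mathrm{s}
                 = 10^{9}\,\mathrm{bit}
                 = 125\,\mathrm{MB}

For a 500-ms RTT, as in the record run mentioned in the abstract, the same formula gives roughly 625 MB.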

In order to solve this issue, various improvements to the congestion control mechanism have been proposed; however, there is no perfect solution. If a TCP is tuned manually for Gigabit Ethernet (GbE), it will not fit 10GbE. The same holds for the RTT. TCPs not designed for 10GbE will remain at low utilization or cause frequent packet losses.

The congestion window consumes a large amount of memory, and the access pattern for it has no locality. This reduces the cache performance, which then affects the overall performance.

It is known that packet losses occur easily in TCP communications over LFNs even though the available bandwidth is considerably larger than the actual throughput. On GbE, we found that intermittent bursty data transmissions that occur with a period of one RTT induce packet losses even when there is no congestion [18], [14]. In order to solve this problem, we proposed a packet pacing method that produces good performance. With respect to 10GbE, we found that packet pacing is again very helpful; however, we could not obtain good performance levels, such as 90% utilization of the bandwidth, by packet pacing alone. Irrespective of how carefully we set the pacing for microscopic data transfer rates, the throughput hit a ceiling that was considerably lower than the bandwidth. We observed the anomaly carefully and found that the issue was caused not by a single bottleneck but by a combination of bottlenecks in several areas such as the host interface, CPU load, interrupts, memory access, and the network. Networks have become almost as fast as CPUs and memory today. The situation is drastically different from that in the GbE era.

In this paper, we present a systematic approach to achieving efficient utilization of 10GbE LFNs by using TCP.

Initially, we addressed the problems with TCP over 10GbE LFNs by an inductive approach. That is, we analyzed the problems by observing real data transfer over 10GbE LFNs. Since various factors in PCs, NICs, switches, and the network affect the performance, we had to eliminate the effects of the factors that we did not wish to observe, so that we could analyze and discuss individual problems separately. In addition to real LFNs, we have performed experiments on a pseudo LFN by using a network emulator to observe only the effect of the RTT. We have employed and developed methods that enable us to observe the factors separately. We have designed and implemented a precise packet analyzer called Traffic Analysis Precise Enhancement Engine (TAPEE) to analyze and clarify the issues. We have observed TCP communications over LFNs precisely by using these analysis tools, and have categorized the issues systematically.

Then, we have listed the pros and cons of the existing methods in order to tune them appropriately. We selected the method that achieved the lowest possible CPU utilization with the least special equipment, on the basis of our experiments and discussion. We propose modifications to the existing methods to compensate for their weaknesses.

This paper is organized as follows: In section 2, we categorize the problems with TCP on LFNs. In sections 3 through 7, we introduce TAPEE for detailed analysis and examine the effect of various optimization methods for efficient TCP communication over LFNs. Then, we show the approach and result of our trial on the Internet2 Land Speed Record (I2-LSR) as an example. We discuss related work in section 8 and conclude in section 9.

2. Categorizing Issues of TCP over LFNs and Existing Solutions

Since the speeds of networks have increased to nearly the speed of memory, the parameters and factors that must be considered for end-to-end communications are drastically different from those in the GbE era. We have categorized the factors in current computer systems and LFNs that influence TCP communications. We have also categorized the existing methods that address these issues, and we evaluate how each method fits 10GbE in the later sections.

2.1. LAN PHY and WAN PHY

There are two families, 10GBASE-R and 10GBASE-W, in the 10GbE specification for a single pair of optical fibers. The PHY of 10GBASE-W is called WAN PHY, and the PHY of 10GBASE-R is called LAN PHY. 10GBASE-R has three layers in the PHY, while 10GBASE-W has a WAN interface sublayer (WIS) in addition to the three layers. The WIS encapsulates Ethernet frames into SONET/SDH frames in order to transmit them on OC-192c/STM-64c. This function enables 10GbE to interoperate with SONET/SDH equipment more easily than by using Packet over SONET/SDH (POS) technology [9]. Most intercontinental LFNs include some WAN PHY paths.

WAN PHY networks have the capacity to transmit Ethernet data at 9.286 Gbps, while 10GBASE-R supports 10.000 Gbps. This is because of the capacity of OC-192c, the SONET/SDH encapsulation overhead, and interframe stretching [10]. We describe this bottleneck as the WAN PHY bottleneck.

Normally, LAN PHY equipment is used for edge networks. Therefore, most intercontinental LFNs have bottleneck points at the WAN PHY–LAN PHY conversion points. When a sender transmits bursty data, the transfer rate hits 10 Gbps on the microscopic scale. This can happen easily in TCP communications because TCP sends data at the maximum rate until the size of the in-flight data reaches the congestion window size. We can observe the bursty data by drawing graphs at a 1-ms or finer scale. The bursts induce buffer overflow at the bottleneck points and, therefore, induce packet losses.

Even if the sender controls the rate precisely, intermediate switches may change the intervals between packets. After packets pass through switches, the Inter Packet Gaps (IPGs) can be changed by queueing, and bursty transfer occurs. We should take these effects into account.

2.2. CPU Load and Memory Load

TCP uses a variable called the window size to limit the amount of in-flight data. The window size must be set to a sufficiently large value in order to fill the capacity for a large bandwidth and a large delay. That value is the bandwidth-delay product (BDP).

When we use a large window, TCP demands a large amount of memory for a buffer called the retransmission queue. The retransmission queue maintains the transmitted data until ACKs arrive. For example, TCP requires a 125 MB retransmission queue for a bandwidth of 10 Gbps and an RTT of 100 ms. We set the sysctl values for the buffer size to large values in order to provide sufficient memory [1], [15]. We also set the advertised window size to a large value through the socket options.
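As a hedged illustration of this buffer tuning (a sketch, not the exact settings used in our experiments), an application can request BDP-sized socket buffers with setsockopt(2); on Linux, kernel-wide caps such as net.core.wmem_max, net.core.rmem_max, net.ipv4.tcp_wmem, and net.ipv4.tcp_rmem must also be raised, or the requested values are clamped.

    /* Sketch: request BDP-sized TCP buffers for one socket.
     * Assumes the system-wide caps have already been raised;
     * otherwise the kernel clamps these values. */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    int main(void)
    {
        /* 10 Gbps x 100 ms RTT = 125 MB (bandwidth-delay product) */
        int bdp_bytes = 125 * 1024 * 1024;
        int sock = socket(AF_INET6, SOCK_STREAM, 0);

        if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &bdp_bytes, sizeof(bdp_bytes)) < 0)
            perror("SO_SNDBUF");
        if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bdp_bytes, sizeof(bdp_bytes)) < 0)
            perror("SO_RCVBUF");

        /* ... connect(), transfer, close() ... */
        return 0;
    }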

The memory for the retransmission queue is accessed with no locality, which reduces the cache performance and introduces additional demands on the memory bandwidth.

Memory copy is performed when the TCP stack creates packets and when the communication data are passed between the user application and the kernel. Current computer systems have a memory bandwidth of approximately 10 GB/s. For example, four-channel DDR2-667 FB-DIMMs connected to an Intel 5000P chipset have a read speed of 21.4 GB/s and a write speed of 10.7 GB/s, while PCI-Express x8 achieves transfer rates of 2 GB/s. However, the speed of memory copy by the CPU is limited to approximately 4 GB/s. Therefore, copying data at 1.25 GB/s is a heavy load for the hosts. We tried to reduce memory copies between the user application and the kernel by utilizing system calls for zerocopy operation.
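To make this load concrete (a rough back-of-the-envelope estimate, not a measurement from our experiments): a 10-Gbps stream is 1.25 GB/s of payload, and a single user-to-kernel copy both reads and writes that payload, so the copy alone generates at least 2.5 GB/s of memory traffic:

    % Rough estimate of the memory traffic caused by one user-to-kernel copy at line rate
    10\,\mathrm{Gbps} = 1.25\,\mathrm{GB/s}, \qquad
    \text{copy traffic} \ge 2 \times 1.25\,\mathrm{GB/s} = 2.5\,\mathrm{GB/s}
    \approx 60\%\ \text{of a}\ {\sim}4\,\mathrm{GB/s}\ \text{CPU copy rate}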

When there is a heavy load on the CPU, packets can be lost because of buffer overflow, interrupt handling failure, and so on. Since computers have roles other than communication, the CPU utilization for TCP must be minimized. The causes of CPU load are, for example, checksum calculation, maintenance of state information, memory copy, and communication with the NIC. Currently, TCP/IPv4 communication is supported by various hardware offloading technologies in order to reduce CPU utilization; offloading is effective but expensive.

Receive and transmit checksumming is a function that offloads the calculation of checksum fields. Scatter-gather I/O is a method that allows a NIC to use fragments of data in memory for transmission instead of creating a single chunk of data before transmission. In order to reduce the number of memory copies, scatter-gather I/O is essential.

TCP Segmentation Offload (TSO) and Generic Segmentation Offload (GSO) [24] are technologies in which hardware replaces software for the jobs of splitting data into individual packets, generating headers, calculating checksums, and putting them together to create packets. Hardware can calculate checksums more efficiently than software.

Large send offload (LSO) means splitting a TCP packet that is larger than the maximum transmission unit (MTU) into smaller packets by hardware. LSO is done together with TSO when TSO is enabled. Using a large amount of data per packet reduces overheads.

Large receive offload (LRO) concatenates multiple TCP packets into a single TCP packet by using hardware before the packets are delivered to the kernel. Packets are delivered to the application in larger units. This reduces the overheads of receiving data. The kernel TCP stack recognizes these concatenated packets as a single packet. This may affect the window scaling.

The TCP Offload Engine (TOE) is a technology that handles all TCP-related processing in hardware. The effect of TOE is a reduction of processing and a reduction of the CPU utilization for copying source data to the transmission queue. TOE was effective when CPUs were slow; however, it has disadvantages in cost and flexibility, such as a limited choice of congestion control algorithms.

Delayed ACK is a method to reduce the number of ACKs: the receiver host transmits a single ACK to acknowledge the arrival of multiple packets [8]. It is employed in large-bandwidth networks to reduce interrupts. Currently, TCP on Linux delays ACK transmission up to 1 ACK per 2 data packets under the standard settings, because an excessively delayed ACK may compromise congestion control.

Packet coalescing is a technology used to reduce interrupt processing for received packets. For example, when transmitting 9150-octet payloads at 8.94 Gbps, packets arrive at approximately 8.94 Gbps / (9150 octets × 8 bits/octet) ≈ 122 kpps when there is no delayed ACK. This causes a large load on the CPU. The NIC's packet coalescing function suppresses interrupts and lets the kernel collect multiple input packets from the NIC with only one interrupt. When the packet coalescing parameter is too aggressive, the packets collected in the receive buffer of the NIC overflow, causing packet losses.

We found major limitations of packet coalescing in our experiments. When we used packet coalescing and LRO, the scaling of TCP was retarded drastically. We address this issue in a later section.

2.3. Congestion Control Algorithm

The window size is the smaller of the advertised window size (receiver window size) and the congestion window size. The congestion window size is changed by the congestion control algorithms. It is increased on the arrival of ACKs, and is decreased at timeouts and on the arrival of duplicate ACKs, which indicate packet loss. A large RTT retards the growth of the congestion window size. A very low scaling speed decreases bandwidth utilization in the slow-start phase and the recovery phase.

When congestion occurs in LFNs, more packets are affected and can be lost than when congestion occurs in networks with a lower BDP. Currently, most 10GbE switches and NICs accept packets of up to approximately 9200 octets. This value is only about six times the 1500-octet standard frame size used in Fast Ethernet. In order to identify the bottleneck points correctly without losing many packets, TCP on 10GbE must scale more moderately than on GbE and Fast Ethernet.

In contrast to the original TCPs (Tahoe, Reno, and NewReno), new TCP congestion control algorithms have been proposed and implemented to adapt to faster networks such as GbE. A congestion control algorithm is affected by environmental parameters such as the degree of delay in the ACKs, the maximum transmission unit (MTU), and the ratio of the link speed of the network to the MTU. When these TCPs are used on 10GbE with a large delay, their behavior differs from that of the original mathematical model because of the large number of packets. We focus on BIC-TCP [25], [5], which is the default algorithm in Linux 2.6.18, currently the most common kernel among enterprise Linux distributions.

Figure 1. Architecture of TAPEE (optical tap, TGNLE-1, switch, logging hosts, and a control host connected via USB)

3. Traffic Analysis Precise Enhancement Engine

Traffic Analysis Precise Enhancement Engine (TAPEE) is a packet-capturing system based on the coordination of special hardware and commodity PCs. TAPEE consists of an optical tap, a special hardware device called TGNLE-1 [21], commodity PCs, and a 10GbE network switch (if required), as shown in Figure 1. The optical tap splits the light on the observed line to TGNLE-1 and the original destination. A function that supports packet capturing is programmed on TGNLE-1. TGNLE-1 processes incoming packets and outputs them to the PCs. The PCs log and analyze the packets. We have to insert the switch for media conversion when the NICs on the PCs are equipped with 10GBASE-SR, because TGNLE-1 accepts only 10GBASE-LR.

The special hardware is essential for recognizing the issues accurately. Software-based approaches are insufficient because they have lower time granularity. Software-based analyzers are affected by queuing, interrupts, other processes, other devices, and so on along the path from the NICs. These effects mask phenomena related to the issues that we want to analyze precisely.

TGNLE-1 is FPGA-based reconfigurable packet-processing equipment for 10GbE. Figure 2 shows the components of TGNLE-1. TGNLE-1 has two FPGAs (Xilinx XC2VP50) connected to 10GbE interfaces. Packets received from port0 Rx go to FPGA-0, and are then processed in FPGA-0 by the user-implemented function. Then, the packets output from FPGA-0 are transmitted from port1 Tx. This flow is indicated by the red arrows in the figure. The opposite flow is indicated by the blue arrows. Users can configure parameters in the FPGAs via the control host. These components are implemented in a 1U chassis.

Figure 3 shows how TAPEE works conceptually. TGNLE-1 clips the headers of received packets and places precise timestamps with a resolution of 100 ns on them. The clipped headers are buffered until the number of buffered headers reaches a specified value. Then, the headers are repacked as a single packet, transmitted to a logging host, and logged to the HDDs. Clipping and repacking reduce the load on the logging hosts, in terms of both the amount of data transferred and the number of interrupts. These features enable long-term, precise analysis of high-speed communications.

Figure 2. TGNLE-1 block diagram (two XC2VP50-5 FPGAs with 2 GB of DDR266 SDRAM each, two 10GBASE-LR ports with 300-pin MSA optical transceivers and MAC/PHY, and a USB interface to the control host)

Figure 3. Conceptual diagram of TAPEE (received frames are clipped to their headers, tagged with timestamps, stacked, and transmitted as one repacked frame)

After a transfer experiment, the headers are parsed on the logging hosts, tallied by our software, and visualized. Users can analyze packets freely and flexibly by using their own software on the logging hosts. We can observe various behavioral characteristics of TCP by plotting the transitions of the rate and of the values in header fields.

We used IBM eServer x345 servers and HP ProLiant DL145 G2 servers with Linux for logging. They have dual Intel Xeon processors and 2 GB of memory. We used Xilinx ISE 6.3.03i CAD and Verilog HDL for the implementation. The system clock in the FPGAs is 133 MHz. The function consumed 7545 of the 23616 slices.

We will present the results of our analysis of TCP data transfer on LFNs obtained by using TAPEE in later sections. By using TAPEE, we observed, visualized, and clarified the behavior of packets.

We wrote a program to process the log and draw graphs; see Figure 4 for an example. The x-coordinate is the raw timestamp value in seconds; therefore, the plots start from different positions. The blue plots are 1-ms-scale plots of the throughput of data packets. The green plots are 1000-ms-scale plots of the throughput of data packets. The gray plots are 1-ms-scale plots of the rate of advance of the acknowledgment number field of TCP; they do not plot the rate at which ACK packets are sent but the amount of data that is acknowledged by the ACKs. The red lines show the points where duplicate ACKs were observed, and their heights represent the number of duplications. This indicates the occurrence of packet losses between TAPEE and the receiver host, or that the receiver host has dropped packets by itself. You can see an orange triangle around the point (40, 2.5) in Figure 4. The orange triangles show the points where the sequence field value became discontinuous, which means that the difference between the sequence field values of the current packet and the previously observed packet did not equal the size of the previously observed packet. This indicates the occurrence of packet losses between TAPEE and the sender host. You can see a pink triangle right below the point (80, 0.0) in Figure 4. The pink triangles show the points where the sequence field was rewound to a smaller value, which indicates the occurrence of a retransmission. The vertical coordinates of the orange and pink triangles show the magnitude of the leap and the rewind on a certain scale.

We adopt a moving average, as is often used in finance. For example, a point (t, y) on the blue plot indicates that a packet passed the observation point at time t, and that the total amount of data in the packets that passed between t − 1 ms and t, divided by 1 ms, is y.
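A minimal sketch of how such a 1-ms plot can be computed from the logged headers is shown below (the record layout and field names are assumptions; the actual tally software is not reproduced here):

    /* Sketch: compute the 1-ms moving-average throughput at each logged packet.
     * Assumes records sorted by timestamp; "t_ns" is the 100-ns-resolution
     * timestamp converted to nanoseconds, "bytes" is the captured frame size. */
    #include <stdio.h>
    #include <stdint.h>

    struct rec { uint64_t t_ns; uint32_t bytes; };

    void window_throughput(const struct rec *r, size_t n, uint64_t win_ns)
    {
        size_t head = 0;            /* oldest record still inside the window */
        uint64_t win_bytes = 0;

        for (size_t i = 0; i < n; i++) {
            win_bytes += r[i].bytes;
            while (r[head].t_ns + win_ns <= r[i].t_ns)
                win_bytes -= r[head++].bytes;   /* drop records older than the window */
            /* bits per nanosecond equals Gbps */
            double gbps = (double)win_bytes * 8.0 / (double)win_ns;
            printf("%llu %.3f\n", (unsigned long long)r[i].t_ns, gbps);
        }
    }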

Note that the 1-ms-scale throughput plot swings between 10 Gbps and 0 Gbps, while the 1000-ms-scale throughput plot is stable between 1 Gbps and 2 Gbps. By using precise analysis such as this, we can observe microscopic phenomena.

4. Packet Pacing

In order to avoid the packet losses induced by the 9.286-Gbps WAN PHY bottleneck, we have to use some method in addition to TCP. Ethernet does not have a precise rate control function. When data arrives from the kernel's TCP stack, the NIC transmits the data immediately. The transfer becomes bursty, bringing the 1-ms-scale rate up to near 10 Gbps. We have to use packet pacing to avoid packet loss and utilize such networks. We compared several methods for avoiding packet loss at bottlenecks through analysis using TAPEE. We also used a zerocopy version of Iperf [19] in the experiments in this section in order to reduce the load related to memory copy. We tried to achieve rate control, and to avoid collisions at bottlenecks, with the fewest specific pieces of equipment. In order to improve the bottleneck probing of TCP, we implemented a coordination mechanism between the TCP stack and the IPG control function of the NIC. Then, we evaluated the methods using TAPEE.

4.1. Bottleneck at LAN PHY–WAN PHY Conversion

Since we now have sufficient processing power (Intel Xeon 5160) and bus capacity (PCI-Express x8), we can obtain the maximum performance from the application level down to the kernel level. However, this leads to excess traffic for a WAN PHY and LAN PHY mixed circuit.

We used an intercontinental circuit from Japan to the U.S.A. and back to Japan. It consists of two trans-Pacific OC-192c lines, JGN2 and IEEAF. Packets go through Tokyo, Chicago, Seattle, and back to Tokyo. We used jumbo frames with up to a 9190-octet IP MTU. This path has an RTT of 309 ms. We measured the 1-ms-scale behavior of a transfer at a 9.11-Gbps IPv6 payload rate.

Figure 4 shows the behavior of the transfer when we use no pacing method. A host on an edge network consisting of LAN PHY can transmit packets at the full rate of 10 Gbps. From the 1-ms-scale throughput (blue plot), packets are transmitted in bursts at a rate near 10 Gbps because a bulk of ACKs arrives every RTT and the congestion window size grows in bursts. The bursty transfer occurred regardless of the value of the 1000-ms throughput. In the phases where the congestion window size grows, especially in the slow-start phase, the arrival of ACKs increases the congestion window size rapidly. TCP sends data at the maximum rate until the size of the in-flight data hits the congestion window size. As a result of these characteristics of TCP, the transfer became bursty. By magnifying the graph, we can observe that the bursty transfer occurs RTT-periodically [26]. This instantaneous flooding of packets induces packet losses at the WAN PHY bottleneck points, and the switch drops received packets in the LAN PHY–WAN PHY conversion process. This result shows that instantaneous flooding on a 1-ms scale induces packet losses. The red lines and orange triangles indicate packet losses and retransmissions. Because of these packet losses, TCP exited the slow-start phase and reduced the rate at which it increased the congestion window size. We cannot utilize 10GbE LFNs without eliminating these bursty transmissions by using packet pacing.

Figure 4. Without a pacing method, the transfer suffered from packet losses before scaling up to 2 Gbps.

4.2. Experiments on a Pseudo LFN

Because real LFNs are precious, experiments are often carried out on pseudo LFNs. A pseudo LFN is a virtual network that emulates large delays by delaying the forwarding of packets. By using a pseudo LFN, we can observe only the effect caused by delay. This is a merit for analyzing the effect of a large delay; however, it eliminates various effects on communications that exist on real LFNs. Clarifying the difference between real LFNs and pseudo LFNs is important.

We used intercontinental circuits for the real LFNs and an Anue H Series network emulator [4] for the pseudo LFN. The Anue receives data, stores it in a large buffer, and transmits the stored data from another port after a specified delay time has passed. The delay can be set from 0 ms to over 500 ms. We have analyzed and clarified the differences between pseudo and real LFNs in previous work [26].

4.3. Application-Level Rate Control

Figure 5. Application-level rate control

Suppressing the rate at which an application supplies data to the TCP stack is a well-known, straightforward method of suppressing the rate at which data is output to the network. We have modified the transmission procedure in Iperf so that it suppresses the data-supplying rate. It inserts a delay between write(2) system calls by using a timer.
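A minimal sketch of this kind of delay insertion is shown below (the actual Iperf patch is not reproduced here; the function name, buffer, and target rate are illustrative):

    /* Sketch: pace write(2) calls to an approximate target rate by sleeping
     * between calls.  This only limits the average rate the application
     * offers to the TCP stack; as discussed next, it cannot prevent the
     * stack itself from draining its queue in bursts. */
    #include <time.h>
    #include <unistd.h>

    void paced_send(int sock, const char *buf, size_t len, double target_gbps)
    {
        /* time needed to send one buffer at the target rate:
         * bits / (Gbit/s) = nanoseconds */
        long interval_ns = (long)(len * 8 / target_gbps);
        struct timespec gap = { .tv_sec  = interval_ns / 1000000000L,
                                .tv_nsec = interval_ns % 1000000000L };

        for (;;) {
            if (write(sock, buf, len) < 0)
                break;
            nanosleep(&gap, NULL);   /* crude timer-based delay between writes */
        }
    }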

We suppressed the data-supplying rate to approximately 3 Gbps. In order to observe only the behavior related to this modification, we suppressed the 1-ms-scale rate to a value less than 8.12 Gbps by using the IPG control function of the NIC. Figure 5 shows the result. The throughput (blue plot) hit the maximum rate while the window size was small because TCP was in the slow-start phase. Since Iperf does not know the status of the congestion control, it supplies data at the rate of 3 Gbps even when TCP is in the slow-start phase. The TCP stack cannot flush the data that comes in at 3 Gbps and, therefore, the transmission queue of the TCP stack is filled with the superfluous data. When an ACK returns, the TCP stack pops data from the queue and transmits a packet immediately. We cannot avoid the bursty transfer in the slow-start phase by application-level rate control.

Application-level rate control does not make sense during the slow-start phase. The situation is the same as that of Iperf without the application-level rate control. Effective pacing cannot be achieved without a modification at the level of the TCP stack or lower.

4.4. Limiting the Window Size

By limiting the window size to a small value, we can suppress the transmission rate of TCP. Figure 6 shows the result of a transfer using an advertised window size of 200 MB. The macroscopic throughput (green plot) is suppressed to approximately 2 Gbps. However, the 1-ms-scale rate (blue plot) reached approximately 9 Gbps right after the connection was opened. This bursty transmission induced packet losses during scale-up, and the speed remained at a very low value.

Transmissions concentrate around the timing of ACK arrival, and this produces bursty transfer even if the window size is set to a small value. Without averaging transmissions over the RTT, rate control based on the window size cannot avoid packet losses.

Figure 6. Limiting the window size

Figure 7. IPG control feature of NIC

4.5. Maintaining the Length of Inter Packet Gaps

The Chelsio S310E has a feature for controlling the length of inter packet gaps (IPGs). This function maintains the gap between two packets at a specified value. By keeping the gap at a certain length, the 1-ms-scale transmission rate never goes over the suppressed bandwidth calculated by the following formula: (suppressed bandwidth) = (full bandwidth) × (frame size) / ((frame size) + (IPG length)), where (frame size) is the length of an Ethernet frame from the preamble to the frame check sequence (FCS). We utilized this function to avoid packet losses at the LAN PHY–WAN PHY conversion. We can change the lower limit of the IPG length from 8 octets to 2040 octets.
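As a worked example under stated assumptions (an on-wire frame of roughly 9216 octets, i.e., a 9190-octet IP MTU plus Ethernet header, FCS, and preamble), the minimum IPG needed to stay under the 9.286-Gbps WAN PHY capacity would be approximately:

    % Minimum IPG for a ~9216-octet on-wire frame to stay below 9.286 Gbps (illustrative)
    \mathrm{IPG} \ge \mathrm{frame\ size} \times
      \left(\frac{\text{full bandwidth}}{\text{suppressed bandwidth}} - 1\right)
      = 9216 \times \left(\frac{10}{9.286} - 1\right) \approx 709\ \text{octets}

which is well within the configurable 8–2040-octet range.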

Figure 7 shows the result. The 1-ms-scale rate is suppressed by using the IPG control, and we managed to avoid packet losses at the LAN PHY–WAN PHY bottleneck points.

Figure 8. Coordination of the IPG control feature of the NIC and the Linux TCP stack

4.6. Coordination of IPG Control of NIC and TCP stack

We modified the BIC-TCP module to control the IPG parameter automatically. The modified BIC-TCP calculates the appropriate IPG length on the basis of the following formula and changes the parameter according to the growth of the congestion window size: (IPG length) = (frame size) × ((RTT × bandwidth) / (congestion window size) − 1). Chelsio's IPG control parameter is accessible through the PCI memory space. The limited range of the IPG parameter is sufficient for covering the WAN PHY bottleneck. We tested the implementation on Linux 2.6.18 on a real LFN.
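A simplified sketch of this calculation is shown below (the actual kernel patch and the NIC register access are not shown; the function only evaluates the formula above and clamps the result to the S310E's supported range):

    /* Sketch: IPG length that matches the current congestion window, following
     * the formula above.  frame_size is the on-wire frame length in octets,
     * bandwidth_bps the link rate in bits per second, rtt_s the round-trip
     * time in seconds, and cwnd_bytes the congestion window in bytes. */
    static unsigned int ipg_for_cwnd(unsigned int frame_size,
                                     double bandwidth_bps, double rtt_s,
                                     double cwnd_bytes)
    {
        double bdp_bytes = rtt_s * bandwidth_bps / 8.0;       /* RTT x bandwidth */
        double ipg = frame_size * (bdp_bytes / cwnd_bytes - 1.0);

        if (ipg < 8.0)                 /* clamp to the NIC's configurable range */
            ipg = 8.0;
        if (ipg > 2040.0)
            ipg = 2040.0;
        return (unsigned int)ipg;
    }

As the congestion window grows toward the BDP, the computed IPG shrinks toward the minimum, which matches the behavior observed in Figure 8.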

Figure 8 shows the result. The 1-ms-scale bursts are suppressed in the slow-start phase. As the throughput grows, the IPG length decreases, and the top of the 1-ms average grows. Unlike in Figure 4, the throughput reached the maximum available on the real LFN. After packet losses occurred, the throughput started to recover without being stuck at 0 Gbps. More work is needed to obtain better utilization of the bandwidth; however, our method brought the average throughput up to approximately half the bandwidth. Coordinating the IPG control and the TCP stack is effective for the probing mechanism.

4.7. Using PCI Bus Bottleneck as Pacing Method

Initially, most 10GbE NICs were equipped with the PCI-X host interface. PCI-X 1.0 has a theoretical maximum bandwidth of only 8.5 Gbps, while PCI-X 2.0 supports speeds of over 10 Gbps. Since computers equipped with PCI-X 2.0 slots were uncommon, 10GbE NICs were often used with PCI-X 1.0 under this bandwidth limitation. We can utilize such bus bottlenecks as a pacing method for LAN PHY–WAN PHY mixed networks.

Figure 9 shows the throughput of a communication with NICs (Chelsio T210) installed on a PCI-X 1.0 bus. The transmission rate on the 1-ms scale (blue plot) is suppressed to approximately 6.6 Gbps. We could avoid packet losses at the LAN PHY–WAN PHY conversion by this method when we had to use a NIC with no pacing function. We can suppress the rate of PCI-Express by configuring the motherboard or by installing cards in a slower slot.

Figure 9. NICs are installed on PCI-X 1.0 slots

4.8. Discussion

IPG control at the NIC has the finest time granularity and is the most effective rate-suppressing method. For a single stream, binding the IPG control to that stream was sufficiently effective. Because the NIC does not have layer-3 information, it cannot control the rates of separate streams individually. Application-level control that is unaware of the congestion window size could not suppress bursts in the slow-start phase. To make TCP probing work correctly, we must coordinate the congestion window size and the precise rate control on the NIC.

5. Reduction of CPU Utilization

5.1. Segmentation Offload and Checksum Offload

Hardware can perform checksum calculation more efficiently than software. LSO is available when we enable TSO. LSO allows the TCP stack to create packets larger than the maximum segment size (MSS), and the hardware splits the packets passed from the kernel into TCP packets that fit the MSS. This reduces the load on the TCP stack. For TCP/IPv6, GSO is currently available instead of TSO.

Figure 10 shows a TCP/IPv6 transfer without GSO and Figure 11 shows a TCP/IPv6 transfer with GSO. GSO supported TCP/IPv6, and the throughput reached approximately 8 Gbps; however, it could not sustain that throughput. The peak throughput with GSO in the stable phase is larger than that without GSO by on the order of 100 Mbps. We should utilize TSO and GSO for high-speed communications to reduce unnecessary CPU load.

5.2. Copy Reduction

Memory copy produces a heavy load on 10GbE. When we simply use write(2) system calls and allocate the user-space and kernel-space buffers separately, data written by the user application passes through the user-space buffer, the kernel-space buffer, and the transmission queue, and finally arrives at the NIC by DMA.

Figure 10. TCP/IPv6 transfer without GSO. Zerocopy Iperf is not used.

Figure 11. TCP/IPv6 transfer with GSO. Zerocopy Iperf is not used.

Current Intel Core 2 based Xeon processor systems with DDR2-667 four-channel FB-DIMMs have approximately 10 GB/s of memory bandwidth. However, memory copy by the CPU is limited by the CPU speed, and therefore the copy speed is approximately 2 GB/s. Transfers between user and kernel memory are done by the CPU. This is a heavy load for TCP communication.

5.3. Zerocopy Iperf

When an application calls the write(2) system call in order to send data over the network, the kernel copies data from the user-space buffer owned by the Iperf application to the kernel-space buffer. This is done by the copy_from_user() function in the case of Linux. It is the standard scheme for dealing with system calls; however, it is a large overhead that wastes CPU and memory bandwidth. We want to get rid of this overhead for high-performance networking.

Iperf is a well-known network measurement tool. It is used in various network experiments for generating TCP traffic. Iperf uses the write(2) system call to transmit data.

Starting from this code, we developed zerocopy Iperf by patching Iperf's source code, using mmap(2) and sendfile(2) to replace write(2). We allocate the buffer in kernel memory space, which is safely readable and writable directly from the user process, by calling mmap(2) on a temporary file. After data is read from the file into the cache, the buffer has no I/O overhead except for flushing after writes. By passing the descriptor of the mapped file to sendfile(2), we can direct the kernel to transmit data from the kernel buffer to the NIC. This scheme safely reduces memory copies without modifying the kernel and also reduces CPU utilization. In our experiments, we used zerocopy Iperf to eliminate the CPU utilization related to memory copy, in order to observe the load from the other causes such as TCP segmentation, checksumming, and interrupts.
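The following is a minimal sketch of this transmit path (not the actual zerocopy Iperf patch; the file name and buffer size are illustrative): a temporary file is mapped so the application can fill the payload in place, and sendfile(2) then hands the same page-cache pages to the TCP stack without a user-to-kernel copy.

    /* Sketch of the mmap(2) + sendfile(2) transmit path used to avoid
     * copy_from_user().  Error handling is abbreviated. */
    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/sendfile.h>
    #include <unistd.h>

    #define BUF_LEN (8 * 1024 * 1024)

    static int send_zerocopy(int sock)
    {
        int fd = open("payload.tmp", O_RDWR | O_CREAT, 0600);
        ftruncate(fd, BUF_LEN);

        /* Map the file so the application can fill the buffer in place;
         * the pages live in the page cache, i.e., kernel memory. */
        char *buf = mmap(NULL, BUF_LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        memset(buf, 'x', BUF_LEN);

        /* sendfile(2) transmits directly from the page cache to the socket;
         * the same buffer is resent repeatedly to generate traffic. */
        for (;;) {
            off_t off = 0;
            if (sendfile(sock, fd, &off, BUF_LEN) < 0)
                break;
        }

        munmap(buf, BUF_LEN);
        close(fd);
        return 0;
    }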

Figure 10 shows a TCP/IPv6 transfer without GSO on the sender and without zerocopy Iperf. TCP/IPv6 is heavier than TCP/IPv4, and therefore the throughput could not go over approximately 7.6 Gbps. Figure 7 (which is in another section) shows a TCP/IPv6 transfer without GSO on the sender and with zerocopy Iperf. The CPU load is reduced, and the stream achieved the maximum throughput. Reducing the memory copy load was effective.

6. Packet Coalescing

6.1. Difference in Effects on Load Reduction between the Sender-Side Coalescing and the Receiver-Side Coalescing

We changed the packet coalescing parameter on both the sender host and the receiver host, and observed the CPU utilization and interrupts on the hosts and the behavior of packets by using TAPEE. The NIC is a Chelsio T210 configured with TOE disabled, on a PCI-X 1.0 slot (throughput is limited to approximately 6.6 Gbps). Packet coalescing is configured by the parameter rx-usecs, which specifies the time in microseconds over which received packets are coalesced into one interrupt. When the parameter is set to a larger value, packet coalescing becomes more aggressive. The configurable range of rx-usecs differs among interfaces: the T210 accepts values larger than 1000 μs, while the S310E accepts values of up to 819 μs. We aligned the range to that of the S310E. In order to examine the effect of packet coalescing only, we used zerocopy Iperf.
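We set the parameter through the driver's coalescing interface; a hedged sketch using the generic ethtool ioctl is shown below (the interface name is an assumption, and drivers may round or reject unsupported values), equivalent to running "ethtool -C eth1 rx-usecs 819".

    /* Sketch: set rx-usecs through the generic ethtool ioctl. */
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <linux/ethtool.h>
    #include <linux/sockios.h>

    static int set_rx_usecs(const char *ifname, unsigned int usecs)
    {
        struct ethtool_coalesce ec;
        struct ifreq ifr;
        int ret, fd = socket(AF_INET, SOCK_DGRAM, 0);

        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
        ifr.ifr_data = (char *)&ec;

        memset(&ec, 0, sizeof(ec));
        ec.cmd = ETHTOOL_GCOALESCE;            /* read current settings first */
        ret = ioctl(fd, SIOCETHTOOL, &ifr);
        if (ret == 0) {
            ec.cmd = ETHTOOL_SCOALESCE;        /* then change only rx-usecs */
            ec.rx_coalesce_usecs = usecs;
            ret = ioctl(fd, SIOCETHTOOL, &ifr);
        }
        close(fd);
        return ret;
    }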

1) We set rx-usecs to 50 μs (the default value of the T210) for the sender and rx-usecs to 819 μs for the receiver. The number of interrupts on the sender increased to 15.2 kintr/s and packet losses occurred. CPU utilization was less than 70%.

2) We set rx-usecs to 819 μs for the sender and rx-usecs to 50 μs for the receiver. The sender coalesced ACKs, and it could deal with both data processing and ACK processing. The maximum throughput was achieved. The rate of interrupts on the receiver host increased to approximately 16.7 kintr/s; however, this did not adversely affect the communication. The rate of interrupts on the sender host was suppressed to less than approximately 2.2 kintr/s, which relieved the load and avoided packet losses.

Figure 12 shows the result. The ACK rate concentrated near 6.6 Gbps.

3) We used rx-usecs of 50 μs for both hosts. The interrupts on the sender host increased to approximately 15 kintr/s. Those on the receiver host increased to approximately 17 kintr/s, and packet losses occurred.

4) We used rx-usecs of 819 μs for both hosts. The maximum throughput was achieved. Figure 13 shows the result. The plots of the ACK rate are scattered up to 10 Gbps in Figure 13.

Figure 12. Packet coalescing by using a Chelsio T210 on PCI-X 1.0. Only the sender host used aggressive packet coalescing.

Figure 13. Packet coalescing by using a Chelsio T210 on PCI-X 1.0. Both the sender host and the receiver host used aggressive packet coalescing.

The sender host was busy generating packets and maintaining the retransmission queue. Heavy interruption of the sender host by ACKs led to dropped ACKs or the failure of data packet transmission. The receiver host had the tasks of receiving data, checking its integrity, and transmitting ACKs, and was not as busy as the sender host. Packet coalescing at the sender host was essential for stable communication.

By comparing Figure 12 and Figure 13, we can observe that packet coalescing on the receiver changed the behavior of the ACKs, which is why the plots are considerably different. An ACK rate higher than the throughput indicates that the ACKs are packed together. We investigate this phenomenon further in the next subsection.

Figure 14. rx-usecs = 5 μs on the sender and rx-usecs = 5 μs on the receiver. Scaling took 75 seconds.

Figure 15. rx-usecs = 5 μs on the sender and rx-usecs = 819 μs on the receiver. Scaling took 228 seconds.

6.2. Effect on the Behavior of Packets

We examined the effect of packet coalescing on the behavior of packets more closely by using TAPEE. We changed the rx-usecs parameter and compared the results. We used a combination of the S310E and Linux 2.6.18. This combination could transmit a TCP/IPv4 flow at 9.56 Gbps on a 309-ms RTT. We suppressed the transfer rate by using the IPG control function in order to avoid packet losses at the LAN PHY–WAN PHY conversion points.

Figure 14 shows the result when we did not use aggressive packet coalescing. The blue plot shows the throughput on a 1-ms scale and the gray plot shows the rate of advance of the acknowledgment number field on a 1-ms scale. The 1-ms-scale throughput plots concentrate at the maximum value or 0 Gbps. It took approximately 75 seconds for the 1000-ms-scale throughput to hit the maximum value.

Figure 15 shows the result when we used aggressive packet coalescing only on the receiver host. During scaling, the throughput and the ACK rate are scattered between the maximum value and 0 Gbps. It took approximately 228 seconds for the 1000-ms-scale throughput to hit the maximum value.

Figure 16. rx-usecs = 819 μs on the sender and rx-usecs = 5 μs on the receiver. Scaling took 271 seconds.

Figure 17. Histogram of the frequency of ACKs classified by the amount of advance in the acknowledgment number field.

Figure 17 shows the frequency of ACKs for each amount of advance in the acknowledgment number field when we used packet coalescing on the receiver host. Packet coalescing on the receiver host changed the behavior of the ACKs. Because we used a 9190-octet MTU, the x-coordinate of the peak frequency should be 9150 octets or 18300 octets according to the standard behavior of TCP. However, the x-coordinate of the peak frequency is 128100 octets, which is 14 × 9150. Unlike in normal TCP, a very large delay in the ACKs occurred: the degree of delay is 1 ACK per 14 data packets.

These data show how packet coalescing changes the behavior of packets and changes the scaling. The scaling became different from what was originally modeled by the algorithm designer. Because the S310E has the LRO feature, when the NIC coalesces received packets, the TCP stack receives multiple TCP packets as a single TCP packet. When packet coalescing is more aggressive, LRO coalesces more packets. The TCP congestion control algorithm understands that large data packets have come. The receiver host sends fewer ACKs, corresponding to fewer data packets. This has the same effect on the sender host as largely delayed ACKs.

The current NewReno, BIC-TCP, and CUBIC-TCP cannot deal with such large delayed ACKs. Therefore, when there are large delayed ACKs, the congestion window size does not grow as designed in theory. Coalescing data packets induced large delayed ACKs and retarded the scaling.

Figure 16 shows the result when we used aggressive packet coalescing only on the sender host. It took approximately 271 seconds for the 1000-ms-scale throughput to hit the maximum value. Coalescing ACKs has the same effect as coalescing data packets. Both sender-side and receiver-side packet coalescing retarded the scaling.

6.3. Adaptation to Packet Coalescing by Utilizing the Appropriate Byte Counting

When packet coalescing is used, the number of ACKs per amount of data decreases to a value less than 1 ACK per 14 data packets. If TCP does not take large delayed ACKs into account, this retards the growth of the congestion window. A method called appropriate byte counting (ABC) [2] has been proposed and is already implemented in Linux. The TCP stack calculates the amount of acknowledged data and passes it to the TCP congestion control algorithm modules. The modules can take this value into account so that the congestion window grows on the basis of the exact amount of acknowledged data. The current implementation of TCP NewReno in Linux recognizes delayed ACKs by ABC. It increases the growth rate of the congestion window by up to 2 * SMSS per ACK, where SMSS is the sender maximum segment size.

The sysctl parameter net.ipv4.tcp_abc determines the behavior of ABC. Setting it to 0 disables ABC. Setting it to 1 makes ABC work conservatively: the scaling procedures in the NewReno module, except the one for the slow-start phase, work on the basis of the exact amount of acknowledged data. Setting it to 2 makes ABC work aggressively: all the scaling procedures in the NewReno module work on the basis of the exact amount of acknowledged data.

We have observed how ABC changes the behavior of BIC-TCP. BIC-TCP is affected by the parameter because it uses the scaling procedures of NewReno during its slow-start phase. In this experiment, packet coalescing is disabled and, therefore, the ACKs are delayed by at most 1 ACK per 2 data packets.

Figure 18 shows the transition of the amount of in-flight data under conservative ABC. The amount of in-flight data increased by up to 3/2 times every RTT in the slow-start phase instead of twice every RTT. This is because the slow-start procedure increases the congestion window size by only 1 * SMSS per ACK. The increase of the amount of in-flight data is limited by the congestion window size. Because we have sufficient CPU power and memory bandwidth and there was no packet loss in this experiment, the amount of in-flight data is nearly equal to the congestion window size.
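The factor of 3/2 follows directly from the delayed-ACK ratio; a short derivation, assuming exactly 1 ACK per 2 full-sized segments and an increase of 1 * SMSS per ACK:

    % Slow-start growth per RTT under conservative ABC with 1 ACK per 2 segments
    \text{ACKs per RTT} = \frac{\mathrm{cwnd}}{2\,\mathrm{SMSS}}, \qquad
    \Delta\mathrm{cwnd} = 1\,\mathrm{SMSS} \times \frac{\mathrm{cwnd}}{2\,\mathrm{SMSS}}
                        = \frac{\mathrm{cwnd}}{2}
    \;\Rightarrow\;
    \mathrm{cwnd}' = \tfrac{3}{2}\,\mathrm{cwnd}

With 2 * SMSS per ACK, the increment becomes cwnd per RTT, which reproduces the doubling shown in Figure 19.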

Figure 19 shows the transition of the amount of in-flight data under aggressive ABC. The amount of in-flight data increased by up to twice every RTT. The slow-start procedure increased the congestion window size by up to 2 * SMSS per ACK. This is the correct scaling of the slow-start phase of NewReno.

When there is no packet coalescing, ABC solves the conflict between delayed ACKs and congestion control.

Figure 18. Transition of the amount of in-flight data under the conservative ABC

Figure 19. Transition of the amount of in-flight data under the aggressive ABC

RFC 3465 restricts ABC from increasing the congestion window size by more than 2 * SMSS per ACK. As mentioned above, the NewReno implementation follows this restriction. However, in high-performance computing this restriction is not desirable because it drastically decreases the utilization of the bandwidth. The issue of bursty transmissions mentioned in the RFC should be solved by packet pacing, and the restriction should be relaxed. This restriction does not make much sense for avoiding bursty transmissions in the case of 10GbE LFNs.

The implementation of BIC-TCP in Linux uses an independent delayed-ACK tracking mechanism. It samples the degree of ACK delay and stores it in a variable delayed_ack. It takes the value of delayed_ack into account when it calculates the new value of the congestion window size. However, this does not work correctly when there are large delayed ACKs such as 1 ACK per 10 data packets.

Figure 20. Packet coalescing is not used. Our modified BIC-TCP is used in order to eliminate the effect of delayed ACKs on scaling.

Figure 21. Packet coalescing is used at the sender host. Our modified BIC-TCP is used in order to eliminate the effect of delayed ACKs on scaling.

We modified the implementation of BIC-TCP and the slow-start procedure in the NewReno implementation so that they take into account the exact amount of acknowledged data. The modified TCP increases the congestion window size by the exact amount of acknowledged data. We used the variables bytes_acked and mss_cache in struct tcp_sock. The TCP stack calculates them when the sysctl_tcp_abc sysctl parameter is set to a positive value. The amount of advance in the acknowledgment number of the TCP headers is accumulated into bytes_acked. The MSS is sampled from packets and stored in mss_cache. Both of the modifications to NewReno and BIC-TCP violate the restriction of RFC 3465.
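A simplified sketch of this byte-based growth is shown below (not the actual kernel patch; the surrounding BIC-TCP logic is omitted and only the slow-start increment is shown):

    /* Sketch: slow-start growth based on the exact number of acknowledged
     * bytes, as in our modification.  Unlike stock NewReno ABC, the increase
     * is not capped at 2 * SMSS per ACK, so one ACK covering 14 segments
     * grows cwnd by 14 segments.  (This intentionally violates RFC 3465.) */
    static unsigned int slow_start_bytes(unsigned int cwnd_segments,
                                         unsigned int ssthresh,
                                         unsigned int bytes_acked,
                                         unsigned int mss_cache)
    {
        unsigned int acked_segments = bytes_acked / mss_cache;

        if (cwnd_segments < ssthresh)
            cwnd_segments += acked_segments;   /* grow by the data actually ACKed */

        return cwnd_segments;
    }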

We have observed the behavior of the modified TCP by using an Anue-based pseudo LFN with a 500-ms RTT. Figure 20 and Figure 21 show the effect of our delayed-ACK-aware modification. Figure 20 shows the transition of throughput when packet coalescing is not used at either the receiver host or the sender host. Compared to Figure 14, which is the result of an experiment with the same configuration except for the BIC-TCP module used, the transition of throughput is almost the same. The scaling finished in approximately 66 seconds. Figure 21 shows the transition of throughput when packet coalescing is used at the receiver host. The scaling finished in approximately 67 seconds. Our modified BIC-TCP has removed the difference between the small delayed ACKs and the large delayed ACKs (induced by packet coalescing).

With our modification, TCP began to grow in the same way regardless of delayed ACKs. Eliminating the effect of delayed ACKs is important both for obtaining high performance and for evaluating TCP congestion control algorithms.

7. Finalizing 10GbE Era Internet2 Land Speed Record

With all the knowledge and techniques discussed in the previous sections, we challenged the Internet2 Land Speed Record (I2-LSR) [16] and demonstrated the effectiveness of our approach. We achieved the IPv6 record of 9.08 Gbps for 5 hours on a round-the-world circuit with a length of 32372 km and an RTT of 522 ms. We used the Chelsio S310E on a PCI Express x8 bus, Intel Xeon 5160 hosts, and TCP/IPv6. The round-the-world circuit consisted of four overseas 10GbE WAN PHY lines and several overland 10GbE LAN PHY lines. The overseas lines were the trans-Pacific IEEAF, the trans-Pacific JGN2, and the trans-Atlantic SURFnet lines. We connected these lines with L2/L3 switches in NOCs in Tokyo, Chicago, Amsterdam, and Seattle. Both end hosts were placed in Tokyo [22], and data was transmitted through the circuit. VLANs and routes were configured so that packets from one end host go to Amsterdam and return to the other end host. Data packets went in the direction starting from the JGN2 line, and ACKs went in the reverse direction. The detailed configuration is described on the I2-LSR submission page [17].

On the basis of our investigation using TAPEE, we replaced the NetIron 40G in the Seattle NOC, which we had used in previous LSR challenges, with a BigIron RX-4. The NetIron 40G had an issue in that its maximum transmission rate from a WAN PHY interface was less than the value in the 10GbE specification. We solved this issue by replacing the switch and by transferring data in one direction, so that data packets would not be transmitted from the WAN PHY interface of the NetIron 40G in Tokyo.

Precise analysis and tuning enabled our successful challenge. We used GSO to reduce the load, the zerocopy version of Iperf to reduce memory transactions, and the IPG control of the NICs to suppress bursty transmission and avoid packet losses at the LAN PHY–WAN PHY conversions. We also applied well-known tunings such as window-size tuning, CPU binding, and interrupt binding. The modification related to delayed ACKs was used only for analysis because it violates the RFC.

8. Related Work

TCP Vegas [6], well known as the first delay-based approach to congestion control, proposed Spike Suppression, which averages data transmission over the RTT to suppress bursty transmission. This concept is useful for avoiding packet losses at bottleneck points. In our research, we avoided packet losses by IPG control because the granularity of software-based pacing is insufficient.

Takano et al. designed and evaluated precise software pacing mechanisms for LFNs [23]. They also used IPGs to suppress the transmission rate, taking an approach called the Virtual Inter-Packet Gap. They developed a scheduler and modified the kernel to insert large PAUSE frames into the interface queue; these frames act as IPGs when the flow control of the switch is turned off. This approach achieves precise rate control; however, it wastes bus bandwidth and CPU cycles. NIC-based IPG control can replace the virtual IPG.

There has been research that analyzes the behavior of TCP using mathematical methods and models of computer systems. Such work gives us abstract knowledge for optimizing communications; however, it is difficult for these approaches to take into account the various effects present in real communications. An inductive approach based on real experiments is necessary for practical optimization.

One of the early I2-LSR challenges is the work by S. Ravot et al. [20]. They tuned TCP and obtained 6.5 Gbps with a single TCP stream between Los Angeles and Geneva.

9. Conclusion

We have discussed and categorized the issues that reduce the utilization of TCP-based end-to-end communications over 10GbE LFNs. We have shown that the situation for 10GbE is more complicated than that for GbE: solving one issue exposed another hidden issue. We have proposed methods to analyze the issues separately and have analyzed real TCP transfers. TAPEE is the main tool for detecting and analyzing these issues; it can obtain data with very fine time granularity at speeds up to the wire rate. From the above investigation, we have identified the following points: (1) CPU speed must be sufficient for memory-copy operations, (2) packet coalescing on the sender host is essential so that arriving ACKs are not dropped, (3) reducing interrupts on the receiver host slows the scaling down, (4) proper use of offloading (TSO/LSO and LRO) is effective, (5) congestion control algorithms must be implemented taking account of the large delayed ACKs produced by LRO, (6) IPG control is the most effective pacing method for suppressing bursty transfer, and (7) packet losses at the LAN PHY–WAN PHY conversion points can be avoided by pacing. Furthermore, we showed that coordinating the congestion window size with the length of the IPGs is more efficient than simple constant pacing. In the future, this coordination should be improved for smaller bottlenecks and multi-stream TCP.

In order to solve the issues of TCP over 10GbE LFNs, we proposed a modified TCP congestion control algorithm, a modified Iperf for checking the effect of application-level optimizations, software modifications to Linux such as the delayed-ACK-aware TCP, a zero-copy Iperf for reducing the memory-copy load, and a coordination mechanism between IPG control and the TCP stack. With all these optimizations, we obtained the I2-LSR, achieving 9.08 Gbps with TCP/IPv6 for 5 hours without any packet loss. This record was achieved through the orchestration of all the techniques developed in this research. We conclude that we have established a basis for efficient end-to-end communications over 10GbE LFNs.

Acknowledgment

We thank the following persons for their advice and support for the experiments: Akira Kato, Seiichi Yamamoto, and Hiroshi Tezuka of the University of Tokyo; Katsuyuki Hasebe and Hideaki Yoshifuji of the WIDE Project; Felix Marti and Wael Noureddine of Chelsio Communications; Yasuhiro Yoshida of Booz & Company, Inc.; Cees de Laat of the University of Amsterdam; Pieter de Boer of SARA; Jan Eveleth and Bill Mar of Pacific Northwest Gigapop; Linda Winkler of StarLight; Thomas Tam of CANARIE; Yoshitaka Hattori and Jin Tanaka of JGN2; and Ryutaro Kurusu, Masakazu Sakamoto, Yukichi Ikuta, and Takuya Kurihara of Fujitsu Computer Technologies Ltd. We want to acknowledge all of the organizations and their staff, who supported us and provided light-paths, equipment, and hosting spaces. This research is partially supported by the Special Coordination Fund for Promoting Science and Technology and a Grant-in-Aid for Fundamental Scientific Research from the Ministry of Education, Culture, Sports, Science and Technology, Japan.

References

[1] A. Kuznetsov et al., "Linux documentation/networking/ip-sysctls.txt."
[2] M. Allman, "RFC 3465 – TCP Congestion Control with Appropriate Byte Counting," Feb. 2003.
[3] M. Allman, V. Paxson, and W. R. Stevens, "RFC 2581 – TCP Congestion Control," Apr. 1999.
[4] "Anue Network Emulators," http://www.anuesystems.com/.
[5] "BIC TCP," http://www.csc.ncsu.edu/faculty/rhee/export/bitcp/.
[6] L. Brakmo, S. W. O'Malley, and L. Peterson, "TCP Vegas: New techniques for congestion detection and avoidance," in Proceedings of the SIGCOMM '94 Symposium, Aug. 1994, pp. 24–35.
[7] V. Cerf, "RFC 675 – Specification of Internet Transmission Control Program," Dec. 1974.
[8] D. D. Clark, "RFC 813 – Window and Acknowledgement Strategy in TCP," July 1982.
[9] Force10 Networks, "White Paper – OC192c/STM-64c and 10 Gigabit Ethernet WAN PHY."
[10] "IEEE 802.3ae-2002," http://www.ieee802.org/3/.
[11] J. Postel (ed.), "RFC 791 – Internet Protocol," Sept. 1981.
[12] ——, "RFC 793 – Transmission Control Protocol," Sept. 1981.
[13] V. Jacobson, B. Braden, and D. Borman, "RFC 1323 – TCP Extensions for High Performance," May 1992.
[14] H. Kamezawa, M. Nakamura, J. Tamatsukuri, N. Aoshima, M. Inaba, and K. Hiraki, "Inter-layer coordination for parallel TCP streams on long fat pipe networks," in SC '04: Proceedings of the 2004 ACM/IEEE Conference on Supercomputing. Washington, DC, USA: IEEE Computer Society, 2004, p. 24.
[15] A. Kleen, "Linux manpage of tcp."
[16] "Internet2 Land Speed Record," http://www.internet2.edu/lsr/.
[17] "IPv6 Internet2 Land Speed Record Submission 2006/12/31," http://data-reservoir.adm.s.u-tokyo.ac.jp/lsr-200612-02/.
[18] M. Nakamura, M. Inaba, and K. Hiraki, "Packet Spacing of TCP streams on very high latency Gigabit Ethernets," http://data-reservoir.adm.s.u-tokyo.ac.jp/paper/ia2003.pdf.
[19] NLANR/DAST, "Iperf 1.7.0 – The TCP/UDP bandwidth measurement tool," http://dast.nlanr.net/Projects/Iperf/.
[20] S. Ravot, Y. Xia, D. Nae, X. Su, H. Newman, and J. Bunn, "A practical approach to TCP high speed WAN data transfers," in Proceedings of PATHNets 2004 (First Workshop on Provisioning and Transport for Hybrid Networks), San Jose, Oct. 2004.
[21] Y. Sugawara, M. Inaba, and K. Hiraki, "Implementation and Evaluation of Fine-grain Packet Interval Control," in IPSJ Technical Report OS-100 (in Japanese). IPSJ, Aug. 2005, pp. 85–92.
[22] "T-LEX: Tokyo Lambda Exchange," http://www.t-lex.net/.
[23] R. Takano, T. Kudoh, Y. Kodama, M. Matsuda, H. Tezuka, and Y. Ishikawa, "Design and evaluation of precise software pacing mechanisms for fast long-distance networks," in Proceedings of PFLDnet 2005 (Third International Workshop on Protocols for Fast Long-Distance Networks), Feb. 2005.
[24] H. Xu, "GSO: Generic Segmentation Offload," http://marc.info/?l=linux-netdev&m=115079480721337&w=2.
[25] L. Xu, K. Harfoush, and I. Rhee, "Binary increase congestion control (BIC) for fast long-distance networks," in Proceedings of INFOCOM 2004 (The 23rd Conference of the IEEE Communications Society), 2004.
[26] T. Yoshino, J. Tamatsukuri, K. Inagami, Y. Sugawara, M. Inaba, and K. Hiraki, "Analysis of 10 Gigabit Ethernet using Hardware Engine for Performance Tuning on Long Fat-pipe Network," in Proceedings of PFLDnet 2007 (Fifth International Workshop on Protocols for Fast Long-Distance Networks), Feb. 2007, pp. 43–48.