Results on High Throughput and QoS Between the US and CERN
California Institute of Technology - CERN External Network Division
DataGrid WP7 - 24 January 2002
Agenda
• TCP performance over high latency/bandwidth network
• TCP behavior
• TCP limits
• TCP improvement
• Scavenger measurements over the transatlantic link
• Load balancing over the transatlantic links
TCP overview: Slow Start and Congestion Avoidance
[State diagram: Slow Start and Congestion Avoidance]
- Connection opening: cwnd = 1 segment
- Slow Start: exponential increase of cwnd, until cwnd = SSTHRESH
- When cwnd = SSTHRESH: switch to Congestion Avoidance (additive increase of cwnd)
- Retransmission timeout: SSTHRESH := cwnd/2, cwnd := 1 segment (back to Slow Start)

• Exponential increase of cwnd: for every useful acknowledgment received, cwnd := cwnd + (1 segment size)
• Additive increase of cwnd: for every useful acknowledgment received, cwnd := cwnd + (segment size) * (segment size) / cwnd; it takes a full window to increase the window size by one segment.
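The two increase rules above can be sketched in Python. This is a simplified model of the rules on this slide, not a real TCP stack; the function name `on_ack`, the byte-based cwnd, and the 64 KByte ssthresh used in the example are illustrative.

```python
MSS = 1460  # segment size in bytes

def on_ack(cwnd, ssthresh):
    """Grow cwnd on each useful ACK (simplified model of the rules above)."""
    if cwnd < ssthresh:
        return cwnd + MSS               # slow start: cwnd doubles every RTT
    return cwnd + MSS * MSS / cwnd      # congestion avoidance: ~ +1 MSS per RTT

# Example: ten useful ACKs in slow start, starting from one segment
cwnd = float(MSS)
for _ in range(10):
    cwnd = on_ack(cwnd, 64 * 1024)
print(cwnd)  # 11 segments' worth: 16060.0
```

Because slow start adds one full segment per ACK, and each segment sent triggers one ACK, the window roughly doubles each round trip; in congestion avoidance the per-ACK increase shrinks as cwnd grows, which yields about one segment per RTT.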
TCP overview: Fast Recovery
[State diagram: Slow Start, Congestion Avoidance and Fast Recovery]
- Connection opening: cwnd = 1 segment
- Slow Start: exponential increase of cwnd, until cwnd = SSTHRESH
- When cwnd = SSTHRESH: switch to Congestion Avoidance (additive increase of cwnd)
- Retransmission timeout: SSTHRESH := cwnd/2, cwnd := 1 segment (back to Slow Start)
- 3 duplicate ACKs received: SSTHRESH := cwnd/2, enter Fast Recovery (exponential increase beyond cwnd)
- Expected ACK received: cwnd := cwnd/2, back to Congestion Avoidance
- Retransmission timeout during Fast Recovery: back to Slow Start
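The two loss reactions in the diagram can be sketched the same way. This is a deliberately simplified model (real stacks such as Linux also inflate cwnd while in Fast Recovery); each function returns the new (cwnd, ssthresh) pair.

```python
MSS = 1460  # segment size in bytes

def on_triple_dup_ack(cwnd, ssthresh):
    """Three duplicate ACKs: enter Fast Recovery. In this sketch both the
    new SSTHRESH and the cwnd applied when the expected ACK finally
    arrives are cwnd/2, as in the diagram."""
    return cwnd / 2, cwnd / 2

def on_timeout(cwnd, ssthresh):
    """Retransmission timeout: SSTHRESH := cwnd/2, and the connection
    goes back to Slow Start with a cwnd of one segment."""
    return MSS, cwnd / 2
```

The key difference: a timeout collapses cwnd to one segment and restarts Slow Start, while Fast Recovery only halves it, which is far cheaper on a long-RTT path.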
TCP overview: Slow Start and Congestion Avoidance Example
[Plot: cwnd averaged over the last 10 samples, and cwnd averaged over the life of the connection to that point; the SSTHRESH level separates the slow start phase from congestion avoidance.]
Here is an estimation of the cwnd:
• Slow Start: fast increase of the cwnd
• Congestion Avoidance: slow increase of the window size
Tests configuration
[Topology: Pcgiga-gbe.cern.ch (Geneva) - GbEth - Cernh9 - US link (POS 155 Mbps) - Ar1-chicago - Lxusa-ge.cern.ch (Chicago, GbEth) - Calren2 / Abilene - Plato.cacr.caltech.edu (California)]
CERN <-> Caltech (California)
• RTT: 175 ms
• Bandwidth-delay product: 2.65 MBytes. It is difficult to evaluate the available bandwidth between CERN and Caltech. Using UDP flows, we transferred data at a rate of 120 Mbit/s without losing any packet (the test ran for 60 s); at higher rates we lost packets. From this simple measurement we deduced that the available bandwidth was about 120 Mbit/s. We repeated the measurement several times to check whether the network conditions were changing.
CERN <-> Chicago
• RTT: 110 ms
• Bandwidth-delay product: 1.9 MBytes.
TCP flows were generated with Iperf. Tcpdump was used to capture packet flows, and tcptrace and xplot were used to plot and summarize the tcpdump data set.
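The UDP probing method described above (find the highest send rate with zero loss) can be sketched as follows. Here `loss_at_rate` is a stand-in for a real 60 s iperf UDP run, and the simulated 120 Mbit/s bottleneck is purely illustrative.

```python
def max_lossless_rate(loss_at_rate, rates):
    """Highest probe rate (Mbit/s) at which no packets were lost.

    `loss_at_rate(r)` stands in for a 60 s UDP test at rate r
    (e.g. an iperf run); it returns the number of lost packets.
    """
    best = 0
    for r in sorted(rates):
        if loss_at_rate(r) == 0:
            best = r
    return best

# Simulated 120 Mbit/s bottleneck, for illustration only
link = lambda r: 0 if r <= 120 else r - 120
print(max_lossless_rate(link, range(10, 200, 10)))  # -> 120
```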
Influence of the initial SSTHRESH on TCP performance (1)
[Two plots, cwnd = f(time):
- SSTHRESH = 730 KByte, slow start then congestion avoidance: throughput of the connection = 33 Mbit/s
- SSTHRESH = 1460 KByte: throughput of the connection = 63 Mbit/s]
During congestion avoidance and without loss, the cwnd increases by one segment each RTT. In our case there is no loss, so the window increases by 1460 bytes every 175 ms. If the cwnd is equal to 730 KByte, it takes almost 4 minutes for the cwnd to become larger than the bandwidth-delay product (2.65 MByte). In other words, we have to wait almost 4 minutes to use the whole capacity of the link!
The two plots above show the influence of the initial SSTHRESH on performance.
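The "almost 4 minutes" figure can be checked with a small calculation, assuming (as above) one MSS of growth per RTT during congestion avoidance; the function name is illustrative.

```python
MSS = 1460      # bytes per segment
RTT = 0.175     # seconds, CERN <-> Caltech
BDP = 2.65e6    # bytes, estimated bandwidth-delay product

def time_to_fill(ssthresh_bytes):
    """Seconds of congestion avoidance (one MSS of growth per RTT) needed
    for the cwnd to grow from ssthresh to the bandwidth-delay product."""
    rtts = max(0.0, (BDP - ssthresh_bytes) / MSS)
    return rtts * RTT

print(time_to_fill(730e3) / 60)  # ~3.8 minutes: "almost 4 minutes"
```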
Influence of the initial SSTHRESH on TCP performance (2)
[Plot: Throughput (Mbit/s, 0-140) as a function of the initial slow start threshold (ssthresh, 0-4000 KByte); max. TCP window size = 8000 KByte]
• The Linux TCP implementation estimates the initial ssthresh from the cwnd size of the previous TCP connection. The initial ssthresh therefore depends on the environment of the previous connection (bandwidth, RTT, loss rate, ...), so this parameter is not optimally set when the environment changes.
• We modified the Linux kernel in order to be able to set the initial ssthresh, and measured the influence of this parameter on TCP performance between CERN and Caltech (2.65 MByte bandwidth-delay product, 175 ms RTT).
[Plot annotations: the bandwidth-delay product level, and the region where the cwnd is too large compared to the available bandwidth]
• By limiting the cwnd size to the value of the estimated bandwidth-delay product (2.65 MByte), we get 121 Mbit/s throughput. In this case, no loss occurs.
• When the cwnd can reach values larger than the bandwidth-delay product, losses may occur and performance decreases. In the plot shown here, the cwnd can grow beyond the bandwidth-delay product, and we only get 53 Mbit/s throughput.
[Plot: cwnd when packets are lost because the window size is too large. Annotations: 1) a packet is lost; 2) Fast Recovery (a temporary state to repair the loss); 3) back to Slow Start (Fast Recovery could not repair the loss: the lost packet is detected by timeout, so cwnd = 2 MSS); then a new loss occurs. Losses occur when the cwnd is larger than 3.5 MByte.]
Losses occur when the window size becomes larger than 3.5 MByte; the cwnd size then goes back to 1 MSS (maximum segment size), and performance suffers. Losses occur because the network does not have enough capacity to store all the packets in flight.
To get high throughput over a high delay/bandwidth network, we need to avoid losses. This simple example shows that if the window size is too large, some packets are dropped and performance decreases.
TCP Improvement
• Example:
Assuming the following parameters:
- no loss
- SSTHRESH = 65535 bytes
- RTT = 175 ms (RTT between CERN and Caltech without congestion)
- bandwidth = 120 Mbit/s
=> bandwidth-delay product = 2.65 MByte
We can easily estimate the time needed to increase the cwnd to a size larger than the bandwidth-delay-product. During congestion avoidance, cwnd is incremented by 1 full-sized segment per round-trip time (RTT). Therefore, to increase the congestion window size from 65535 bytes to 2.65 Mbytes, it takes more than 5 minutes!
• Idea :
Increase the speed at which the window size increases.
• We change the TCP algorithm :
During slow start, for every useful acknowledgement received, cwnd increases by N segments. N is called the slow start increment.
During congestion avoidance, for every useful acknowledgement received, cwnd increases by M * (segment size) * (segment size) / cwnd. This is equivalent to increasing cwnd by M segments each RTT. M is called the congestion avoidance increment.
Note: N=1 and M=1 in common TCP implementations.
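The modified increase rules, and the effect of the congestion avoidance increment M on the ramp from the example's 65535-byte SSTHRESH to the 2.65 MByte bandwidth-delay product, can be sketched as follows (a simplified model; the function names are illustrative):

```python
MSS = 1460   # bytes per segment
RTT = 0.175  # seconds, CERN <-> Caltech

def on_ack_modified(cwnd, ssthresh, N=1, M=1):
    """Modified increase rules: N segments per useful ACK during slow start,
    M segments per RTT during congestion avoidance (N = M = 1 is standard TCP)."""
    if cwnd < ssthresh:
        return cwnd + N * MSS
    return cwnd + M * MSS * MSS / cwnd

def ca_ramp_seconds(ssthresh, bdp=2.65e6, M=1):
    """Time for congestion avoidance to grow cwnd from ssthresh to the
    bandwidth-delay product, at M segments per RTT."""
    return (bdp - ssthresh) / (M * MSS) * RTT

print(ca_ramp_seconds(65535) / 60)        # more than 5 minutes with M = 1
print(ca_ramp_seconds(65535, M=10) / 60)  # ten times faster
```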
TCP tuning by modifying the slow start increment
[Four plots: congestion window (cwnd) as a function of time, showing the cwnd of the last 10 samples and the cwnd averaged over the life of the connection to that point. The slow start phases last 2.0 s, 1.2 s, 0.8 s and 0.65 s:
- Slow start increment = 1, throughput = 98 Mbit/s
- Slow start increment = 2, throughput = 113 Mbit/s
- Slow start increment = 3, throughput = 116 Mbit/s
- Slow start increment = 5, throughput = 119 Mbit/s]
Note that for each connection the SSTHRESH (slow start threshold) is equal to 4.12 MByte.
TCP tuning by modifying the congestion avoidance increment (1)
Congestion window (cwnd) as a function of time, SSTHRESH = 0.783 MByte:
- Congestion avoidance increment = 1, throughput = 37.5 Mbit/s: the cwnd increases by 1200 bytes in 27 s.
- Congestion avoidance increment = 10, throughput = 61.5 Mbit/s: the cwnd increases by 12000 bytes (10 * 1200) in 27 s.
=> A larger congestion avoidance increment improves performance.
TCP tuning by modifying the congestion avoidance increment (2)
• When SSTHRESH < bandwidth-delay product (blue, pink and yellow plots), the larger the congestion avoidance increment, the better the performance.
• When SSTHRESH > bandwidth-delay product (red plots), the cwnd is already larger than the bandwidth-delay product at the end of slow start, so the connection uses the whole available bandwidth (120 Mbit/s) from the beginning of congestion avoidance. The increment size does not influence the throughput.
[Plot: Throughput (Mbit/s, 0-140) as a function of the congestion avoidance increment (cwnd_inc, 0-60), for initial ssthresh = 292 KByte, 1460 KByte, 2190 KByte and 2628 KByte]
Benefit of a larger congestion avoidance increment when losses occur
We simulate losses using a program that drops packets according to a configured loss rate. For the next two plots, the program drops one packet every 10000 packets.
[Plot annotations: 1) a packet is lost; 2) Fast Recovery (temporary state until the loss is repaired); 3) cwnd := cwnd/2]
When a loss occurs, the cwnd is divided by two. The performance is then determined by the speed at which the cwnd increases after the loss: the higher the congestion avoidance increment, the better the performance.
Congestion window (cwnd) as a function of time - congestion avoidance increment = 1, throughput = 8 Mbit/s
Congestion window (cwnd) as a function of time - congestion avoidance increment = 10, throughput = 20 Mbit/s
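A crude per-RTT sawtooth model (not the real experiment; all parameters below are illustrative) reproduces the trend that a larger congestion avoidance increment yields a higher average throughput under periodic loss:

```python
MSS, RTT = 1460, 0.175  # bytes per segment; seconds (CERN <-> Caltech)

def avg_throughput_mbps(M, loss_every=10_000, rtts=2000, cwnd0=0.78e6):
    """Average throughput of a per-RTT sawtooth: one packet in `loss_every`
    is dropped, each loss halves cwnd, then cwnd grows by M segments per RTT."""
    cwnd, sent, next_loss, total = cwnd0, 0.0, float(loss_every), 0.0
    for _ in range(rtts):
        total += cwnd                  # bytes delivered this round trip
        sent += cwnd / MSS             # packets sent so far
        if sent >= next_loss:          # a drop was hit during this RTT
            cwnd /= 2
            next_loss += loss_every
        else:
            cwnd += M * MSS            # congestion avoidance growth
        cwnd = max(cwnd, float(MSS))   # never below one segment
    return total * 8 / (rtts * RTT) / 1e6

print(avg_throughput_mbps(M=1))
print(avg_throughput_mbps(M=10))  # larger increment -> higher average
```

Between losses, a larger M rebuilds the window faster, so the average cwnd (and hence the throughput) settles higher, consistent with the 8 Mbit/s vs 20 Mbit/s measurement above.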
TCP over high latency/bandwidth network: conclusion
To achieve high throughput over a high latency/bandwidth network, we need to:
• Set the initial slow start threshold (ssthresh) to a value appropriate for the delay and bandwidth of the link: it has to be larger than the bandwidth-delay product, but not too large.
• Avoid losses by limiting the maximum cwnd size.
• Recover quickly if a loss occurs:
• A larger cwnd increment makes the cwnd increase faster after a loss.
• A smaller window reduction after a loss (not studied in this presentation).
• ...
Scavenger Service
• Introduction: the QBone Scavenger Service (QBSS) is an additional best-effort class of service. A small amount of network capacity is allocated to this service; when the default best-effort capacity is underutilized, QBSS can expand to consume the unused capacity.
•Goal of our test :
• Does the Scavenger traffic affect performance of the normal best effort traffic?
• Does the Scavenger Service use the whole available bandwidth?
Tests configuration
[Topology: Pcgiga-gbe.cern.ch (Geneva) - GbEth - Cernh9 - US link (POS 155 Mbps) - Ar1-chicago - Lxusa-ge.cern.ch (Chicago, GbEth)]
CERN<->Chicago
•RTT : 116 ms
•Bandwidth-delay-product : 1.9 MBytes.
Cernh9 configuration:

class-map match-all qbss
  match ip dscp 8

policy-map qos-policy
  class qbss
    bandwidth percent 1
    queue-limit 64
    random-detect
  class class-default
    random-detect

interface ...
  service-policy output qos-policy

QBSS traffic is marked with DSCP 001000 (ToS field 0x20).
TCP and UDP flows were generated with Iperf. QBSS traffic is marked using the ToS option of iperf: iperf -c lxusa-ge -w 4M -p 5021 --tos 0x20
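For reference, the same marking that iperf applies with --tos 0x20 can be set on an ordinary socket through the standard IP_TOS socket option. A minimal sketch, assuming Linux; TCP sockets accept the same option.

```python
import socket

# Mark a socket's outgoing packets as QBSS traffic (DSCP 001000 => ToS
# byte 0x20), matching iperf's --tos 0x20 option.
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, 0x20)
assert s.getsockopt(socket.IPPROTO_IP, socket.IP_TOS) == 0x20
s.close()
```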
Scavenger and TCP traffic (1)
• We ran two connections at the same time. Packets of connection #2 were marked (scavenger traffic) and packets of connection #1 were not. We measured how the two connections shared the bandwidth.
[Plot: TCP scavenger traffic influence. Throughput (Mbps, 0-140) for tests #1-6, two series: TCP connection #1, packets not marked (normal traffic); TCP connection #2, packets marked (scavenger traffic)]
• TCP scavenger traffic does not affect normal TCP traffic. Packets of connection #2 are dropped by the scavenger service, so connection #2 reduces its rate before affecting connection #1. The throughput of connection #2 remains low because the loss rate of the scavenger traffic is high.
How does TCP Scavenger traffic use the available bandwidth?
• We performed the same tests without marking the packets and obtained a throughput larger than 120 Mbps.
• TCP scavenger traffic does not use the whole available bandwidth. Even when there is no congestion on the link, some packets are dropped by the router, probably because of the small queue reserved for scavenger traffic (queue-limit 64).
[Plot: how TCP scavenger traffic uses the available bandwidth. Throughput (Mbps, 0-140) for tests #1-10: TCP connection with marked packets (scavenger traffic), compared to the available bandwidth]
• We performed TCP scavenger transfers while the available bandwidth was larger than 120 Mbps, and measured the performance of the scavenger traffic.
Scavenger Conclusion
• TCP scavenger traffic does not affect normal traffic. TCP connections are very sensitive to loss: when congestion occurs, scavenger packets are dropped first, and the TCP scavenger source immediately reduces its rate. Therefore normal traffic is not affected.
• Scavenger traffic expands to consume unused bandwidth, but does not use the whole available bandwidth.
• Scavenger is a good solution for transferring data without affecting normal (best-effort) traffic, keeping in mind that it does not take advantage of the whole unused bandwidth.
• Future Work
• Our idea is to implement a monitoring tool based on scavenger traffic: we could generate UDP scavenger traffic, without affecting normal traffic, in order to measure the available bandwidth.
• Can we use the Scavenger Service to perform tests without affecting the production traffic?
• Does the Scavenger traffic behave as the normal traffic when no congestion occurs?
Load balancing over the transatlantic link
Load balancing optimizes resource use by distributing traffic over multiple paths to a destination. It can be configured on a per-destination or per-packet basis. On Cisco routers, there are two types of load balancing for CEF (Cisco Express Forwarding):
• Per-destination load balancing
• Per-packet load balancing
Per-destination load balancing allows the router to use multiple paths for load sharing. Packets for a given source-destination pair are guaranteed to take the same path, even if multiple paths are available.
Per-packet load balancing allows the router to send successive data packets over different paths without regard to individual hosts. It uses a round-robin method to determine which path each packet takes to the destination.
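The difference between the two modes can be illustrated with a toy path selector (the path names and functions are hypothetical; real CEF hashes on more fields than shown here):

```python
from itertools import count

PATHS = ["circuit_1", "circuit_2"]  # the two STM-1 circuits (illustrative names)

def per_destination(src, dst):
    """Per-destination: hash the source/destination pair, so a given
    pair always takes the same path."""
    return PATHS[hash((src, dst)) % len(PATHS)]

_rr = count()
def per_packet():
    """Per-packet: round robin, successive packets alternate between paths."""
    return PATHS[next(_rr) % len(PATHS)]

# A single host pair pins all of its packets to one circuit...
assert len({per_destination("pcgiga", "lxusa") for _ in range(10)}) == 1
# ...while per-packet balancing uses both (and may reorder the flow).
assert {per_packet() for _ in range(4)} == set(PATHS)
```

This makes the trade-off visible: per-destination preserves ordering but can pin one big flow to one link; per-packet spreads load evenly but interleaves one flow across paths of (possibly) different latencies.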
We tested the two types of load balancing between Chicago and CERN using our two STM-1 circuits.
[Topology: Pcgiga-gbe.cern.ch (Geneva) - GbEth - Cernh9 (Cisco 7507) - two parallel POS 155 Mbps circuits (#1 and #2) - Ar1-chicago - Lxusa-ge.cern.ch (Chicago, GbEth)]
CERN <-> Chicago
• RTT: 116 ms
• Bandwidth-delay product: 2 * 1.9 MBytes.
Configuration
Load balancing: Per-Destination vs Per-Packet
• MRTG report: traffic from Chicago to CERN
[Plots: traffic from Chicago to CERN on link #1 and on link #2, over three periods with load balancing type per-packet, per-destination, per-destination]
When the bulk of the data passing through the parallel links belongs to a single source/destination pair, per-destination load balancing overloads one link while the other carries very little traffic. Per-packet load balancing makes it possible to use alternate paths to the same destination.
Per-Packet Load Balancing and TCP performance
• UDP flow (CERN -> Chicago):
CERN:
[sravot@pcgiga sravot]$ iperf -c lxusa-ge -w 4M -b 20M -t 20
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-20.0 sec 50.1 MBytes 21.0 Mbits/sec
[ 3] Sent 35716 datagrams
Chicago:
[sravot@lxusa sravot]$ iperf -s -w 4M -u
[ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams
[ 3] 0.0-20.0 sec 50.1 MBytes 21.0 Mbits/sec 0.446 ms 0/35716 (0%)
[ 3] 0.0-20.0 sec 17795 datagrams received out-of-order
50% of the packets are received out of order.
• TCP flow (CERN -> Chicago):
[root@pcgiga sravot]# iperf -c lxusa-ge -w 5700k -t 30
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-30.3 sec 690 MBytes 191 Mbits/sec
Using tcptrace to plot and summarize the TCP flows captured by tcpdump, we measured that 99.8% of the acknowledgements were selective (SACK).
The performance is quite good even though packets arrive out of order: the SACK option is efficient. However, we were not able to get a throughput higher than 190 Mbit/s. It seems that receiving too many out-of-order packets limits TCP performance.
Load Balancing Conclusion
We decided to turn off per-packet load balancing because it was impacting the operational traffic: every packet flow going through the US link was reordered.
Per-packet load balancing is inappropriate for traffic that depends on packets arriving at the destination in sequence.