Results on High Throughput and QoS Between the US and CERN
California Institute of Technology - CERN External Network Division
DataGrid WP7 - 24 January 2002
Agenda
• TCP performance over high latency/bandwidth network
• TCP behavior
• TCP limits
• TCP improvement
• Scavenger measurements over the transatlantic link
• Load balancing over the transatlantic links
TCP overview: Slow Start and Congestion Avoidance
[State diagram: Slow Start and Congestion Avoidance]
- Connection opening: cwnd = 1 segment
- Slow Start: exponential increase of cwnd, until cwnd = SSTHRESH
- When cwnd = SSTHRESH: switch to Congestion Avoidance (additive increase of cwnd)
- Retransmission timeout: SSTHRESH := cwnd/2, cwnd := 1 segment (back to Slow Start)

• Exponential increase of cwnd: for every useful acknowledgment received, cwnd := cwnd + (1 segment size)
• Additive increase of cwnd: for every useful acknowledgment received, cwnd := cwnd + (segment size) * (segment size) / cwnd; it takes a full window to increase the window size by one segment.
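The two increase rules above can be sketched in Python. This is a simplified model of the rules on this slide, not a real TCP stack; the function name `on_ack`, the byte-based cwnd, and the 64 KByte ssthresh used in the example are illustrative.

```python
MSS = 1460  # segment size in bytes

def on_ack(cwnd, ssthresh):
    """Grow cwnd on each useful ACK (simplified model of the rules above)."""
    if cwnd < ssthresh:
        return cwnd + MSS               # slow start: cwnd doubles every RTT
    return cwnd + MSS * MSS / cwnd      # congestion avoidance: ~ +1 MSS per RTT

# Example: ten useful ACKs in slow start, starting from one segment
cwnd = float(MSS)
for _ in range(10):
    cwnd = on_ack(cwnd, 64 * 1024)
print(cwnd)  # 11 segments' worth: 16060.0
```

Because slow start adds one full segment per ACK, and each segment sent triggers one ACK, the window roughly doubles each round trip; in congestion avoidance the per-ACK increase shrinks as cwnd grows, which yields about one segment per RTT.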
TCP overview: Fast Recovery
[State diagram: Slow Start, Congestion Avoidance and Fast Recovery]
- Connection opening: cwnd = 1 segment
- Slow Start: exponential increase of cwnd, until cwnd = SSTHRESH
- When cwnd = SSTHRESH: switch to Congestion Avoidance (additive increase of cwnd)
- Retransmission timeout: SSTHRESH := cwnd/2, cwnd := 1 segment (back to Slow Start)
- 3 duplicate ACKs received: SSTHRESH := cwnd/2, enter Fast Recovery (exponential increase beyond cwnd)
- Expected ACK received: cwnd := cwnd/2, back to Congestion Avoidance
- Retransmission timeout during Fast Recovery: back to Slow Start
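The two loss reactions in the diagram can be sketched the same way. This is a deliberately simplified model (real stacks such as Linux also inflate cwnd while in Fast Recovery); each function returns the new (cwnd, ssthresh) pair.

```python
MSS = 1460  # segment size in bytes

def on_triple_dup_ack(cwnd, ssthresh):
    """Three duplicate ACKs: enter Fast Recovery. In this sketch both the
    new SSTHRESH and the cwnd applied when the expected ACK finally
    arrives are cwnd/2, as in the diagram."""
    return cwnd / 2, cwnd / 2

def on_timeout(cwnd, ssthresh):
    """Retransmission timeout: SSTHRESH := cwnd/2, and the connection
    goes back to Slow Start with a cwnd of one segment."""
    return MSS, cwnd / 2
```

The key difference: a timeout collapses cwnd to one segment and restarts Slow Start, while Fast Recovery only halves it, which is far cheaper on a long-RTT path.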
TCP overview: Slow Start and Congestion Avoidance Example
[Plot: cwnd averaged over the last 10 samples, and cwnd averaged over the life of the connection to that point; the SSTHRESH level separates the slow start phase from congestion avoidance.]
Here is an estimation of the cwnd:
• Slow Start: fast increase of the cwnd
• Congestion Avoidance: slow increase of the window size
Tests configuration
[Topology: Pcgiga-gbe.cern.ch (Geneva) - GbEth - Cernh9 - US link (POS 155 Mbps) - Ar1-chicago - Lxusa-ge.cern.ch (Chicago, GbEth) - Calren2 / Abilene - Plato.cacr.caltech.edu (California)]
CERN <-> Caltech (California)
• RTT: 175 ms
• Bandwidth-delay product: 2.65 MBytes. It is difficult to evaluate the available bandwidth between CERN and Caltech. Using UDP flows, we transferred data at a rate of 120 Mbit/s without losing any packet (the test ran for 60 s); at higher rates we lost packets. From this simple measurement we deduced that the available bandwidth was about 120 Mbit/s. We repeated the measurement several times to check whether the network conditions were changing.
CERN <-> Chicago
• RTT: 110 ms
• Bandwidth-delay product: 1.9 MBytes.
TCP flows were generated with Iperf. Tcpdump was used to capture packet flows, and tcptrace and xplot were used to plot and summarize the tcpdump data set.
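The UDP probing method described above (find the highest send rate with zero loss) can be sketched as follows. Here `loss_at_rate` is a stand-in for a real 60 s iperf UDP run, and the simulated 120 Mbit/s bottleneck is purely illustrative.

```python
def max_lossless_rate(loss_at_rate, rates):
    """Highest probe rate (Mbit/s) at which no packets were lost.

    `loss_at_rate(r)` stands in for a 60 s UDP test at rate r
    (e.g. an iperf run); it returns the number of lost packets.
    """
    best = 0
    for r in sorted(rates):
        if loss_at_rate(r) == 0:
            best = r
    return best

# Simulated 120 Mbit/s bottleneck, for illustration only
link = lambda r: 0 if r <= 120 else r - 120
print(max_lossless_rate(link, range(10, 200, 10)))  # -> 120
```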
Influence of the initial SSTHRESH on TCP performance (1)
[Two plots, cwnd = f(time):
- SSTHRESH = 730 KByte, slow start then congestion avoidance: throughput of the connection = 33 Mbit/s
- SSTHRESH = 1460 KByte: throughput of the connection = 63 Mbit/s]
During congestion avoidance and without loss, the cwnd increases by one segment each RTT. In our case there is no loss, so the window increases by 1460 bytes every 175 ms. If the cwnd is equal to 730 KByte, it takes almost 4 minutes for the cwnd to become larger than the bandwidth-delay product (2.65 MByte). In other words, we have to wait almost 4 minutes to use the whole capacity of the link!
The two plots above show the influence of the initial SSTHRESH on performance.
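The "almost 4 minutes" figure can be checked with a small calculation, assuming (as above) one MSS of growth per RTT during congestion avoidance; the function name is illustrative.

```python
MSS = 1460      # bytes per segment
RTT = 0.175     # seconds, CERN <-> Caltech
BDP = 2.65e6    # bytes, estimated bandwidth-delay product

def time_to_fill(ssthresh_bytes):
    """Seconds of congestion avoidance (one MSS of growth per RTT) needed
    for the cwnd to grow from ssthresh to the bandwidth-delay product."""
    rtts = max(0.0, (BDP - ssthresh_bytes) / MSS)
    return rtts * RTT

print(time_to_fill(730e3) / 60)  # ~3.8 minutes: "almost 4 minutes"
```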
Influence of the initial SSTHRESH on TCP performance (2)
[Plot: Throughput (Mbit/s, 0-140) as a function of the initial slow start threshold (ssthresh, 0-4000 KByte); max. TCP window size = 8000 KByte]
• The Linux TCP implementation estimates the initial ssthresh from the cwnd size of the previous TCP connection. The initial ssthresh therefore depends on the environment of the previous connection (bandwidth, RTT, loss rate, ...), so this parameter is not optimally set when the environment changes.
• We modified the Linux kernel in order to be able to set the initial ssthresh, and measured the influence of this parameter on TCP performance between CERN and Caltech (2.65 MByte bandwidth-delay product, 175 ms RTT).
[Plot annotations: the bandwidth-delay product level, and the region where the cwnd is too large compared to the available bandwidth]
• By limiting the cwnd size to the value of the estimated bandwidth-delay product (2.65 MByte), we get 121 Mbit/s throughput. In this case, no loss occurs.
• When the cwnd can reach values larger than the bandwidth-delay product, losses may occur and performance decreases. In the plot shown here, the cwnd can grow beyond the bandwidth-delay product, and we only get 53 Mbit/s throughput.
[Plot: cwnd when packets are lost because the window size is too large. Annotations: 1) a packet is lost; 2) Fast Recovery (a temporary state to repair the loss); 3) back to Slow Start (Fast Recovery could not repair the loss: the lost packet is detected by timeout, so cwnd = 2 MSS); then a new loss occurs. Losses occur when the cwnd is larger than 3.5 MByte.]
Losses occur when the window size becomes larger than 3.5 MByte; the cwnd size then goes back to 1 MSS (maximum segment size), and performance suffers. Losses occur because the network does not have enough capacity to store all the packets in flight.
To get high throughput over a high delay/bandwidth network, we need to avoid losses. This simple example shows that if the window size is too large, some packets are dropped and performance decreases.
TCP Improvement
• Example:
Assuming the following parameters:
- no loss
- SSTHRESH = 65535 bytes
- RTT = 175 ms (RTT between CERN and Caltech without congestion)
- bandwidth = 120 Mbit/s
=> bandwidth-delay product = 2.65 MByte
We can easily estimate the time needed to increase the cwnd to a size larger than the bandwidth-delay-product. During congestion avoidance, cwnd is incremented by 1 full-sized segment per round-trip time (RTT). Therefore, to increase the congestion window size from 65535 bytes to 2.65 Mbytes, it takes more than 5 minutes!
• Idea :
Increase the speed at which the window size increases.
• We change the TCP algorithm :
During slow start, for every useful acknowledgement received, cwnd increases by N segments. N is called the slow start increment.
During congestion avoidance, for every useful acknowledgement received, cwnd increases by M * (segment size) * (segment size) / cwnd. This is equivalent to increasing cwnd by M segments each RTT. M is called the congestion avoidance increment.
Note: N=1 and M=1 in common TCP implementations.
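The modified increase rules, and the effect of the congestion avoidance increment M on the ramp from the example's 65535-byte SSTHRESH to the 2.65 MByte bandwidth-delay product, can be sketched as follows (a simplified model; the function names are illustrative):

```python
MSS = 1460   # bytes per segment
RTT = 0.175  # seconds, CERN <-> Caltech

def on_ack_modified(cwnd, ssthresh, N=1, M=1):
    """Modified increase rules: N segments per useful ACK during slow start,
    M segments per RTT during congestion avoidance (N = M = 1 is standard TCP)."""
    if cwnd < ssthresh:
        return cwnd + N * MSS
    return cwnd + M * MSS * MSS / cwnd

def ca_ramp_seconds(ssthresh, bdp=2.65e6, M=1):
    """Time for congestion avoidance to grow cwnd from ssthresh to the
    bandwidth-delay product, at M segments per RTT."""
    return (bdp - ssthresh) / (M * MSS) * RTT

print(ca_ramp_seconds(65535) / 60)        # more than 5 minutes with M = 1
print(ca_ramp_seconds(65535, M=10) / 60)  # ten times faster
```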
TCP tuning by modifying the slow start increment
[Four plots: congestion window (cwnd) as a function of time, showing the cwnd of the last 10 samples and the cwnd averaged over the life of the connection to that point. The slow start phases last 2.0 s, 1.2 s, 0.8 s and 0.65 s:
- Slow start increment = 1, throughput = 98 Mbit/s
- Slow start increment = 2, throughput = 113 Mbit/s
- Slow start increment = 3, throughput = 116 Mbit/s
- Slow start increment = 5, throughput = 119 Mbit/s]
Note that for each connection the SSTHRESH (slow start threshold) is equal to 4.12 MByte.
TCP tuning by modifying the congestion avoidance increment (1)
Congestion window (cwnd) as a function of time, SSTHRESH = 0.783 MByte:
- Congestion avoidance increment = 1, throughput = 37.5 Mbit/s: the cwnd increases by 1200 bytes in 27 s.
- Congestion avoidance increment = 10, throughput = 61.5 Mbit/s: the cwnd increases by 12000 bytes (10 * 1200) in 27 s.
=> A larger congestion avoidance increment improves performance.
TCP tuning by modifying the congestion avoidance increment (2)
• When SSTHRESH < bandwidth-delay product (blue, pink and yellow plots), the larger the congestion avoidance increment, the better the performance.
• When SSTHRESH > bandwidth-delay product (red plots), the cwnd is already larger than the bandwidth-delay product at the end of slow start, so the connection uses the whole available bandwidth (120 Mbit/s) from the beginning of congestion avoidance. The increment size does not influence the throughput.
[Plot: Throughput (Mbit/s, 0-140) as a function of the congestion avoidance increment (cwnd_inc, 0-60), for initial ssthresh = 292 KByte, 1460 KByte, 2190 KByte and 2628 KByte]
Benefit of a larger congestion avoidance increment when losses occur
We simulate losses using a program that drops packets according to a configured loss rate. For the next two plots, the program drops one packet every 10000 packets.
[Plot annotations: 1) a packet is lost; 2) Fast Recovery (temporary state until the loss is repaired); 3) cwnd := cwnd/2]
When a loss occurs, the cwnd is divided by two. The performance is then determined by the speed at which the cwnd increases after the loss: the higher the congestion avoidance increment, the better the performance.
Congestion window (cwnd) as a function of time - congestion avoidance increment = 1, throughput = 8 Mbit/s
Congestion window (cwnd) as a function of time - congestion avoidance increment = 10, throughput = 20 Mbit/s
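A crude per-RTT sawtooth model (not the real experiment; all parameters below are illustrative) reproduces the trend that a larger congestion avoidance increment yields a higher average throughput under periodic loss:

```python
MSS, RTT = 1460, 0.175  # bytes per segment; seconds (CERN <-> Caltech)

def avg_throughput_mbps(M, loss_every=10_000, rtts=2000, cwnd0=0.78e6):
    """Average throughput of a per-RTT sawtooth: one packet in `loss_every`
    is dropped, each loss halves cwnd, then cwnd grows by M segments per RTT."""
    cwnd, sent, next_loss, total = cwnd0, 0.0, float(loss_every), 0.0
    for _ in range(rtts):
        total += cwnd                  # bytes delivered this round trip
        sent += cwnd / MSS             # packets sent so far
        if sent >= next_loss:          # a drop was hit during this RTT
            cwnd /= 2
            next_loss += loss_every
        else:
            cwnd += M * MSS            # congestion avoidance growth
        cwnd = max(cwnd, float(MSS))   # never below one segment
    return total * 8 / (rtts * RTT) / 1e6

print(avg_throughput_mbps(M=1))
print(avg_throughput_mbps(M=10))  # larger increment -> higher average
```

Between losses, a larger M rebuilds the window faster, so the average cwnd (and hence the throughput) settles higher, consistent with the 8 Mbit/s vs 20 Mbit/s measurement above.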
TCP over high latency/bandwidth network: conclusion
To achieve high throughput over a high latency/bandwidth network, we need to:
• Set the initial slow start threshold (ssthresh) to a value appropriate for the delay and bandwidth of the link: it has to be larger than the bandwidth-delay product, but not too large.
• Avoid losses by limiting the maximum cwnd size.
• Recover quickly if a loss occurs:
• A larger cwnd increment makes the cwnd increase faster after a loss.
• A smaller window reduction after a loss (not studied in this presentation).
• ...
Scavenger Service
• Introduction: the QBone Scavenger Service (QBSS) is an additional best-effort class of service. A small amount of network capacity is allocated to this service; when the default best-effort capacity is underutilized, QBSS can expand to consume the unused capacity.
•Goal of our test :
• Does the Scavenger traffic affect performance of the normal best effort traffic?
• Does the Scavenger Service use the whole available bandwidth?
Tests configuration
[Topology: Pcgiga-gbe.cern.ch (Geneva) - GbEth - Cernh9 - US link (POS 155 Mbps) - Ar1-chicago - Lxusa-ge.cern.ch (Chicago, GbEth)]
CERN<->Chicago
•RTT : 116 ms
•Bandwidth-delay-product : 1.9 MBytes.
Cernh9 configuration:

class-map match-all qbss
  match ip dscp 8

policy-map qos-policy
  class qbss
    bandwidth percent 1
    queue-limit 64
    random-detect
  class class-default
    random-detect

interface ...
  service-policy output qos-policy

QBSS traffic is marked with DSCP 001000 (ToS field 0x20).
TCP and UDP flows were generated with Iperf. QBSS traffic is marked using the ToS option of iperf: iperf -c lxusa-ge -w 4M -p 5021 --tos 0x20
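For reference, the same marking that iperf applies with --tos 0x20 can be set on an ordinary socket through the standard IP_TOS socket option. A minimal sketch, assuming Linux; TCP sockets accept the same option.

```python
import socket

# Mark a socket's outgoing packets as QBSS traffic (DSCP 001000 => ToS
# byte 0x20), matching iperf's --tos 0x20 option.
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, 0x20)
assert s.getsockopt(socket.IPPROTO_IP, socket.IP_TOS) == 0x20
s.close()
```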
Scavenger and TCP traffic (1)
• We ran two connections at the same time. Packets of connection #2 were marked (scavenger traffic) and packets of connection #1 were not. We measured how the two connections shared the bandwidth.
[Plot: TCP scavenger traffic influence. Throughput (Mbps, 0-140) for tests #1-6, two series: TCP connection #1, packets not marked (normal traffic); TCP connection #2, packets marked (scavenger traffic)]
• TCP scavenger traffic does not affect normal TCP traffic. Packets of connection #2 are dropped by the scavenger service, so connection #2 reduces its rate before affecting connection #1. The throughput of connection #2 remains low because the loss rate of the scavenger traffic is high.
How does TCP Scavenger traffic use the available bandwidth?
• We performed the same tests without marking the packets and obtained a throughput larger than 120 Mbps.
• TCP scavenger traffic does not use the whole available bandwidth. Even when there is no congestion on the link, some packets are dropped by the router, probably because of the small queue reserved for scavenger traffic (queue-limit 64).
[Plot: how TCP scavenger traffic uses the available bandwidth. Throughput (Mbps, 0-140) for tests #1-10: TCP connection with marked packets (scavenger traffic), compared to the available bandwidth]
• We performed TCP scavenger transfers while the available bandwidth was larger than 120 Mbps, and measured the performance of the scavenger traffic.
Scavenger Conclusion
• TCP scavenger traffic does not affect normal traffic. TCP connections are very sensitive to loss: when congestion occurs, scavenger packets are dropped first, and the TCP scavenger source immediately reduces its rate. Therefore normal traffic is not affected.
• Scavenger traffic expands to consume unused bandwidth, but does not use the whole available bandwidth.
• Scavenger is a good solution for transferring data without affecting normal (best-effort) traffic, keeping in mind that it does not take advantage of the whole unused bandwidth.
• Future Work
• Our idea is to implement a monitoring tool based on scavenger traffic: we could generate UDP scavenger traffic, without affecting normal traffic, in order to measure the available bandwidth.
• Can we use the Scavenger Service to perform tests without affecting the production traffic?
• Does the Scavenger traffic behave as the normal traffic when no congestion occurs?
Load balancing over the transatlantic link
Load balancing optimizes resource use by distributing traffic over multiple paths to a destination. It can be configured on a per-destination or per-packet basis. On Cisco routers, there are two types of load balancing for CEF (Cisco Express Forwarding):
• Per-destination load balancing
• Per-packet load balancing
Per-destination load balancing allows the router to use multiple paths for load sharing. Packets for a given source-destination pair are guaranteed to take the same path, even if multiple paths are available.
Per-packet load balancing allows the router to send successive data packets over different paths without regard to individual hosts. It uses a round-robin method to determine which path each packet takes to the destination.
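The difference between the two modes can be illustrated with a toy path selector (the path names and functions are hypothetical; real CEF hashes on more fields than shown here):

```python
from itertools import count

PATHS = ["circuit_1", "circuit_2"]  # the two STM-1 circuits (illustrative names)

def per_destination(src, dst):
    """Per-destination: hash the source/destination pair, so a given
    pair always takes the same path."""
    return PATHS[hash((src, dst)) % len(PATHS)]

_rr = count()
def per_packet():
    """Per-packet: round robin, successive packets alternate between paths."""
    return PATHS[next(_rr) % len(PATHS)]

# A single host pair pins all of its packets to one circuit...
assert len({per_destination("pcgiga", "lxusa") for _ in range(10)}) == 1
# ...while per-packet balancing uses both (and may reorder the flow).
assert {per_packet() for _ in range(4)} == set(PATHS)
```

This makes the trade-off visible: per-destination preserves ordering but can pin one big flow to one link; per-packet spreads load evenly but interleaves one flow across paths of (possibly) different latencies.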
We tested the two types of load balancing between Chicago and CERN using our two STM-1 circuits.
[Topology: Pcgiga-gbe.cern.ch (Geneva) - GbEth - Cernh9 (Cisco 7507) - two parallel POS 155 Mbps circuits (#1 and #2) - Ar1-chicago - Lxusa-ge.cern.ch (Chicago, GbEth)]
CERN <-> Chicago
• RTT: 116 ms
• Bandwidth-delay product: 2 * 1.9 MBytes.
Configuration
Load balancing: Per-Destination vs Per-Packet
• MRTG report: traffic from Chicago to CERN
[Plots: traffic from Chicago to CERN on link #1 and on link #2, over three periods with load balancing type per-packet, per-destination, per-destination]
When the bulk of the data passing through the parallel links belongs to a single source/destination pair, per-destination load balancing overloads one link while the other carries very little traffic. Per-packet load balancing makes it possible to use alternate paths to the same destination.
Per-Packet Load Balancing and TCP performance
• UDP flow (CERN -> Chicago):
CERN:
[sravot@pcgiga sravot]$ iperf -c lxusa-ge -w 4M -b 20M -t 20
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-20.0 sec 50.1 MBytes 21.0 Mbits/sec
[ 3] Sent 35716 datagrams
Chicago:
[sravot@lxusa sravot]$ iperf -s -w 4M -u
[ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams
[ 3] 0.0-20.0 sec 50.1 MBytes 21.0 Mbits/sec 0.446 ms 0/35716 (0%)
[ 3] 0.0-20.0 sec 17795 datagrams received out-of-order
50% of the packets are received out of order.
• TCP flow (CERN -> Chicago):
[root@pcgiga sravot]# iperf -c lxusa-ge -w 5700k -t 30
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-30.3 sec 690 MBytes 191 Mbits/sec
Using tcptrace to plot and summarize the TCP flows captured by tcpdump, we measured that 99.8% of the acknowledgements were selective (SACK).
The performance is quite good even though packets arrive out of order: the SACK option is efficient. However, we were not able to get a throughput higher than 190 Mbit/s. It seems that receiving too many out-of-order packets limits TCP performance.
Load Balancing Conclusion
We decided to turn off per-packet load balancing because it was impacting the operational traffic: every packet flow going through the US link was reordered.
Per-packet load balancing is inappropriate for traffic that depends on packets arriving at the destination in sequence.