TCP Loss Sensitivity Analysis
Adam Krajewski, IT-CS-CE
TRANSCRIPT
ADAM KRAJEWSKI - TCP CONGESTION CONTROL
Original problem: IT-DB was backing up some data to the Wigner CC. A high transfer rate was required in order to avoid service degradation.
But the transfer rate was:
◦ Initial: 1G
◦ After some tweaking: 4G
The problem lies within TCP internals!
[Diagram: Meyrin to Wigner link, achieving 4G out of... 10G]
TCP introduction
TCP (Transmission Control Protocol) is a protocol providing reliable, connection-oriented transport over a network.
It operates at the Transport Layer (4th) of the OSI reference model.
Defined in RFC 793 from 1981. [10 Mb/s at that time!]
Its reliability means that it guarantees no data will be lost during communication.
It is connection-oriented because it creates and maintains a logical connection between sender and receiver.
Application
Presentation
Session
Transport
Network
Data Link
Physical
TCP operation basics: connection init
Connection initiation: 3-way handshake
Sequence counters setup:
◦ Client sends a random SEQ number (ISN, Initial Sequence Number)
◦ Server confirms it (ACK = SEQ + 1) and sends its own SEQ number
◦ Client confirms and syncs to the server's SEQ number
[Diagram: Client sends SYN, SEQ=1; Server replies SYN, ACK=2, SEQ=7; Client replies ACK=8]
TCP is reliable
TCP uses SEQ and ACK numbers to monitor which segments have been successfully transmitted.
Lost packets are easily detected and retransmitted. This provides reliability.
...but do we really need to wait for every ACK?
[Diagram: Client sends SEQ=1, Len=1000; Server replies ACK=1001; the segment SEQ=1001, Len=1000 is lost; after a timeout the client retransmits SEQ=1001, Len=1000]
TCP window (1)
Let us assume that we have a client and a server connected with a 1G Ethernet link with 1 ms one-way delay...
RTT = 2 ms, B = 1 Gbps
[Diagram: CLIENT sends DATA, SERVER replies ACK, 2 ms round trip]
If we send packets of 1460 bytes each and wait for the acknowledgment, then the effective transfer rate is 1460 B / 2 ms = 5.84 Mbps. So we end up using about 0.6% of the available bandwidth.
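The stop-and-wait arithmetic above can be checked with a few lines of Python (a minimal sketch, not from the slides):

```python
def stop_and_wait_rate(segment_bytes, rtt_s):
    """Throughput in bit/s when every segment must be ACKed
    before the next one is sent: one segment per RTT."""
    return segment_bytes * 8 / rtt_s

# 1460-byte payload, 2 ms RTT -> 5.84 Mbit/s on a 1 Gbit/s link
rate = stop_and_wait_rate(1460, 0.002)
print(f"{rate / 1e6:.2f} Mbit/s")
```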
TCP window (2)
In order to use the full bandwidth we need to keep at least B × RTT unacknowledged bytes in flight. This way we make sure we will not just sleep and wait for ACKs to come, but keep sending traffic in the meantime.
The factor B × RTT is referred to as the Bandwidth-Delay Product.
[Diagram: CLIENT sends segments 1 through 5 back to back; ACK-1 through ACK-5 arrive 2 ms later, while the client keeps sending]
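The bandwidth-delay product is a one-line computation; a sketch (not from the slides) for the two links discussed here:

```python
def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product: how many bytes must be in flight
    (unacknowledged) to keep the pipe full."""
    return bandwidth_bps * rtt_s / 8

print(bdp_bytes(1e9, 0.002))   # 1 Gbit/s, 2 ms RTT  -> 250 kB in flight
print(bdp_bytes(10e9, 0.025))  # 10 Gbit/s, 25 ms RTT -> 31.25 MB in flight
```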
TCP window (3)
The client sends a certain amount of segments one after another. As soon as the first ACK is received, it assumes transmission is fine and continues to send segments.
Basically, at each time t0 there are W unacknowledged bytes on the wire.
W is referred to as the TCP Window Size.
W != const
[Diagram: client sends SEQ=1, 1461, 2921, 4381, 5841, 7301, 8761 (Len=1460 each); ACK=1461 and ACK=2921 arrive while W bytes remain in flight]
Window growth
[Figure: window W vs time t, starting at Wdef, growing linearly to W1, then dropping to 0.5 W1 on a DROP event]
It starts with a default value: wnd ← Wdef
Then, it grows by 1 every RTT: wnd ← wnd + 1
But if a packet loss is detected, the window is cut in half: wnd ← wnd/2
This is an AIMD algorithm (Additive Increase Multiplicative Decrease).
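The AIMD rule can be illustrated with a toy window trace (an added sketch, not from the slides; real TCP counts the window in bytes and has a separate slow-start phase, ignored here):

```python
def aimd_trace(rtts, loss_rtts, w0=10):
    """Toy AIMD trace: window size (in segments) after each RTT tick.
    Grows by 1 per RTT, is halved on the RTTs listed in loss_rtts."""
    w, trace = w0, []
    for t in range(rtts):
        if t in loss_rtts:
            w = max(w // 2, 1)  # multiplicative decrease on loss
        else:
            w += 1              # additive increase otherwise
        trace.append(w)
    return trace

print(aimd_trace(10, {5}))  # sawtooth: grows, halves at t=5, regrows
```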
Congestion control
Network congestion appears when there is more traffic than the switching devices can handle.
TCP prevents congestion!
◦ It starts with a small window, thus a low transfer rate
◦ Then the window grows linearly, thus increasing the transfer rate
◦ TCP assumes that congestion is happening when there are packet losses. So in case of a packet loss it cuts the window size in half, thus reducing the transfer rate.
[Figure: bandwidth B [Gbps] vs time t [s], starting at B0, growing to Bc, then dropping to 0.5 Bc on loss]
Long Fat Network problem
Long Fat Networks (LFNs):
◦ high bandwidth-delay product (B × RTT)
◦ Example: CERN link to Wigner.
To get full bandwidth on a 10G link with 25 ms RTT we need a window:
W = 10 Gbit/s × 25 ms / 8 ≈ 31.25 MB
If a drop occurs at peak throughput, the window is halved and, growing back by only one segment per RTT, takes thousands of RTTs to recover...
TCP algorithms (1)
The default TCP algorithm (Reno) behaves poorly.
Other congestion control algorithms have been developed, such as:
◦ BIC
◦ CUBIC
◦ Scalable
◦ Compound TCP
Each of them behaves differently and we want to see how they perform in our case...
Testbed
Two servers connected back-to-back.
How does the congestion control algorithm performance change with packet loss and delay?
Tested with iperf3.
Packet loss and delay emulated using NetEm in Linux:
◦ tc qdisc add dev eth2 root netem loss 0.1 delay 12.5ms
0 ms delay, default settings
[Chart: bandwidth [Gbps] (0 to 10) vs packet loss rate [%] (0.0001 to 1, log scale) for cubic, reno, bic, scalable]
25 ms delay, default settings
[Chart: bandwidth [Gbps] (0 to 1.2) vs packet loss rate [%] (0.0001 to 1, log scale) for cubic, reno, bic, scalable]
PROBLEM
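The steep fall-off of Reno-style throughput with loss rate can be estimated with the Mathis throughput model (an added illustration, not part of the slides): BW ≤ (MSS/RTT) · C/√p, with C ≈ √(3/2).

```python
from math import sqrt

def mathis_throughput(mss_bytes, rtt_s, loss_rate):
    """Upper bound on steady-state Reno throughput in bit/s
    (Mathis et al. approximation): (MSS/RTT) * C / sqrt(p)."""
    c = sqrt(3.0 / 2.0)
    return (mss_bytes * 8 / rtt_s) * c / sqrt(loss_rate)

# 1460-byte MSS, 25 ms RTT, 0.0001% loss: well under 1 Gbit/s
# on a 10 Gbit/s link, consistent with the PROBLEM region above
bw = mathis_throughput(1460, 0.025, 1e-6)
print(f"{bw / 1e9:.3f} Gbit/s")
```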
Growing the window (1)
We need...
Default settings:
◦ net.core.rmem_max = 124K
◦ net.core.wmem_max = 124K
◦ net.ipv4.tcp_rmem = 4K 87K 4M
◦ net.ipv4.tcp_wmem = 4K 16K 4M
◦ net.core.netdev_max_backlog = 1K
WARNING: The real OS values are plain byte counts, so less readable.
The window can't grow bigger than the maximum TCP send and receive buffers!
Growing the window (2)
We tune TCP settings on both server and client.
Tuned settings:
◦ net.core.rmem_max = 67M
◦ net.core.wmem_max = 67M
◦ net.ipv4.tcp_rmem = 4K 87K 67M
◦ net.ipv4.tcp_wmem = 4K 65K 67M
◦ net.core.netdev_max_backlog = 30K
WARNING: TCP buffer size doesn't translate directly to window size, because TCP uses some portion of its buffers for operational data structures. So the effective window size will be smaller than the maximum buffer size set.
Helpful link: https://fasterdata.es.net/host-tuning/linux/
Now it's fine
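A quick way to check whether a Linux host's limits cover the ~31 MB window this path needs (a sketch; `read_sysctl` is a hypothetical helper that reads /proc/sys directly, Linux only):

```python
import os

def read_sysctl(name):
    """Read one sysctl value from /proc/sys (Linux only);
    returns the whitespace-separated fields as strings."""
    with open("/proc/sys/" + name.replace(".", "/")) as f:
        return f.read().split()

if os.path.exists("/proc/sys/net/core/rmem_max"):
    # compare the max receive buffer to the 10G / 25 ms BDP (31.25 MB)
    rmem_max = int(read_sysctl("net.core.rmem_max")[0])
    print(rmem_max, rmem_max >= 10e9 * 0.025 / 8)
```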
25 ms delay, tuned settings
[Chart: bandwidth [Gbps] (0 to 10) vs packet loss rate [%] (0.0001 to 1, log scale) for cubic, reno, scalable, bic]
0 ms delay, tuned settings
[Chart: bandwidth [Gbps] (0 to 10) vs packet loss rate [%] (0.0001 to 1, log scale) for cubic, scalable, reno, bic; the DEFAULT CUBIC curve is highlighted]
Bandwidth fairness
Goal: Each TCP flow should get a fair share of the available bandwidth.
[Diagram: two 10G senders share one 10G link, each getting 5G; plot of B vs t shows both flows converging to 5G]
Bandwidth competition
How do congestion control algorithms compete?
Tests were done for:
◦ Reno, BIC, CUBIC, Scalable
◦ 0 ms and 25 ms delay
Two cases:
◦ 2 flows starting at the same time
◦ 1 flow delayed by 30 s
[Diagram: two 10G senders compete for one congested 10G link]
cubic vs bic
[Chart: bandwidth [Gbps] (0 to 10) vs packet loss rate [%] (log scale) for two competing flows (cubic vs bic), 0 ms delay]
cubic vs scalable
[Chart: bandwidth [Gbps] (0 to 12) vs packet loss rate [%] (log scale) for two competing flows (cubic vs scalable), 0 ms delay]
cubic vs reno
[Chart: bandwidth [Gbps] (0 to 10) vs packet loss rate [%] (log scale) for two competing flows (cubic vs reno), 25 ms delay]
Why default cubic? (1)
◦ BIC was the default Linux TCP congestion algorithm until kernel version 2.6.19
◦ It was changed to CUBIC in kernel 2.6.19 with the commit message: "Change default congestion control used from BIC to the newer CUBIC which it the successor to BIC but has better properties over long delay links."
◦ Why?
Why default cubic? (2)
◦ It is more friendly to other congestion algorithms
◦ to Reno...
◦ to CTCP (Windows default)...
Why default cubic? (3)
It has better RTT fairness properties.
[Diagram: flows from Meyrin (0 ms RTT) and Wigner (25 ms RTT) share the link]
Why default cubic? (4)
RTT fairness test:
[Charts: competing flows at 0 ms RTT vs 0 ms RTT, and at 25 ms RTT vs 25 ms RTT]
The real problem
[Diagram: Meyrin and Wigner connected by 100G links via ToR switches and LAGs]
Packet losses on:
• Links (dirty fibers, ...)
• NICs (hardware limitations, bugs, ...)
Difficult to troubleshoot!
Lots of network paths!
Conclusions
To achieve a high transfer rate to Wigner, collaboration is essential!
User part:
• Tune TCP settings! Follow https://fasterdata.es.net/host-tuning/
• Make sure you use at least CUBIC for Linux and CTCP for Windows
• Avoid BIC: better performance on lossy links, but unfair to other congestion algorithms and unfair to itself (RTT fairness)
• Submit a ticket if you observe low network performance for your data transfer
Networking team part:
• Minimizing packet loss throughout the network
• Fiber cleaning
Setting TCP algorithms (1)
OS-wide setting (sysctl ...):
◦ net.ipv4.tcp_congestion_control = reno
◦ net.ipv4.tcp_available_congestion_control = cubic reno bic
Local setting:
◦ setsockopt(), TCP_CONGESTION optname
◦ net.ipv4.tcp_allowed_congestion_control = cubic reno
Add a new algorithm: # modprobe tcp_bic
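The per-socket setsockopt() route can be sketched in Python (an added example; Linux only, and the chosen algorithm must be loaded and, for unprivileged processes, listed in net.ipv4.tcp_allowed_congestion_control):

```python
import socket

def set_congestion_control(sock, algo="cubic"):
    """Try to select a TCP congestion control algorithm for one socket.
    Returns the algorithm actually in effect, or None if the platform
    or kernel refuses (non-Linux, module not loaded, not allowed)."""
    if not hasattr(socket, "TCP_CONGESTION"):
        return None  # constant only exists on Linux builds of Python
    try:
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION,
                        algo.encode())
        raw = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16)
        return raw.split(b"\x00", 1)[0].decode()  # kernel NUL-pads the name
    except OSError:
        return None

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print(set_congestion_control(s, "cubic"))
s.close()
```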
Setting TCP algorithms (2)
OS-wide settings (Windows):
C:\> netsh interface tcp show global
◦ Add-On Congestion Control Provider: ctcp
Enabled by default in Windows Server 2008 and newer.
Needs to be enabled manually on client systems:
◦ Windows 7: C:\> netsh interface tcp set global congestionprovider=ctcp
◦ Windows 8/8.1 [PowerShell]: Set-NetTCPSetting -CongestionProvider ctcp