Journey to the Center of the Linux Kernel
TRANSCRIPT
7/24/2019 Journey to the Center of the Linux Kernel
Journey to the Center of the Linux Kernel:
Traffic Control, Shaping and QoS
Julien Vehent (see revisions)
1 Introduction
This document describes the Traffic Control subsystem of the Linux kernel in depth, algorithm by algorithm, and shows how it can be used to manage the outgoing traffic of a Linux system. Throughout the chapters, we will discuss both the theory behind Traffic Control and its usage, and demonstrate how one can gain complete control over the packets passing through a system.
a QoS graph
The initial target of this paper was to gain better control over a small DSL uplink, and it grew over time to cover a lot more than that. 99% of the information provided here can be applied to any type of server, as well as routers, firewalls, etc.
The Traffic Control topic is large and in constant evolution, as is the Linux kernel. The real credit goes to the developers behind the /net directory of the kernel, and to all the researchers who created and improved all of these algorithms. This is merely an attempt to document some of this work for the masses. Any participation and comments are welcome, in particular if you spotted an inconsistency somewhere. Please email julien(at)linuxwall.info; your messages are always most appreciated.
For the technical discussion, since the LARTC mailing list doesn't exist anymore, try these two:
Netfilter users mailing list, for general discussions
Netdev mailing list, where the magic happens (developers ML)
2 Motivation
This article was initially published in the French issue of GNU/Linux Magazine France #127, in May 2010. GLMF is kind enough to provide a contract that releases the content of the article under Creative Commons after some time. I extended the initial article
In the Internet world, everything is packets. Managing a network means managing packets: how they are generated, routed, transmitted, reordered, fragmented, etc. Traffic Control works on packets leaving the system. It doesn't, initially, have as an objective to manipulate packets entering the system (although you could do that, if you really want to slow down the rate at which you receive packets). The Traffic Control code operates between the IP layer and the hardware driver that transmits data on the network. We are discussing a portion of code that works on the lower layers of the network stack of the kernel. In fact, the Traffic Control code is the very one in charge of constantly furnishing packets to send to the device driver.
It means that the TC module, the packet scheduler, is permanently activated in the kernel. Even when you do not explicitly want to use it, it's there, scheduling packets for transmission. By default, this scheduler maintains a basic queue (similar to a FIFO type
Netfilter can be used to interact directly with the structure representing a packet in the kernel. This structure, the sk_buff, contains a field called __u32 nfmark that we are going to modify. TC will then read that value to select the destination class of a packet.
The following iptables rule will apply the mark 80 to outgoing packets (OUTPUT chain) sent by the web server (TCP source port is 80).
# iptables -t mangle -A OUTPUT -o eth0 -p tcp --sport 80 -j MARK --set-mark 80
We can control the application of this rule via the netfilter statistics:
# iptables -L OUTPUT -t mangle
Chain OUTPUT (policy ACCEPT 74107 packets, 109M bytes)
 pkts bytes target prot opt in  out  source   destination
 7896  109M MARK   tcp  --  any eth0 anywhere anywhere    tcp spt:www MARK set 0x50/0xffffffff
You probably noticed that the rule is located in the mangle table. We will come back to that a little bit later.
"'" To cla$$e$ in a tree
To manipulate TC policies, we need the /sbin/tc binary from the iproute package (aptitude install iproute).
The iproute package must match your kernel version. Your distribution's package manager will normally take care of that.
We are going to create a tree that represents our scheduling policy, and that uses the HTB scheduler. This tree will contain two classes: one for the marked traffic (TCP sport 80), and one for everything else.
# tc qdisc add dev eth0 root handle 1: htb default 20
# tc class add dev eth0 parent 1:0 classid 1:10 htb rate 200kbit ceil 200kbit prio 1 mtu 1500
# tc class add dev eth0 parent 1:0 classid 1:20 htb rate 824kbit ceil 1024kbit prio 2 mtu 1500
The two classes are attached to the root. Each class has a guaranteed bandwidth (rate value) and an opportunistic bandwidth (ceil value). If the totality of the bandwidth is not used, a class will be allowed to increase its flow rate up to the ceil value. Otherwise, the rate value is applied. It means that the sum of the rate values must correspond to the total bandwidth available.
In the previous example, we consider the total upload bandwidth to be 1024kbits/s, so class 10 (web server) gets 200kbits/s and class 20 (everything else) gets 824kbits/s.
TC can use both the kbit and kbps notations, but they don't have the same meaning. kbit is the rate in kilo-bits per second, and kbps is in kilo-bytes per second. In this article, I will use the kbit notation only.
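Since mixing up the two units silently changes a policy by a factor of eight, a quick sanity check helps. The sketch below uses hypothetical helper names (they are not part of tc) to make the difference explicit:

```python
# tc's "kbit" is kilo-bits per second; "kbps" is kilo-BYTES per second.
# Hypothetical converters (illustration only, not tc functions).

def kbit_to_bits_per_sec(v):
    return v * 1000          # kilo-bits/s -> bits/s

def kbps_to_bits_per_sec(v):
    return v * 1000 * 8      # kilo-bytes/s -> bits/s

# The same number means an 8x difference depending on the unit:
print(kbit_to_bits_per_sec(200))   # 200000
print(kbps_to_bits_per_sec(200))   # 1600000
```

Writing "200kbps" where "200kbit" was intended would thus allocate eight times the intended bandwidth.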
"'- Connecting the .ar&$ to the tree
We now have a traffic shaping policy on one side, and packet marking on the other side. To connect the two, we need a filter.
A filter is a rule that identifies packets (handle parameter) and directs them to a class (fw flowid parameter). Since several filters can work in parallel, they can also have a priority. A filter must be attached to the root of the QoS policy; otherwise, it won't be applied.
# tc filter add dev eth0 parent 1:0 protocol ip prio 1 handle 80 fw flowid 1:10
We can test the policy using a simple client/server setup. Netcat is very useful for such testing. Start a listening process on the server that applies the policy using:
# nc -l -p 80 < /dev/zero
"nd connect to it from another machine using -
# nc 192.168.1.1 80 > /dev/null
The server process will send zeros (taken from /dev/zero) as fast as it can, and the client will receive them and throw them away, as fast as it can.
Using iptraf to monitor the connection, we can supervise the bandwidth usage (bottom right corner).
The value is 199.20kbits/s, which is close enough to the 200kbits/s target. The precision of the scheduler depends on a few parameters that we will discuss later on.
"ny other connection from the server that uses a source port different from TC&4?: will have aflow rate between ?8Nkbits4s and 7:8Nkbits4s 0depending on the presence of other connections in
parallel2.
4 Twenty Thousand Leagues Under the Code
Now that we have enjoyed this first contact, it is time to go back to the fundamentals of the Quality of Service of Linux. The goal of this chapter is to dive into the algorithms that compose the traffic control subsystem. Later on, we will use that knowledge to build our own policy.
The code of TC is located in the net/sched directory of the sources of the kernel. The kernel separates the flows entering the system (ingress) from the flows leaving it (egress). And, as we said earlier, it is the responsibility of the TC module to manage the egress path.
The illustration below shows the path of a packet inside the kernel, where it enters (ingress) and where it leaves (egress). If we focus on the egress path, a packet arrives from layer 4 (TCP, UDP, …) and then enters the IP layer (not represented here). The Netfilter chains OUTPUT and POSTROUTING are integrated in the IP layer and are located between the IP manipulation functions (header creation, fragmentation, …). At the exit of the NAT table of the POSTROUTING chain, the packet is transmitted to the egress
This command means attach a root
Example Internet Datagram Header, from RFC 791
This algorithm is defined in net/sched/sch_generic.c and represented in the diagram below.
(dia source)
The length of a band, representing the number of packets it can contain, is set to 1000 by default and defined outside of TC. It's a parameter that can be set using ifconfig, and visualized in /sys:
# cat /sys/class/net/eth0/tx_queue_len
1000
Once the default value of 1000 is passed, TC will start dropping packets. This should very rarely happen, because TCP makes sure to adapt its sending speed to the capacity of both systems participating in the communication (that's the role of the TCP slow start). But experiments showed that increasing that limit to 10,000, or even 100,000, in some very specific cases of gigabit networks, can improve performance. I wouldn't recommend touching this value unless you really know what you are doing. Increasing a buffer size to a too-large value can have very negative side effects on the
SFQ_DEFAULT_HASH_DIVISOR gives the number of buckets, and defaults to 1024.
SFQ_DEPTH defines the depth of each bucket, and defaults to 128 packets.
#define SFQ_DEPTH 128 /* max number of packets per flow */
#define SFQ_DEFAULT_HASH_DIVISOR 1024
These two values determine the maximum number of packets that can be
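As a rough sketch of what these two constants imply, the theoretical upper bound on packets held by the discipline follows directly from them (in practice the configured qdisc limit is far smaller):

```python
# Constants quoted from net/sched/sch_sfq.c in the text above.
SFQ_DEPTH = 128                    # max number of packets per flow (bucket)
SFQ_DEFAULT_HASH_DIVISOR = 1024    # number of hash buckets

# Upper bound if every bucket were completely full at once.
max_packets = SFQ_DEPTH * SFQ_DEFAULT_HASH_DIVISOR
print(max_packets)  # 131072
```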
destination IP: 175.112.129.215
destination port: 2146
/* IP source address in hexadecimal */
h1 = 7efffef0

/* IP Destination address in hexadecimal */
h2 = af7081d7

/* 06 is the protocol number for TCP (bits 72 to 80 of the IP header).
   We perform a XOR between the variable h2 obtained in the previous step
   and the TCP protocol number */
h2 = h2 XOR 06

/* If the IP packet is not fragmented, we include the TCP ports in the hash.
   1f900862 is the hexadecimal representation of the source and destination ports.
   We perform another XOR with this value and the h2 variable */
h2 = h2 XOR 1f900862

/* And finally, we use the Jenkins algorithm with some additional golden numbers.
   This jhash function is defined somewhere else in the kernel source code */
h = jhash(h1, h2, perturbation)
The result obtained is a 32-bit hash value that will be used by SFQ to select the destination bucket of the packet. Because the perturb value is regenerated every 10 seconds, the packets from a reasonably long connection will be directed to different buckets over time.
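The bucket-selection idea can be sketched in a few lines. This is a toy model: it uses a simple FNV-1a hash as a stand-in for the kernel's jhash, and the flow values are the ones from the example above, so treat the details as assumptions rather than kernel code.

```python
# Toy model of SFQ bucket selection: hash the flow identifiers together with
# a 'perturb' value that the kernel regenerates every 10 seconds.

BUCKETS = 1024  # SFQ_DEFAULT_HASH_DIVISOR

def fnv1a(data: bytes) -> int:
    """Simple 32-bit FNV-1a hash, standing in for the kernel's jhash."""
    h = 0x811c9dc5
    for b in data:
        h = ((h ^ b) * 0x01000193) & 0xffffffff
    return h

def sfq_bucket(src_ip, dst_ip, proto, ports, perturb):
    key = f"{src_ip}|{dst_ip}|{proto}|{ports:08x}|{perturb}".encode()
    return fnv1a(key) % BUCKETS

# The same flow lands in the same bucket while perturb is constant...
b1 = sfq_bucket("126.255.254.240", "175.112.129.215", 6, 0x1f900862, perturb=42)
b2 = sfq_bucket("126.255.254.240", "175.112.129.215", 6, 0x1f900862, perturb=42)
assert b1 == b2
# ...but will usually move to a different bucket when perturb changes.
b3 = sfq_bucket("126.255.254.240", "175.112.129.215", 6, 0x1f900862, perturb=43)
print(b1, b3)
```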
@ut this also means that !*B might break the se
SFQ scheduler that works with 10 buckets only and considers the IP addresses of the packets in the hash.
This discipline is classless as well, which means we cannot direct packets to another scheduler when they leave SFQ. Packets are transmitted to the network interface only.
4.2 Classful Disciplines
4.2.1 TBF - Token Bucket Filter
Until now, we looked at algorithms that do not allow control of the amount of bandwidth. SFQ and PFIFO_FAST give the ability to smooth the traffic, and even to prioritize it a bit, but not to control its throughput.
In fact, the main problem when controlling the bandwidth is to find an efficient accounting method. Because counting in memory is extremely difficult and costly to do in real time, computer scientists took a different approach here.
Instead of counting the packets (or the bits transmitted by the packets, which is the same thing), the Token Bucket Filter algorithm sends, at a regular interval, a token into a bucket. This is disconnected from the actual packet transmission, but when a packet enters the scheduler, it will consume a certain number of tokens. If there are not enough tokens for it to be transmitted, the packet waits.
Until now, with SFQ and PFIFO_FAST, we were talking about packets, but with TBF we now have to look into the bits contained in the packets. Let's take an example: a packet carrying 8000 bits (1KB) wishes to be transmitted. It enters the TBF scheduler and TBF checks the content of its bucket: if there are 8000 tokens in the bucket, TBF destroys them and the packet can pass. Otherwise, the packet waits until the bucket has enough tokens.
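The waiting logic can be sketched with a tiny simulation. This is a simplified model (one token per bit, a single queue, no peakrate), not the kernel implementation:

```python
# Toy token bucket: tokens replenish at 'rate' tokens per second (one token
# per bit here), up to a maximum of 'burst'. A packet of N bits passes when
# N tokens are available; otherwise it waits for the bucket to refill.

class TokenBucket:
    def __init__(self, rate_bps, burst_bits):
        self.rate = rate_bps
        self.burst = burst_bits
        self.tokens = burst_bits   # the bucket starts full
        self.clock = 0.0           # simulated time in seconds

    def send(self, packet_bits):
        """Return the simulated time at which the packet may be transmitted."""
        if self.tokens < packet_bits:
            # Wait exactly long enough for the missing tokens to arrive.
            self.clock += (packet_bits - self.tokens) / self.rate
            self.tokens = packet_bits
        self.tokens -= packet_bits
        return self.clock

tbf = TokenBucket(rate_bps=200_000, burst_bits=8000)  # 200kbit/s, 1KB burst
t1 = tbf.send(8000)  # fits in the initial burst, sent immediately
t2 = tbf.send(8000)  # must wait for 8000 fresh tokens: 8000/200000 = 0.04s
print(t1, t2)
```

Note how the second packet is delayed by exactly the time needed to accumulate its 8000 tokens at the configured rate; this is the mechanism that turns token accounting into a bandwidth limit.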
The fre
So with a very large burst value, say 1,000,000 tokens, we would let a maximum of 83 fully loaded packets (roughly 124KBytes if they all carry their maximum MTU) traverse the scheduler without applying any sort of limit to them.
To overcome this problem, and provide better control over bursts, TBF implements a second bucket, smaller and generally the same size as the MTU. This second bucket cannot store large amounts of tokens, but its replenishing rate will be a lot faster than that of the big bucket. This second rate is called peakrate, and it will determine the maximum speed of a burst.
Let's take a step back and look at those parameters again. We have:
peakrate > rate : the second bucket fills up faster than the main one, to allow and control bursts. If the peakrate value is infinite, then TBF behaves as if the second bucket didn't exist. Packets would be de
just 53 bytes each. And from those 53 bytes, only 48 are from the original packet; the rest is occupied by the ATM headers.
So where is the problem? Consider the following network topology.
The QoS box is in charge of performing the packet scheduling before transmitting it to the modem. The packets are then split by the modem into ATM cells. So our initial 1.5KB ethernet packet is split into 32 ATM cells, for a total size of 32 × 5 bytes of headers per cell + 1500 bytes of data = (32×5)+1500 = 1660 bytes. 1660 bytes is 10.6% bigger than 1500. When ATM is used, we lose 10% of bandwidth compared to an ethernet network (this is an estimate that depends on the average packet size, etc.).
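The overhead figure can be checked directly. This follows the text's simplified model (48 payload bytes per 53-byte cell, counting only the 5 header bytes per cell; real AAL5 encapsulation also adds a trailer and padding, which makes the loss slightly worse):

```python
# ATM overhead for one 1500-byte Ethernet packet, using the simplified
# model from the text: payload travels in 48-byte chunks, each carried
# inside a 53-byte cell (5 bytes of ATM header per cell).
import math

payload = 1500
cells = math.ceil(payload / 48)   # number of ATM cells needed
on_wire = payload + cells * 5     # payload plus per-cell header bytes
overhead = (on_wire - payload) / payload
print(cells, on_wire)             # 32 1660
print(round(overhead * 100, 1))   # about 10.7% extra bytes on the wire
```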
If TBF doesn't know about that, and calculates its rate based on the sole knowledge of the ethernet MTU, then it will transmit 10% more packets than the modem can transmit. The modem will start
TBF gives a pretty accurate control over the bandwidth assigned to a
quantum is similar to the
For very small or very large bandwidths, it is important to tune r2q properly. If r2q is too large, too many packets will leave a
But in most cases, this optimization is simply deactivated, as shown below:
# cat /sys/module/sch_htb/parameters/htb_hysteresis
0
-'2'" Co0el
http://
Home networks are tricky to shape, because everybody wants the priority and it's difficult to predetermine a usage pattern. In this chapter, we will build a TC policy that answers general needs. Those are:
Low latency. The uplink is only 1.5Mbps and the latency shouldn't be more than 30ms under high load. We can tune the buffers in the
echo "#---ssh - id 300 - rate 160 kbit ceil 1120 kbit"
/sbin/tc class add dev eth0 parent 1:1 classid 1:300 htb rate 160kbit ceil 1120kbit burst 15k prio 3

# SFQ will mix the packets if there are several
# SSH connections in parallel
# and ensure that none has the priority
echo "#--- sub ssh sfq"
/sbin/tc qdisc add dev eth0 parent 1:300 handle 1300: sfq perturb 10 limit 32

echo "#--- ssh filter"
/sbin/tc filter add dev eth0 parent 1:0 protocol ip prio 3 handle 300 fw flowid 1:300

echo "#--- netfilter rule - SSH at 300"
/sbin/iptables -t mangle -A POSTROUTING -o eth0 -p tcp --tcp-flags SYN SYN --dport 22 -j CONNMARK --set-mark 300
The first rule is the definition of the HTB class, the leaf. It connects back to its parent 1:1, defines a rate of 160kbit/s and can use up to 1120kbit/s by borrowing the difference from other leaves. The burst value is set to 15k, which is 10 full packets with an MTU of 1500 bytes.
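The burst arithmetic is worth checking when sizing a leaf, since a burst smaller than one MTU can stall a class. A quick check (assuming here that "15k" means 15000 bytes; tc actually parses the "k" suffix as 1024, which gives the same packet count):

```python
# How many full-MTU packets fit in the configured HTB burst?
burst_bytes = 15 * 1000   # "15k" read as 15000 bytes (assumption; tc's
                          # own parser uses 15 * 1024 = 15360, same result)
mtu = 1500
print(burst_bytes // mtu)  # 10
```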
The second rule defines an SFQ
Let us now load the script on our gateway, and visualise the result:
SETTING UP eth0 TRAFIC CONTROL RULES FOR ramiel
#-cleanup
RTNETLINK answers: No such file or directory
#-define a HTB root qdisc
#--uplink - rate 1600 kbit ceil 1600 kbit
#---interactive - id 100 - rate 160 kbit ceil 1600 kbit
#--- sub interactive pfifo
#--- interactive filter
#--- netfilter rule - all UDP traffic at 100
#---tcp acks - id 200 - rate 320 kbit ceil 1600 kbit
#--- sub tcp acks pfifo
#--- filtre tcp acks
#--- netfilter rule for TCP ACKs will be loaded at the end
#---ssh - id 300 - rate 160 kbit ceil 1120 kbit
#--- sub ssh sfq
#--- ssh filter
#--- netfilter rule - SSH at 300
#---http branch - id 400 - rate 800 kbit ceil 1600 kbit
#--- sub http branch sfq
#--- http branch filter
#--- netfilter rule - http/s
#---default - id 999 - rate 160kbit ceil 1600kbit
#--- sub default sfq
#--- filtre default
#--- propagating marks on connections
#--- Mark TCP ACKs flags at 200
Traffic Control is up and running
# /etc/network/if-up.d/lnw_gateway_tc.sh show
---- qdiscs details -----
qdisc htb 1: root refcnt 2 r2q 40 default 999 direct_packets_stat 0 ver 3.17
qdisc pfifo 1100: parent 1:100 limit 10p
qdisc pfifo 1200: parent 1:200 limit 10p
qdisc sfq 1300: parent 1:300 limit 32p quantum 1514b flows 2/1024 perturb 10sec
qdisc sfq 1400: parent 1:400 limit 32p quantum 1514b flows 2/1024 perturb 10sec
qdisc sfq 1999: parent 1:999 limit 32p quantum 1514b flows 2/1024 perturb 10sec

---- qdiscs statistics --
qdisc htb 1: root refcnt 2 r2q 40 default 999 direct_packets_stat 0
 Sent 16776950 bytes 12521 pkt (dropped 481, overlimits 28190 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc pfifo 1100: parent 1:100 limit 10p
 Sent 180664 bytes 1985 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc pfifo 1200: parent 1:200 limit 10p
 Sent 5607402 bytes 100899 pkt (dropped 481, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc sfq 1300: parent 1:300 limit 32p quantum 1514b perturb 10sec
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc sfq 1400: parent 1:400 limit 32p quantum 1514b perturb 10sec
 Sent 9790497 bytes 15682 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc sfq 1999: parent 1:999 limit 32p quantum 1514b perturb 10sec
 Sent 119887 bytes 6755 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
Those are just two of the types of output tc can generate. You might find the class statistics helpful to diagnose the consumption of the leaves:
# tc -s class show dev eth0
[...truncated...]
class htb 1:400 parent 1:1 leaf 1400: prio 4 rate 800000bit ceil 1600Kbit burst 0Kb cburst 1600b
 Sent 1029005 bytes 16426 pkt (dropped 0, overlimits 0 requeues 0)
 rate 2624bit 5pps backlog 0b 0p requeues 0
 lended: 16424 borrowed: 2 giants: 0
 tokens: 4791520 ctokens: 120625
"bove is shown the detailled statistics for the HTT& leaf, and you can see the accumulated rate,statistics of packets per seconds, but also the tokens accumulated, lended, borrowed, etc% this is
the most helpful output to diagnose your policy in depth.
5 A word about "Bufferbloat"
We mentioned that too-large buffers can have a negative impact on the performance of a connection. But how bad is it exactly?
The answer to that
latenc "spee,* meets o+r nee,s9 More o4 /hat o+ ,onWtnee, is +seless9H+44erbloat ,estros the spee, /e reall nee,9
More information on Gettys's page, and in this paper from 1996: It's the Latency, Stupid.
Long story short: if you have bad latency but large bandwidth, you will be able to transfer very large files efficiently, but a simple DNS
-n is the number of buffers of 4096 bytes given to the socket.
# nttcp -t -D -n2048000 192.168.1.220
"nd at the same time, on the laptop, launch a ping of the desktop.
64 bytes from 192.168.1.220: icmp_req=1 ttl=64 time=0.300 ms
64 bytes from 192.168.1.220: icmp_req=2 ttl=64 time=0.26 ms
64 bytes from 192.168.1.220: icmp_req=3 ttl=64 time=19.2 ms
64 bytes from 192.168.1.220: icmp_req=4 ttl=64 time=19.2 ms
64 bytes from 192.168.1.220: icmp_req=5 ttl=64 time=19.2 ms
64 bytes from 192.168.1.220: icmp_req=6 ttl=64 time=19.2 ms
64 bytes from 192.168.1.220: icmp_req=7 ttl=64 time=19.3 ms
64 bytes from 192.168.1.220: icmp_req=8 ttl=64 time=19.0 ms
64 bytes from 192.168.1.220: icmp_req=9 ttl=64 time=0.281 ms
64 bytes from 192.168.1.220: icmp_req=10 ttl=64 time=0.62 ms
The first two pings are launched before nttcp starts. When nttcp starts, the latency increases, but this is still acceptable.
Now, reduce the speed of each network card on the desktop and the laptop to 100Mbps. The command is:
# ethtool -s eth0 speed 100 duplex full
# ethtool eth0
collisions:0 txqueuelen:1000
# ethtool -g eth0
Ring parameters for eth0:
[...]
Current hardware settings:
[...]
TX: 511
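To see why these buffers matter for latency, consider the worst-case queuing delay they can add. This is a back-of-the-envelope estimate, not a measurement: a packet entering a full queue waits for everything ahead of it to be serialized onto the wire.

```python
# Worst-case drain time of the transmit queues at a given line rate.

def queue_delay_ms(packets, packet_bytes, rate_bps):
    """Time to serialize 'packets' packets of 'packet_bytes' at 'rate_bps'."""
    return packets * packet_bytes * 8 / rate_bps * 1000

# txqueuelen 1000 plus a 511-descriptor TX ring, full of 1500-byte packets:
print(round(queue_delay_ms(1000 + 511, 1500, 100_000_000), 1))    # 181.3 (ms at 100Mbps)
print(round(queue_delay_ms(1000 + 511, 1500, 1_000_000_000), 1))  # 18.1 (ms at 1Gbps)
```

The second figure is in the same ballpark as the ~19ms pings observed above at gigabit speed; dropping the link to 100Mbps multiplies the worst-case delay by ten, which is exactly the bufferbloat effect this chapter demonstrates.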
We start by changing the tx
@ut while the TC& stack was filling up the T> buffers, all the other packets that our system
wanted to send got either stuck somewhere in the