Velocity 2013: TCP Tuning for the Web
TRANSCRIPT
Me
• Co-founder and Operations at Fastly
• Former Operations Engineer at Wikia
• Lots of Sysadmin and Linux consulting
Wednesday, June 19, 13
The Goal
• Make the best use of our limited resources to deliver the best user experience
Focus
Linux
• I like it
• I use it
• It won’t hurt my feelings if you don’t
• Examples will be aimed primarily at Linux
Small Requests
• Optimizing towards small objects like html and js/css
Just Enough TCP
• Not a deep dive into TCP
The accept() loop
Entry point from the kernel to your application
• Client sends SYN
• Kernel replies with SYN/ACK
• Client sends ACK
• Kernel queues the established connection
• Server's accept() call pulls it from the queue
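These queues can be watched from userspace; a minimal diagnostic sketch, assuming iproute2's `ss` is available (the port number is purely illustrative):

```shell
# For listening TCP sockets, Recv-Q is the number of connections
# currently waiting in the accept queue and Send-Q is the backlog limit.
ss -lnt

# Watch a specific listener for queue growth (port 8080 is an example):
watch -n1 'ss -lnt sport = :8080'
```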
Backlog
• Maximum number of half-open connections (SYN received, handshake not yet complete)
• The kernel drops new SYNs once this limit is hit
• Clients wait 3s before retrying, then 9s more after a second failure
Backlog Tuning (kernel side)
• net.ipv4.tcp_max_syn_backlog and net.core.somaxconn
• Default value of 1024 is for “systems with more than 128MB of memory”
• 64 bytes per entry
• Undocumented max of 65535
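A hedged sketch of raising both limits (the values are illustrative, not recommendations; note that somaxconn also clamps whatever backlog an application passes to listen()):

```shell
# Raise the half-open (SYN) queue and the accept-queue cap.
sysctl -w net.ipv4.tcp_max_syn_backlog=16384
sysctl -w net.core.somaxconn=16384

# Persist across reboots:
cat >> /etc/sysctl.conf <<'EOF'
net.ipv4.tcp_max_syn_backlog = 16384
net.core.somaxconn = 16384
EOF
```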
Backlog Tuning (app side)
• Set when you call listen()
• nginx, redis, and apache default to 511
• mysql defaults to 50
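The app-side knobs live in each daemon's own config; a sketch of where they are set (values illustrative, and the kernel still clamps them to net.core.somaxconn):

```shell
# nginx (nginx.conf): per-listen backlog
#   listen 80 backlog=1024;
#
# redis (redis.conf):
#   tcp-backlog 1024
#
# mysql (my.cnf):
#   back_log = 1024
```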
DDoS Handling
The SYN Flood
• Resource exhaustion attack
• Cheaper for the attacker than for the target
• Client sends a SYN with a bogus return address
• Until the handshake's ACK arrives, the connection occupies a slot in the backlog queue
The SYN Cookie
• When enabled, kicks in when the backlog queue is full
• Sends a slightly more expensive but carefully crafted SYN/ACK, then drops the SYN from the queue
• If the client responds to that SYN/ACK, the kernel can rebuild the original SYN and proceed
• Disables large windows while active, but better than dropping connections entirely
Dealing with a SYN Flood
• Default to syncookies enabled and alert when triggered
• tcpdump/wireshark
• Attacks frequently have a detectable signature, such as every SYN carrying the same initial window size
• iptables is very flexible for matching these signatures, but can be expensive
• If hardware filters are available, use iptables to identify and hardware to block
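A sketch of that stance, assuming syncookie support and iptables with the hashlimit match; the 200/sec threshold is an arbitrary illustration, not a recommendation:

```shell
# Keep syncookies on so a full backlog degrades instead of dropping.
sysctl -w net.ipv4.tcp_syncookies=1

# Alerting hook: these counters move when cookies are in play.
netstat -s | grep -i cookie

# Once a signature is confirmed, rate-limit SYNs per source IP
# (threshold is illustrative; measure before picking one).
iptables -A INPUT -p tcp --syn -m hashlimit \
  --hashlimit-above 200/sec --hashlimit-mode srcip \
  --hashlimit-name synflood -j DROP
```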
The Joys of Hardware
Queues
• Modern network hardware is multi-queue
• By default assigns a queue per cpu core
• Should give even balancing of incoming packets, but irqbalance can mess that up
• Intel ships a script with their drivers to aid in static assignment to avoid irqbalance
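A sketch of static pinning done by hand rather than with Intel's script; the IRQ number and core are assumptions for illustration only:

```shell
# Stop irqbalance from moving NIC interrupts around.
systemctl stop irqbalance

# Find the NIC's per-queue interrupts (name pattern varies by driver):
grep eth0 /proc/interrupts

# Pin one queue's IRQ to one core via its CPU bitmask
# (IRQ 42 -> CPU2 here is purely illustrative: bit 2 = 0x4).
echo 4 > /proc/irq/42/smp_affinity
```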
Packet Filters
• Intel and others include packet filters in their NICs
• Small: only 128 entries on the Intel 82599
• Much more limited matching than iptables
• src, dst, type, port, vlan
Hardware Flow Director
• Same mechanism as the filters, but adds a queue mapping destination
• Set affinity so both the queue and the app sit on the same core
• Good for things like SSH and BGP daemons
• Maintains access in the face of an attack on other services
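On Intel NICs this steering is driven through ethtool; a sketch assuming an 82599-class card, with the queue, IRQ, and core numbers purely illustrative:

```shell
# Enable the NIC's flow steering (Intel calls this Flow Director).
ethtool -K eth0 ntuple on

# Steer inbound SSH traffic to RX queue 2...
ethtool -U eth0 flow-type tcp4 dst-port 22 action 2

# ...then pin that queue's IRQ and sshd to the same core
# (IRQ 42 and CPU2 are illustrative).
echo 4 > /proc/irq/42/smp_affinity
taskset -cp 2 "$(pgrep -o sshd)"
```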
TCP Offload Engines
Full Offload
• Not so good on the public net
• Limited buffer resources on card
• Black box from a security perspective
• Can break features like filtering and QoS
Partial Offload
• Better, but with their own caveats
Large Receive Offload
• Collapses packets into a larger buffer before handing to OS
• Great for high-volume bulk ingress, which HTTP is not
• Doesn’t work with IPv6
Generic Receive Offload
• Similar to LRO, but more careful
• Will only merge “safe” packets into a buffer
• Flushes at each timestamp tick
• Usually a win, but you should test
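Checking and flipping these offloads is an ethtool one-liner; a quick sketch (the device name is illustrative):

```shell
# See which receive offloads the driver currently enables:
ethtool -k eth0 | egrep 'large-receive-offload|generic-receive-offload'

# Prefer GRO over LRO for HTTP-style and IPv6 traffic:
ethtool -K eth0 lro off gro on
```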
TCP Segmentation
• OS fills a 64KB buffer and produces a single header template
• NIC splits the buffer into segments and checksums them before sending
• Can save a lot of overhead, but not much of a win for small request/response cycles
TX/RX Checksumming
• For small packets there is almost no win here
Bonding
• The Linux bonding driver uses a single queue, a large bottleneck at high packet rates
• The teaming driver should be better, but its userspace tools only worked on recent Fedora, so I gave up
• Myricom hardware can do bonding natively
TCP Slow Start
Why Slow Start
• Early TCP implementations let the sender immediately send as much data as the client’s window allowed
• In 1986 the internet suffered its first congestion collapse
• 1000x reduction in effective throughput
• In 1988 Van Jacobson proposed slow start
Slow Start
• Goal is to avoid putting more data in flight than there is bandwidth available
• The client advertises a receive window: how much data it can buffer
• The server sends an initial burst based on its initial congestion window
• The window grows with each ACK received, doubling every round trip
• Growth continues until a packet drops or the slow start threshold is reached
Tuning Slow Start
• Increase your congestion window
• 2.6.39+ defaults to 10
• On 2.6.19+ it can be set as a route attribute:
• ip route change default via 172.16.0.1 dev eth0 proto static initcwnd 10
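Putting the route command together with a verification step (the gateway and device are the slide's example values):

```shell
# Raise the initial congestion window on the default route.
ip route change default via 172.16.0.1 dev eth0 proto static initcwnd 10

# Confirm the attribute took effect:
ip route show default
```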
Proportional Rate Reduction
• Added in linux 3.2
• Prior to PRR, a loss would halve the congestion window, potentially below the slow start threshold
• PRR paces retransmits to smooth out the reduction
• Makes disabling net.ipv4.tcp_no_metrics_save safer
TCP Buffering
Throughput = Buffer Size / Latency
Buffer Tuning
• 16MB buffer at 50ms RTT = 320MB/s max rate
• Set as 3 values: min, default, and max
• net.ipv4.tcp_rmem = 4096 65536 16777216
• net.ipv4.tcp_wmem = 4096 65536 16777216
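The 320MB/s figure falls straight out of the formula above; as a runnable check using the slide's 16MB buffer and 50ms RTT:

```shell
# Max throughput = buffer size / RTT.
BUF_BYTES=$((16 * 1024 * 1024))   # 16 MB max buffer
RTT_MS=50                         # round-trip time in ms
# bytes per ms * 1000 = bytes per second; scale down to MB/s.
RATE_MBS=$(( BUF_BYTES * 1000 / RTT_MS / 1024 / 1024 ))
echo "${RATE_MBS} MB/s"           # 320 MB/s at 50 ms
```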
TIME_WAIT
• State entered after a server has closed the connection
• Kept around in case of delayed duplicate ACKs to our FIN
Busy servers collect a lot
• Default timeout is 2 * FIN timeout, so 120s on Linux
• Worth dropping FIN timeout
• net.ipv4.tcp_fin_timeout = 10
Tuning TIME_WAIT
• net.ipv4.tcp_tw_reuse=1
• Reuse sockets in TIME_WAIT if safe to do so
• net.ipv4.tcp_max_tw_buckets
• Default of 131072 is far too small for most sites
• Size it as connections/s * timeout value
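A runnable version of that sizing rule; the 10,000 connections/s rate is a hypothetical example, and 120s is the default 2 * FIN timeout from earlier:

```shell
CONN_PER_SEC=10000   # hypothetical peak connection rate
TIMEOUT_S=120        # 2 * FIN timeout (default)
BUCKETS=$(( CONN_PER_SEC * TIMEOUT_S ))
echo "net.ipv4.tcp_max_tw_buckets = ${BUCKETS}"   # 1200000
```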
net.ipv4.tcp_tw_recycle
• EVIL! Breaks clients behind NAT; leave it disabled
net.ipv4.tcp_tw_reuse
• Allows reuse of a TIME_WAIT socket if the client’s timestamp increases
• Silently drops SYNs from clients whose timestamps don’t increase, as can happen behind NAT
• Still useful for high-churn servers when you can make assumptions about the local network
SSL and Keepalives
• Enable them if at all possible
• The initial handshake is the most computationally intensive part
• 1-10ms on modern hardware
• 6 * RTT before the user gets data really hurts over long distances
• SV to LON = 160ms RTT = ~1s before the user starts getting content
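The ~1s figure is just six round trips at transcontinental latency; as a quick check using the slide's 160ms SV-to-LON RTT:

```shell
RTT_MS=160                        # Silicon Valley to London
HANDSHAKE_MS=$(( 6 * RTT_MS ))    # 6 round trips before first data
echo "${HANDSHAKE_MS} ms"         # 960 ms, roughly one second
```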
SSL and Geo
• If you can move SSL termination closer to your users, do so
• Most CDNs offer this, even for non-cacheable things
• EC2 and Route53 is another option
In closing
• Upgrade your kernel
• Increase your initial congestion window
• Check your backlog and time wait limits
• Size your buffers to something reasonable
• Get closer to your users if you can
Thanks!
Jason Cook (@macros)