Linux Networking: The RISE of the congestion window, the FALL of the routing cache, and the LOCALITY of packets.
TRANSCRIPT
Outline:
- Death of the Routing Cache
- Rise of the TCP Congestion Window
- Locality of Packets
- The End
Linux Networking: The RISE of the congestion window, the FALL of the routing cache, and the LOCALITY of packets.
David S. Miller
Red Hat Inc.
IBM Watson Research Center, 2010
ROUTING CACHE: WHAT IS IT?
- Hash table based cache of routing lookups
- Keyed on many attributes: src and dest address, TOS, device index, etc.
- Assumes the real route lookup is (relatively) slow
- The real route lookup is layered (e.g. policy routing)
ROUTING CACHE: PROBLEMS
- Routing table to routing cache is a one-to-many mapping
- Entries are created in response to packets
- Prime target for DoS attacks
- Mitigation strategies:
  - Secure hash keys
  - Garbage collection
- GC is very non-deterministic and hard to tune
- Routing table changes require careful cache flushing
ROUTING CACHE: WHAT BACKS IT?
- Original algorithm: array of hash tables
  - One hash table per prefix length (0 -> 32)
  - Not the most optimal, but the routing cache makes this OK
  - Relatively simple
- New algorithm: LC-trie
  - Multi-dimensional trie
  - Close to what's known to be optimal
  - Complicated
  - Performance tied to trie balancing heuristics
ROUTING CACHE: BARRIERS TO REMOVAL
- Mainly performance; the cache handled DoS attacks better
  - No longer true after Eric Dumazet's work
- Handling of metrics
  - Move to the existing inetpeer cache
  - Issues of metric granularity
- Storing of the route lookup "result"
  - IPSEC and route stacking
  - And again, performance...
ROUTING CACHE: SIDE TOPICS
- What does BSD do?
  - Uses a PATRICIA tree
  - Clones are created for specific routes
- What does our IPv6 stack do?
  - See BSD above, but with support for source address keying
  - Thus a two-tiered tree layout
TCP CWND: HOW TO KILL THE INTERNET
- CWND == congestion window
- Ironically, by keeping things as they are now
- The initial CWND has stayed constant for more than a decade
- Meanwhile, net capacity has increased dramatically
- The current situation is a bit of a joke
TCP CWND: SHORT TUTORIAL
- Connections start with an initial CWND
- It is increased until loss is detected
- CWND is reduced at loss events
- The process repeats
- Critical aspect: aggressive probing of network capacity
TCP CWND: THE BIG MYTH
- That we actually have an initial CWND
- Actually there is no real limit
- Applications can have as large a one as they want
- Opening up several connections at once:
  N connections == "initial CWND x N"
TCP CWND: GOOGLE’S PROPOSAL
- "draft-hkchu-tcpm-initcwnd-00"
- Increase the initial CWND to 10 packets
- Most web objects do not fit into the existing initial CWND
- With 10 packets, most will fit
- Works well with technologies such as SPDY
TCP CWND: KNEE JERK REACTIONS
- "This increase will cause congestion collapse"
  - FALSE: congestion avoidance is still at work
  - TCP will still back off in the event of loss
- "It will hurt clients with smaller pipes"
  - FALSE: smaller-pipe end hosts get better performance
  - The key is ACK clocking and how fast recovery works
  - 3+ duplicate ACKs are necessary to trigger fast recovery
  - With the old initial CWND that never happened at the start
TCP CWND: ANYWAYS...
- Linux will adopt a larger initial CWND real soon
- Nothing the IETF can do about it (sorry, Chicken Little, the sky is not falling)
- You heard it here first
LOCALITY: SYSTEM HIERARCHY
- Welcome to the NUMA world
- Memory "distance" matters more than ever
- No longer a quaint optimization for "huge" servers
- NUMA is pervasive even on desktops
- Heck, even laptops...
LOCALITY: MULTIQUEUE NETWORKING
- Old systems: single RX queue, single TX queue
- Limited by event signaling in old PCI
- Welcome PCI-E and MSI-X interrupts
- Networking cards began to have multi-Q functionality
- Now it's pervasive
LOCALITY: LINUX SUPPORT FOR HARDWARE MULTI-Q
- Stephen Hemminger's NAPI split-up work
  - Pull NAPI state out of struct netdev
  - One NAPI instance per HW interrupt source
- Making the TX path multi-Q capable
  - Pull queue flow control state out of struct netdev
  - Dealing with qdiscs... ugh...
  - Only the simplest qdiscs are fully multi-Q
  - Complex qdiscs force synchronization at the qdisc
  - In the future, token based qdiscs (SFQ, etc.) can be multi-Q too
  - Hierarchical qdiscs fundamentally cannot (HFSC, HTB, etc.)
  - Create a new multi-Q qdisc for high level flow management
LOCALITY: SOFTWARE MULTI-Q
- Use software facilities to implement multi-Q
- CPU cross calls and packet processing job placement
- Why even bother?
  - Lots of non-multi-Q capable hardware out there
  - Hardware multi-Q is stateless (as it should be)
  - Software schemes provide more flexibility
  - Possibility to optimize for application locality
- Initially I was against it; happily, Tom Herbert was able to convince me
LOCALITY: STAGE ONE: RPS
- Receive Packet Steering, by Tom Herbert
- Stateless flow separation
- Perfectly mimics hardware multi-Q on RX
- Each hardware RX queue has a configurable cpumask
- Packets received on an RX queue hash to a CPU in that mask
LOCALITY: STAGE TWO: RFS
- Receive Flow Steering, again from Tom Herbert
- Hash table of flow-to-CPU mappings
- Dynamically updated: the kernel spies on application I/O calls
- The CPU of the I/O call becomes the flow's CPU mapping
- Table is sized and enabled via sysctl
- Issue: out-of-order packet delivery avoidance
LOCALITY: STAGE THREE: XPS
- Tom now gives us Transmit Packet Steering
- Transmit side locality
- Maps CPUs to transmit queues, the reverse of RPS
- Data structure locality
- Likelihood that the packet free happens near the sending thread
- Eric Dumazet's Transmit Completion Steering patch
LOCALITY: FUTURES
- Hardware assist for RPS/RFS (Ben Hutchings)
  - Test patches exist for SFC chips
  - Makes use of on-chip flow table facilities
- Having lots and lots of hardware queues
  - Negative matching for things like GRO
  - Steering "queues" themselves instead of flows
- Better defaults (all the SW stuff is off by default at the moment)
THE END
Thanks to:
- Erich Nahum and IBM Watson Research Center
- Oren Laadan
- Stephen Hemminger
- Eric Dumazet (AKA: The Networking Ninja)
- Ben Hutchings
- Tom Herbert and Google
- Linus Torvalds