TCP, TCP Congestion Control and Common AQM Schemes: Quick Revision
Shivkumar Kalyanaraman, Rensselaer Polytechnic Institute
[email protected] http://www.ecse.rpi.edu/Homepages/shivkuma
Based in part upon slides of Prof. Raj Jain (OSU), Srini Seshan (CMU), J. Kurose (U Mass), I.Stoica (UCB)

Overview
• Quick introduction to TCP services
• TCP reliability model, mechanisms
• TCP congestion control model and mechanisms
• TCP versions: Reno, NewReno, SACK, Vegas, etc.
• AQM schemes: common goals, RED, …

Multiplexing / demultiplexing
[Figure: sender and receiver protocol stacks (application, transport, network). Processes P1 to P4 exchange messages M; the transport layer wraps each message in a segment with header Ht, and the network layer adds header Hn.]
Segment format (32 bits wide): source port #, dest port #, other header fields, application-layer data (payload).

Checksum
Goal: detect "errors" (e.g., flipped bits) in transmitted segment (i.e., payload + header)
Sender:
• Treat segment contents as a sequence of 16-bit integers
• Checksum: addition (1's complement sum) of segment contents
• Sender puts checksum value into UDP checksum field
Receiver:
• Compute checksum of received segment
• Check if computed checksum equals checksum field value:
  • NO - error detected
  • YES - no error detected. But maybe errors nonetheless?
Note: IP only has a header checksum.
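The sender and receiver steps above can be sketched in a few lines. This is a minimal illustration, not a full UDP implementation: it ignores the pseudo-header, and the function names are made up for this example. It follows the usual Internet convention that the checksum field carries the complement of the 1's complement sum.

```python
def ones_complement_sum16(data: bytes) -> int:
    """Add the data as big-endian 16-bit words with end-around carry."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with one zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return total


def internet_checksum(data: bytes) -> int:
    """Sender side: value to place in the checksum field."""
    return ~ones_complement_sum16(data) & 0xFFFF


def verify(data: bytes, checksum: int) -> bool:
    """Receiver side: sum of data plus checksum must be all ones."""
    return (ones_complement_sum16(data) + checksum) & 0xFFFF == 0xFFFF


payload = bytes.fromhex("0001f203f4f5f6f7")
field = internet_checksum(payload)
# Swapping two 16-bit words leaves the sum unchanged, so this corruption
# passes the check: "no error detected, but maybe errors nonetheless".
swapped = bytes.fromhex("f2030001f4f5f6f7")
print(hex(field), verify(payload, field), verify(swapped, field))
```

The swapped-word case is one concrete reason the slide warns that a passing checksum does not prove the segment is intact.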

Introduction to TCP
Communication abstraction: close equivalent to the UNIX file-system interface => programmer productivity!
• Reliable
• Ordered
• Point-to-point
• Byte-stream
• Full duplex
• Flow and congestion controlled

TCP Header
[Figure: TCP header layout, one row per 32 bits]
• Source port | Destination port
• Sequence number
• Acknowledgement
• HdrLen | 0 | Flags | Advertised window
• Checksum | Urgent pointer
• Options (variable)
• Data
Flags: SYN, FIN, RESET, PUSH, URG, ACK

Principles of Reliable Data Transfer
Characteristics of the unreliable channel will determine the complexity of the reliable data transfer protocol (rdt)

Temporal Redundancy Model
Packets
• Sequence numbers
• CRC or checksum
Status reports
• ACKs
• NAKs
• SACKs
• Bitmaps
Retransmissions
• Packets
• FEC information
Timeout

Types of errors and effects
• Forward channel bit-errors (garbled packets)
• Forward channel packet-errors (lost packets)
• Reverse channel bit-errors (garbled status reports)
• Reverse channel packet-errors (lost status reports)
Protocol-induced effects:
• Duplicate packets
• Duplicate status reports
• Out-of-order packets
• Out-of-order status reports
• Out-of-range packets/status reports (in window-based transmissions)

Reliability Mechanisms…
Mechanisms:
• Checksum: detects corruption in pkts & acks
• ACK: "packet correctly received"
• Duplicate ACK: "packet incorrectly received"
• Sequence number: identifies packet or ack
• 1-bit sequence number used both in forward & reverse channel
• Timeout only at sender
Provides reliable transmission over:
• An error-free channel
• A forward & reverse channel with bit-errors
  • Detects duplicates of packets/acks
  • NAKs eliminated
• Forward & reverse channel w/ packet-errors (loss)

Example: Three-Way Handshake
TCP connection establishment uses a 3-way handshake: necessary and sufficient for unambiguous setup/teardown even under conditions of loss, duplication, and delay
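
Stop-and-Wait (Handshake) Efficiency
[Figure: time-sequence diagram of alternating Data and Ack transmissions, showing the frame transmission time tframe and the propagation delay tprop]
$\alpha = \frac{t_{prop}}{t_{frame}} = \frac{\text{Distance} / \text{Speed of signal}}{\text{Frame size} / \text{Bit rate}} = \frac{\text{Distance} \times \text{Bit rate}}{\text{Frame size} \times \text{Speed of signal}}$
$U = \frac{t_{frame}}{2 t_{prop} + t_{frame}} = \frac{1}{2\alpha + 1}$
Speed of signal: light in vacuum = 300 m/μs, light in fiber = 200 m/μs, electricity = 250 m/μs
Note: no loss or bit-errors!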
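As a worked example of the stop-and-wait utilization, the sketch below plugs numbers into the ratio of propagation delay to frame time. The link parameters (1000 km of fiber, 1000-byte frames, 10 Mb/s) are illustrative assumptions, not values from the slides, and the function name is made up for this example.

```python
def stop_and_wait_utilization(distance_m, signal_speed_m_per_s,
                              frame_bits, bit_rate_bps):
    """U = t_frame / (2 * t_prop + t_frame), assuming no loss or bit errors."""
    t_prop = distance_m / signal_speed_m_per_s   # one-way propagation delay
    t_frame = frame_bits / bit_rate_bps          # time to clock out one frame
    ratio = t_prop / t_frame                     # propagation delay in frame times
    return 1.0 / (2.0 * ratio + 1.0)


# Assumed link: 1000 km of fiber (200 m/us = 2e8 m/s), 8000-bit frames, 10 Mb/s.
u = stop_and_wait_utilization(1_000_000, 2e8, 8000, 10e6)
print(f"utilization = {u:.3f}")
```

With these assumed numbers the propagation delay is 5 ms and the frame time is 0.8 ms, so the sender is idle for most of each round trip.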
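
"Sliding Window" Protocols
[Figure: time-sequence diagram with several Data frames in flight before the first Ack returns, showing tframe and tprop]
With $\alpha = t_{prop} / t_{frame}$ and a window of N frames:
$U = \frac{N t_{frame}}{2 t_{prop} + t_{frame}} = \frac{N}{2\alpha + 1}$ if $N < 2\alpha + 1$
$U = 1$ if $N \ge 2\alpha + 1$
Note: no loss or bit-errors!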
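The window formula can be checked numerically. In this sketch `alpha` stands for the ratio tprop/tframe, and the value 6.25 is an assumed example (it corresponds to a 5 ms propagation delay and a 0.8 ms frame time); the function names are hypothetical.

```python
import math


def sliding_window_utilization(n, alpha):
    """U = N / (2*alpha + 1), capped at 1 once the pipe is full."""
    return min(1.0, n / (2.0 * alpha + 1.0))


def min_window_for_full_utilization(alpha):
    """Smallest integer window N with N >= 2*alpha + 1."""
    return math.ceil(2.0 * alpha + 1.0)


alpha = 6.25  # assumed ratio of propagation delay to frame time
for n in (1, 7, 14):
    print(n, round(sliding_window_utilization(n, alpha), 3))
print("window needed:", min_window_for_full_utilization(alpha))
```

A window of 1 reproduces stop-and-wait; the cap at 1 reflects that the link cannot carry more than its bit rate.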

Sliding Window: Details
[Figure: sender and receiver windows over the sequence-number space]
Sender window: Sent & Acked | Sent Not Acked | OK to Send | Not Usable (markers: Max ACK received, Next seqnum)
Receiver window: Received & Acked | Acceptable Packet | Not Usable (markers: Next expected, Max acceptable)

Window Flow Control: Header
[Figure: TCP headers of a packet sent and a packet received (Source Port, Dest. Port, Sequence Number, Acknowledgment, HL/Flags, Window, D. Checksum, Urgent Pointer, Options), related to the sender's byte stream after an app write: acknowledged | sent | to be sent | outside window]

Go-Back-N Retransmission
• If you hear of packet loss, retransmit the whole window!
• k-bit seq # in pkt header
• Allows up to $N = 2^k - 1$ packets in-flight, unacked
• ACK(n): ACKs all pkts up to, including seq # n - "cumulative ACK"
• Sender may receive duplicate ACKs
• Robust to ack losses on the reverse channel
• Can pinpoint the first packet lost, but cannot identify blocks of lost packets in window
• One timer for oldest-in-flight pkt
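A rough, round-based sketch of the Go-Back-N behaviour follows. It is a simplification under stated assumptions: the sender transmits a whole window, then learns the cumulative ACK and resends from the first unacknowledged packet; ACKs are never lost; and each sequence number in `lost_once` is dropped only the first time it is sent. The function and parameter names are hypothetical.

```python
def go_back_n(num_packets, window, lost_once):
    """Return the sequence numbers in the order the sender transmits them."""
    if window < 1:
        raise ValueError("window must be at least 1")
    base = 0                       # oldest unacknowledged packet
    pending_loss = set(lost_once)  # packets the channel will drop once
    sent_log = []
    while base < num_packets:
        expected = base            # receiver's next expected seq #
        for seq in range(base, min(base + window, num_packets)):
            sent_log.append(seq)
            if seq in pending_loss:
                pending_loss.discard(seq)  # dropped on this attempt only
            elif seq == expected:
                expected += 1              # in order: receiver accepts it
            # otherwise out of order: the receiver discards the packet
        base = expected            # cumulative ACK slides the window
    return sent_log


# Packet 1 is lost once: packets 2 and 3 are sent again even though they arrived.
print(go_back_n(num_packets=6, window=4, lost_once={1}))
```

The repeated transmissions of packets 2 and 3 show the cost of cumulative ACKs: the sender knows where the first loss is but not which later packets got through.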

Selective Repeat: Sender, Receiver Windows

Timeout and RTT Estimation
Problem:
• Unlike a physical link, the RTT of a logical link can vary quite substantially
• How long should the timeout be?
  • Too long => underutilization
  • Too short => wasteful retransmissions
Solution: adaptive timeout, based on a good estimate of the maximum current value of RTT

How to estimate max RTT?
• RTT = prop + queuing delay
• Queuing delay is highly variable, so different samples of RTT will give different random values of queuing delay
• Chebyshev's Theorem:
  • MaxRTT = Avg RTT + k*Deviation
  • Error probability is at most 1/(k**2)
  • Result true for ANY distribution of samples
• In TCP: Timeout = AverageRTT + 4*Deviation, rounded up to timer granularity (50-500 ms)
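The timeout rule above can be sketched as a small estimator. The slide only gives Timeout = AverageRTT + 4*Deviation; the smoothing gains (1/8 for the mean, 1/4 for the deviation) and the initial deviation of half the first sample are assumptions taken from the standard TCP retransmission timer rules, and the class name is hypothetical. All times are in milliseconds.

```python
import math


class RttEstimator:
    def __init__(self, alpha=0.125, beta=0.25, k=4, granularity_ms=500):
        self.alpha = alpha                    # gain for the average RTT (assumed)
        self.beta = beta                      # gain for the deviation (assumed)
        self.k = k                            # 4 deviations, as on the slide
        self.granularity_ms = granularity_ms  # timer tick, 50-500 ms on the slide
        self.srtt = None                      # smoothed (average) RTT
        self.rttvar = None                    # smoothed mean deviation

    def update(self, sample_ms):
        """Fold in one RTT sample and return the new timeout in ms."""
        if self.srtt is None:
            self.srtt = float(sample_ms)
            self.rttvar = sample_ms / 2.0
        else:
            # Update the deviation first, using the old average.
            self.rttvar = ((1 - self.beta) * self.rttvar
                           + self.beta * abs(self.srtt - sample_ms))
            self.srtt = (1 - self.alpha) * self.srtt + self.alpha * sample_ms
        return self.timeout_ms()

    def timeout_ms(self):
        raw = self.srtt + self.k * self.rttvar
        ticks = math.ceil(raw / self.granularity_ms)  # round up to a timer tick
        return ticks * self.granularity_ms


est = RttEstimator(granularity_ms=50)
for sample in (200, 300):
    print(sample, est.update(sample))
```

With a coarse 500 ms tick the same samples round up to a full second, which is why the slides later treat the timeout as something to avoid.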

Recap: Stability of a Multiplexed System
Average Input Rate > Average Output Rate => system is unstable!
How to ensure stability?
1. Reserve enough capacity so that demand is less than reserved capacity
2. Dynamically detect overload and adapt either the demand or capacity to resolve overload

Congestion Problem in Packet Switching
[Figure: hosts A, B, C, D, E connected through links of 10 Mb/s (Ethernet), 1.5 Mb/s, and 45 Mb/s; statistical multiplexing builds a queue of packets waiting for the output link]
If capacity is sized to be less than peak demand (statistical muxing!), need to either reserve resources or dynamically detect/adapt to overload for stability

Congestion: A Close-up View
• knee: point after which
  • throughput increases very slowly
  • delay increases fast
• cliff: point after which
  • throughput starts to decrease very fast to zero (congestion collapse)
  • delay approaches infinity
• Note (in an M/M/1 queue): delay = 1/(1 – utilization), in units of the mean service time
[Figure: throughput vs. load and delay vs. load, with the knee, the cliff, packet loss, and congestion collapse marked]
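The M/M/1 note can be made concrete with a few values. This is a small sketch with a hypothetical function name; the service time of 1.0 is an assumed unit, so the printed delays are multiples of the mean service time.

```python
def mm1_delay(service_time, utilization):
    """Mean time in an M/M/1 system: service_time / (1 - utilization)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return service_time / (1.0 - utilization)


# Delay grows slowly at first, then explodes as the load approaches capacity.
for rho in (0.1, 0.5, 0.9, 0.99):
    print(rho, round(mm1_delay(1.0, rho), 1))
```

Going from 50% to 90% load multiplies the delay by five, while going from 90% to 99% multiplies it by ten again, which is the steep rise past the knee in the delay curve.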

Congestion Control vs. Congestion Avoidance
• Congestion control goal: stay left of cliff
• Congestion avoidance goal: stay left of knee
• Right of cliff: congestion collapse
  • Increase in network load results in decrease of useful work done
[Figure: throughput vs. load, with the knee, the cliff, and congestion collapse marked]

End-to-End Congestion Control
1. End-to-end model:
• The end-system estimates the timing and degree of congestion and reduces its demand appropriately
• Intermediate nodes are relied upon to send timely and appropriate penalty indications (e.g., packet loss rate) during congestion
• Enhanced routers could send more accurate congestion signals (e.g., early congestion notifications, i.e., ECNs)
• Key: trust and complexity reside at end-systems
• Issue: What about misbehaving flows?

Packet Conservation: Self-clocking
[Figure: sender and receiver connected through a bottleneck link, with packet spacings Pb and Pr and ack spacings Ar, Ab, and As]
Implications of ack-clocking:
• More batching of acks => bursty traffic
• Less batching leads to a large fraction of Internet traffic being just acks (overhead)

Additive Increase/Multiplicative Decrease (AIMD) Policy
[Figure: User 1's allocation x1 vs. User 2's allocation x2, with the efficiency line and the fairness line; points x0, x1, x2 mark successive allocations]
• Assumption: decrease policy must (at minimum) reverse the load increase over-and-above the efficiency line
• Implication: decrease factor should be conservatively set to account for any congestion detection lags etc.
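A toy two-user simulation illustrates why AIMD moves the allocation toward the fairness line. Everything here is an illustrative assumption: both users see the same binary congestion signal at the same time, the increase step is 1, the decrease factor is 0.5, and the function name is hypothetical.

```python
def aimd(x1, x2, capacity, add=1.0, mult=0.5, steps=200):
    """Two users adjust their load in lockstep on a shared binary signal."""
    for _ in range(steps):
        if x1 + x2 > capacity:
            # Above the efficiency line: multiplicative decrease for both.
            x1, x2 = x1 * mult, x2 * mult
        else:
            # Below it: additive increase for both.
            x1, x2 = x1 + add, x2 + add
    return x1, x2


# Start far from fair: user 1 has 1 unit, user 2 has 80, capacity is 100.
x1, x2 = aimd(1.0, 80.0, capacity=100.0)
print(round(x1, 2), round(x2, 2), "gap:", round(x2 - x1, 2))
```

Additive increase leaves the gap between the users unchanged, while each multiplicative decrease halves it, so the gap shrinks every time the efficiency line is crossed.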

TCP Congestion Control
Maintains three variables:
• cwnd: congestion window
• rcv_win: receiver advertised window
• ssthresh: threshold size (used to update cwnd); rough estimate of knee point…
For sending use: win = min(rcv_win, cwnd)

TCP: Slow Start
• Whenever starting traffic on a new connection, or whenever increasing traffic after congestion was experienced:
  • Set cwnd = 1
  • Each time a segment is acknowledged, increment cwnd by one (cwnd++).
• Does Slow Start increment slowly? Not really. In fact, the increase of cwnd is exponential!
  • Window increases to W in RTT * log2(W)
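A quick count of round trips shows the log2(W) growth. The sketch assumes an idealized slow start: every segment in the window is acknowledged within one RTT, each ACK adds one segment to cwnd, and nothing is lost. The function name is hypothetical.

```python
def slow_start_rounds(target_window, initial_cwnd=1):
    """Number of round trips until cwnd reaches target_window."""
    cwnd, rounds = initial_cwnd, 0
    while cwnd < target_window:
        cwnd += cwnd   # cwnd ACKs arrive in this round, each adds 1 segment
        rounds += 1
    return rounds


for w in (8, 64, 1000):
    print(w, slow_start_rounds(w))
```

Reaching a window of 1000 segments takes only 10 round trips, which is why the name "slow start" is misleading.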

Slow Start Example
• The congestion window size grows very rapidly
• TCP slows down the increase of cwnd when cwnd >= ssthresh
[Figure: time-sequence diagram]
cwnd = 1: send segment 1; receive ACK for segment 1
cwnd = 2: send segments 2, 3; receive ACK for segments 2 + 3
cwnd = 4: send segments 4, 5, 6, 7; receive ACK for segments 4 + 5 + 6 + 7
cwnd = 8

Slow Start Sequence Plot
[Figure: sequence number vs. time; the window doubles every round]

Congestion Avoidance
Goal: maintain operating point at the left of the cliff. How?
• additive increase: starting from the rough estimate (ssthresh), slowly increase cwnd to probe for additional available bandwidth
• multiplicative decrease: cut congestion window size aggressively if a loss is detected.
• If cwnd >= ssthresh then each time a segment is acknowledged increment cwnd by 1/cwnd, i.e. (cwnd += 1/cwnd), so cwnd grows by about one segment per round trip.

Congestion Avoidance Sequence Plot
[Figure: sequence number vs. time; the window grows by 1 every round]

Slow Start/Congestion Avoidance Example
Assume that ssthresh = 8
cwnd in successive round trips: 1, 2, 4, 8, 9, 10
[Figure: cwnd (in segments, 0 to 14) vs. roundtrip times (t = 0 to 6), with the ssthresh level marked]

Putting Everything Together: TCP Pseudo-code
Initially:
    cwnd = 1;
    ssthresh = infinite;
New ack received:
    if (cwnd < ssthresh)
        /* Slow Start */
        cwnd = cwnd + 1;
    else
        /* Congestion Avoidance */
        cwnd = cwnd + 1/cwnd;
Timeout: (loss detection)
    /* Multiplicative decrease */
    ssthresh = win/2;
    cwnd = 1;
Sending:
    while (next < unack + win)
        transmit next packet;
    where win = min(cwnd, flow_win);
[Figure: seq # axis showing unack, next, and the window win]
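The pseudo-code translates almost line for line into a small runnable class. This is only a sketch of the window arithmetic: it does not send packets, windows are counted in segments, and the class and method names are hypothetical.

```python
import math


class SlowStartCongestionAvoidance:
    def __init__(self, flow_win=math.inf):
        self.cwnd = 1.0
        self.ssthresh = math.inf
        self.flow_win = flow_win   # receiver's advertised window

    @property
    def win(self):
        """Usable window: win = min(cwnd, flow_win)."""
        return min(self.cwnd, self.flow_win)

    def on_new_ack(self):
        if self.cwnd < self.ssthresh:
            self.cwnd += 1.0               # slow start
        else:
            self.cwnd += 1.0 / self.cwnd   # congestion avoidance

    def on_timeout(self):
        self.ssthresh = self.win / 2.0     # multiplicative decrease
        self.cwnd = 1.0


tcp = SlowStartCongestionAvoidance()
for _ in range(7):
    tcp.on_new_ack()        # slow start: cwnd goes from 1 to 8
tcp.on_timeout()            # ssthresh = 4, cwnd back to 1
for _ in range(4):
    tcp.on_new_ack()        # 3 ACKs of slow start, then 1 of avoidance
print(tcp.cwnd, tcp.ssthresh)
```

After the timeout the window climbs quickly back to ssthresh and then switches to the slower 1/cwnd increments.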

The big picture
[Figure: cwnd vs. time, showing Slow Start, Congestion Avoidance, and the drop after a Timeout]

Packet Loss Detection: Timeout Avoidance
• Wait for Retransmission Time Out (RTO)
• What's the problem with this? RTO is a performance killer
• In the BSD TCP implementation, RTO is usually more than 1 second:
  • the granularity of the RTT estimate is 500 ms
  • the retransmission timeout is at least two times the RTT
• Solution: Don't wait for RTO to expire
  • Use alternate mechanisms for loss detection
  • Fall back to RTO only if these alternate mechanisms fail.

Fast Retransmit
• Resend a segment after 3 duplicate ACKs
• Recall: a duplicate ACK means that an out-of-sequence segment was received
• Notes:
  • duplicate ACKs can also be due to packet reordering!
  • if the window is small, the sender doesn't get enough duplicate ACKs!
[Figure: time-sequence diagram. cwnd = 1: segment 1; cwnd = 2: segments 2, 3; cwnd = 4: segments 4, 5, 6, 7; the sender then receives ACK 4 three times (3 duplicate ACKs)]

Fast Recovery (Simplified)
• After a fast retransmit, set ssthresh to cwnd/2 and cwnd to the new ssthresh
  • i.e., halve cwnd; don't reset cwnd to 1
• But when RTO expires still do cwnd = 1
• Fast Retransmit and Fast Recovery are implemented by TCP Reno; most widely used version of TCP today
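The two loss reactions can be contrasted in a short sketch. It covers only the reaction to loss, not window growth, and it assumes the simplified rule of halving the window on the third duplicate ACK (real implementations also inflate the window temporarily during recovery). It also assumes an ACK number names the next segment the receiver expects. The class name is hypothetical.

```python
import math


class RenoLossReaction:
    DUP_ACK_THRESHOLD = 3

    def __init__(self, cwnd=1.0):
        self.cwnd = cwnd
        self.ssthresh = math.inf
        self.last_ack = None
        self.dup_acks = 0

    def on_ack(self, ack_no):
        """Return the segment to retransmit now, or None."""
        if ack_no == self.last_ack:
            self.dup_acks += 1
            if self.dup_acks == self.DUP_ACK_THRESHOLD:
                # Fast retransmit + simplified fast recovery: halve the window.
                self.ssthresh = self.cwnd / 2.0
                self.cwnd = self.ssthresh
                return ack_no
            return None
        self.last_ack = ack_no   # new ACK: restart the duplicate count
        self.dup_acks = 0
        return None

    def on_timeout(self):
        # RTO expiry is treated as severe congestion: back to slow start.
        self.ssthresh = self.cwnd / 2.0
        self.cwnd = 1.0


reno = RenoLossReaction(cwnd=16.0)
print([reno.on_ack(5) for _ in range(4)], reno.cwnd)
```

The first ACK for segment 5 is new; the next three are duplicates, and the third one triggers the retransmission without waiting for the timer.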

Fast Retransmit and Fast Recovery
• Retransmit after 3 duplicate acks: prevents expensive timeouts
• No need to slow start again
• At steady state, cwnd oscillates around the optimal window size.
[Figure: cwnd vs. time, showing Slow Start followed by a Congestion Avoidance sawtooth]

Fast Retransmit
[Figure: sequence number vs. time; one loss (X) is followed by duplicate acks and a retransmission]

Multiple Losses
[Figure: sequence number vs. time; several losses (X) in one window, followed by duplicate acks and a single retransmission]
Now what? (TCP Versions)

TCP Versions: Tahoe
[Figure: sequence number vs. time with multiple losses (X) in one window]
Tahoe: set window to 1, and do slow start! No timeout…

TCP Versions: Reno
[Figure: sequence number vs. time with multiple losses (X) in one window]
Now what? Timeout
Reno: recovers from 1 packet loss ok, but multiple losses => timeout

TCP Reno (Jacobson 1990)
[Figure: window vs. time; SS followed by CA with fast retransmission/fast recovery]
SS: Slow Start
CA: Congestion Avoidance

NewReno
[Figure: sequence number vs. time with multiple losses (X) in one window]
Now what? Partial ack recovery
NewReno: recovers multiple losses in successive RTTs using the notion of a "partial ack". No timeout.

SACK
• Basic problem is that cumulative acks provide only a little information
  • Alt: Selective Ack for just the packet received
  • What if selective acks are lost? Carry cumulative ack also!
• Implementation: bitmask of packets received
  • Selective acknowledgement (SACK)
  • Only provided as an optimization for retransmission
  • Fall back to cumulative acks to guarantee correctness and window updates

SACK
[Figure: sequence number vs. time with multiple losses (X) in one window]
Now what? Send retransmissions as soon as detected

TCP Congestion Control Summary
• Sliding window limited by receiver window.
• Dynamic windows: slow start (exponential rise), congestion avoidance (additive rise), multiplicative decrease.
• Ack clocking
• Adaptive timeout: need mean RTT & deviation
• Timer backoff and Karn's algo during retransmission
• Go-back-N or Selective retransmission
• Cumulative and Selective acknowledgements
• Timeout avoidance: Fast Retransmit

Queuing Disciplines
• Each router must implement some queuing discipline
• Queuing allocates bandwidth and buffer space:
  • Bandwidth: which packet to serve next (scheduling)
  • Buffer space: which packet to drop next (buff mgmt)
• Queuing also affects latency
[Figure: traffic sources feed traffic classes (Class A, Class B, Class C); buffer management decides drops and scheduling decides service order]

Typical Internet Queuing
• FIFO + drop-tail
  • Simplest choice
  • Used widely in the Internet
• FIFO (first-in-first-out)
  • Implies single class of traffic
• Drop-tail
  • Arriving packets get dropped when queue is full regardless of flow or importance
• Important distinction:
  • FIFO: scheduling discipline
  • Drop-tail: drop (buffer management) policy

FIFO + Drop-tail Problems
FIFO issues: in a FIFO discipline, the service seen by a flow is convoluted with the arrivals of packets from all other flows!
• No isolation between flows: full burden on e2e control
• No policing: send more packets, get more service
Drop-tail issues:
• Routers are forced to have large queues to maintain high utilizations
• Larger buffers => larger steady state queues/delays
• Synchronization: end hosts react to the same events because packets tend to be lost in bursts
• Lock-out: a side effect of burstiness and synchronization is that a few flows can monopolize queue space

Queue Management Ideas
• Synchronization, lock-out:
  • Random drop: drop a randomly chosen packet
  • Drop front: drop packet from head of queue
• High steady-state queuing vs burstiness:
  • Early drop: drop packets before queue full
  • Do not drop packets "too early" because queue may reflect only burstiness and not true overload
• Misbehaving vs fragile flows:
  • Drop packets proportional to queue occupancy of flow
  • Try to protect fragile flows from packet loss (e.g., color them or classify them on the fly)
• Drop packets vs mark packets:
  • Dropping packets interacts w/ reliability mechanisms
  • Mark packets: need to trust end-systems to respond!

Packet Drop Dimensions
• Aggregation: per-connection state, class-based queuing, single class
• Drop position: head, random location, tail
• Early drop vs. overflow drop

Random Early Detection (RED)
[Figure: queue with Min thresh and Max thresh marked against the average queue length; plot of P(drop) vs. avg queue length with minth and maxth on the horizontal axis and maxP and 1.0 on the vertical axis]

Random Early Detection (RED)
• Maintain running average of queue length
  • Low pass filtering
• If avg Q < minth do nothing
  • Low queuing, send packets through
• If avg Q > maxth, drop packet
  • Protection from misbehaving sources
• Else mark (or drop) packet in a manner proportional to queue length & bias to protect against synchronization
  • $P_b = max_p \cdot (avg - min_{th}) / (max_{th} - min_{th})$
  • Further, bias $P_b$ by history of unmarked packets
  • $P_a = P_b / (1 - count \cdot P_b)$
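The marking rule above can be sketched as two small functions. This is a simplified illustration: the averaging weight of 0.002 is an assumed value that the slides do not give, `count` is taken to be the number of packets let through unmarked since the last mark, and the function names are hypothetical. The thresholds in the example (10 and 40 packets, maxp = 0.1) are only sample settings.

```python
def update_average(avg, queue_len, weight=0.002):
    """Low-pass filter (EWMA) of the instantaneous queue length."""
    return (1.0 - weight) * avg + weight * queue_len


def red_mark_probability(avg, min_th, max_th, max_p, count=0):
    """Probability of marking (or dropping) the arriving packet."""
    if avg < min_th:
        return 0.0            # low queuing: let the packet through
    if avg > max_th:
        return 1.0            # persistent overload: always drop
    pb = max_p * (avg - min_th) / (max_th - min_th)
    denom = 1.0 - count * pb  # bias by the run of unmarked packets
    if denom <= 0.0:
        return 1.0
    return min(1.0, pb / denom)


for avg, count in ((5, 0), (25, 0), (25, 10), (45, 0)):
    print(avg, count, round(red_mark_probability(avg, 10, 40, 0.1, count), 3))
```

The count term raises the probability the longer it has been since the last mark, which spaces marks more evenly and helps avoid hitting many flows at the same instant.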

RED Issues
Issues:
• Breaks synchronization well
• Extremely sensitive to parameter settings
• Wild queue oscillations upon load changes
• Fails to prevent buffer overflow as #sources increases
• Does not help fragile flows (e.g., small window flows or retransmitted packets)
• Does not adequately isolate cooperative flows from non-cooperative flows
Isolation:
• Fair queuing achieves isolation using per-flow state
• RED penalty box: monitor history for packet drops, identify flows that use disproportionate bandwidth

REM (Athuraliya & Low 2000)
[Figure: link marking probability (0 to 1) vs. link congestion measure (0 to 20)]
Main ideas:
• Decouple congestion & performance measure
• "Price" adjusted to match rate and clear buffer
• Marking probability exponential in "price"
[Figure: marking probability of REM and of RED as a function of avg queue, up to 1]

Comparison of AQM Performance
• DropTail: queue = 94%
• RED: min_th = 10 pkts, max_th = 40 pkts, max_p = 0.1
• REM: queue = 1.5 pkts, utilization = 92%, = 0.05, = 0.4, = 1.15

What is TCP Throughput?
[Figure: sawtooth of window vs. time t, oscillating between 2w/3 and 4w/3 with average w = (4w/3 + 2w/3)/2; area under one cycle = $2w^2/3$]
• Each cycle delivers $2w^2/3$ packets
• Assume: each cycle delivers 1/p packets = $2w^2/3$
• Delivers 1/p packets followed by a drop => loss probability = p/(1+p) ~ p if p is small.
• Hence $w = \sqrt{3/(2p)}$

The $1/\sqrt{p}$ Law
• Equilibrium window size: $w_s = \frac{a}{\sqrt{p}}$
• Equilibrium rate: $x_s = \frac{a}{D_s \sqrt{p}}$
• Empirically constant a ~ 1
• Verified extensively through simulations and on Internet
References
• T.J. Ott, J.H.B. Kemperman and M. Mathis (1996)
• M. Mathis, J. Semke, J. Mahdavi, T. Ott (1997)
• T.V. Lakshman and U. Madhow (1997)
• J. Padhye, V. Firoiu, D. Towsley, J. Kurose (1998)
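As a numeric check of the square-root law, the sketch below evaluates the equilibrium window and rate for a given loss probability. It assumes the sawtooth model in which each cycle delivers 2w²/3 = 1/p packets, so the constant a is sqrt(3/2), about 1.22, and it takes the delay term in the rate formula to be the round-trip time (that reading, the units, and the function names are assumptions of this example).

```python
import math

SAWTOOTH_A = math.sqrt(1.5)   # constant from 2*w**2/3 = 1/p


def equilibrium_window(p, a=SAWTOOTH_A):
    """Average window in packets: w = a / sqrt(p)."""
    return a / math.sqrt(p)


def equilibrium_rate(p, rtt_s, a=SAWTOOTH_A):
    """Average rate in packets per second: x = a / (rtt * sqrt(p))."""
    return equilibrium_window(p, a) / rtt_s


p, rtt = 0.015, 0.1           # assumed: 1.5% loss, 100 ms round trip
w = equilibrium_window(p)
print(round(w, 2), "packets;", round(equilibrium_rate(p, rtt), 1), "packets/s")
print("packets per cycle:", round(2 * w**2 / 3, 2), "vs 1/p =", round(1 / p, 2))
```

Because the rate scales with the inverse square root of p, the loss rate must fall by a factor of four to double the throughput at a fixed round-trip time.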