tcp congestion signatures - sigcommtcp’s rtt congestion signatures • flows experiencing...
TRANSCRIPT
w w w .caida.org
TCP CONGESTION SIGNATURES
1
Srikanth Sundaresan (Princeton Univ.)Amogh Dhamdhere (CAIDA/UCSD)
kc Claffy (CAIDA/UCSD)Mark Allman (ICSI)
w w w .caida.or
Typical Speed Tests Don’t Tell Us Much
2
w w w .caida.or
Typical Speed Tests Don’t Tell Us Much
2
w w w .caida.or
Typical Speed Tests Don’t Tell Us Much
2
w w w .caida.or
Typical Speed Tests Don’t Tell Us Much
• Upload and download throughput measurements: no information beyond that
2
w w w .caida.or
Typical Speed Tests Don’t Tell Us Much
2
What type of congestion did the TCP flow experience?
w w w .caida.or
Two Potential Sources of Congestion in the End-to-end Path
3
w w w .caida.or
Two Potential Sources of Congestion in the End-to-end Path
• Self-induced congestion
- Clear path, the flow itself induced congestion
- eg: last-mile access link
3
w w w .caida.or
Two Potential Sources of Congestion in the End-to-end Path
• Self-induced congestion
- Clear path, the flow itself induced congestion
- eg: last-mile access link
• External congestion
- Flow starts on an already congested path
- eg: congested interconnect
3
w w w .caida.or
Two Potential Sources of Congestion in the End-to-end Path
• Self-induced congestion
- Clear path, the flow itself induced congestion
- eg: last-mile access link
• External congestion
- Flow starts on an already congested path
- eg: congested interconnect
3
Distinguishing the two cases has implications for users / ISPs / regulators
w w w .caida.or
How can we distinguish the two?
• Cannot distinguish using just throughput numbers
- Access plan rates vary widely, and are typically not available to content / speed test providers
- eg: Speed test reports 5 Mbps – is that the access link rate (DSL), or a congested path?
4
w w w .caida.or
How can we distinguish the two?
• Cannot distinguish using just throughput numbers
- Access plan rates vary widely, and are typically not available to content / speed test providers
- eg: Speed test reports 5 Mbps – is that the access link rate (DSL), or a congested path?
4
We can use the dynamics of TCP’s startup phase, i.e., Congestion Signatures
w w w .caida.or
TCP’s RTT Congestion Signatures
5
w w w .caida.or
TCP’s RTT Congestion Signatures
• Flows experiencing self-induced congestion fill up an empty buffer during slow start
- Hence increase the TCP flow RTT
5
w w w .caida.or
TCP’s RTT Congestion Signatures
• Flows experiencing self-induced congestion fill up an empty buffer during slow start
- Hence increase the TCP flow RTT
• Externally congested flows encounter an already full buffer
- Less potential for RTT increases
5
w w w .caida.or
TCP’s RTT Congestion Signatures
• Flows experiencing self-induced congestion fill up an empty buffer during slow start
- Hence increase the TCP flow RTT
• Externally congested flows encounter an already full buffer
- Less potential for RTT increases
• Self-induced congestion therefore has higher RTT variance compared to external congestion
5
w w w .caida.or
TCP’s RTT Congestion Signatures
• Flows experiencing self-induced congestion fill up an empty buffer during slow start
- Hence increase the TCP flow RTT
• Externally congested flows encounter an already full buffer
- Less potential for RTT increases
• Self-induced congestion therefore has higher RTT variance compared to external congestion
5
We can quantify this using Max-Min and CoV of RTT
w w w .caida.or
Example Controlled Experiment• 20 Mbps “access” link
with 100 ms buffer
• 1 Gbps “interconnect” link with 50 ms buffer
• Self-induced congestion flows have higher values for both metrics and are clearly distinguishable
Max-Min RTT
6
CoV RTT
101 1020.0
0.2
0.4
0.6
0.8
1.0
CDF
ExternalSelf
10−2 10−1 1000.0
0.2
0.4
0.6
0.8
1.0
CDF
ExternalSelf
w w w .caida.or
Example Controlled Experiment• 20 Mbps “access” link
with 100 ms buffer
• 1 Gbps “interconnect” link with 50 ms buffer
• Self-induced congestion flows have higher values for both metrics and are clearly distinguishable
The two types of congestion exhibit widely contrasting behaviors
Max-Min RTT
6
CoV RTT
101 1020.0
0.2
0.4
0.6
0.8
1.0
CDF
ExternalSelf
10−2 10−1 1000.0
0.2
0.4
0.6
0.8
1.0
CDF
ExternalSelf
w w w .caida.or
Model
• Max-min and CoV of RTT derived from RTT samples during slow start
• We feed the two metrics into a simple Decision Tree
- We control the depth of the tree to a low value to minimize complexity
• We build the decision tree classifier using controlled experiments and apply it to real-world data
7
w w w .caida.or
Validating the Method: Step 1- Controlled Experiments
8
Internet
R2R1
Server 1Server 2
Server 3
Server 4
Pi 1
Pi 2
100 Mbps
Shaped “access”
1 Gbps
w w w .caida.or
Validating the Method: Step 1- Controlled Experiments
8
Internet
R2R1
Server 1Server 2
Server 3
Server 4
Pi 1
Pi 2
100 Mbps
Shaped “access”
1 Gbps
Background cross-traffic
Interconnectcross-traffic
w w w .caida.or
Validating the Method: Step 1- Controlled Experiments
8
Internet
R2R1
Server 1Server 2
Server 3
Server 4
Pi 1
Pi 2
100 Mbps
Shaped “access”
1 Gbps
Throughput tests
Background cross-traffic
Interconnectcross-traffic
w w w .caida.or
It’s Real
9
w w w .caida.or
It’s Real
9
FantasticCablingeffort
Post-itdefined
networking
w w w .caida.or
Validating the Method: Step 1- Controlled Experiments
• Emulated access link + “core” link
- Wide range of access link throughputs, buffer sizes, loss rates, cross-traffic (background and congestion-inducing)
- Can accurately label flows in training data as “self ” or “externally” congested
10
Internet
R2R1
Server 1Server 2
Server 3
Server 4
Pi 1
Pi 2
100 Mbps
Shaped “access”
1 Gbps
Throughput tests
Background cross-traffic
Interconnectcross-traffic
w w w .caida.or
Validating the Method: Step 1- Controlled Experiments
11
Internet
R2R1
Server 1Server 2
Server 3
Server 4
Pi 1
Pi 2
100 Mbps
Shaped “access”
1 Gbps
Throughput tests
Background cross-traffic
Interconnectcross-traffic
High accuracy: precision and recall > 80%robust to model settings
w w w .caida.or
Validating the Method: Step 2
• From Ark VP in ISP A identified congested link with ISP B using TSLP*
12*Luckie et al. “Challenges in Inferring Internet Interdomain Congestion”, IMC 2014
ISP A
ISP B
Ark VP
w w w .caida.or
Validating the Method: Step 2
• From Ark VP in ISP A identified congested link with ISP B using TSLP*
12*Luckie et al. “Challenges in Inferring Internet Interdomain Congestion”, IMC 2014
ISP A
ISP B
congested link
Ark VP
w w w .caida.or
Validating the Method: Step 2
13
ISP A
ISP B
M-lab NDT server
congested link
Ark VP
• Periodic NDT tests from Ark VP to M-Lab NDT server “behind” the congested interdomain link
w w w .caida.or
Validation of the Method: Step 2
14
0 5
10 15 20 25 30
02/18 02/25 03/04 03/11
d/l M
bps
10 20 30 40 50 60 70
02/18 02/25 03/04 03/11TSLP
late
ncy
(far s
ide)
Strong correlation between throughput and TSLP latency: flows during elevated TSLP latency
labeled as “externally” congested
w w w .caida.or
Validation of the Method: Step 2
14
0 5
10 15 20 25 30
02/18 02/25 03/04 03/11
d/l M
bps
10 20 30 40 50 60 70
02/18 02/25 03/04 03/11TSLP
late
ncy
(far s
ide)
Strong correlation between throughput and TSLP latency: flows during elevated TSLP latency
labeled as “externally” congested
“Externally”congested
w w w .caida.or
Validation of the Method: Step 2
14
0 5
10 15 20 25 30
02/18 02/25 03/04 03/11
d/l M
bps
10 20 30 40 50 60 70
02/18 02/25 03/04 03/11TSLP
late
ncy
(far s
ide)
Strong correlation between throughput and TSLP latency: flows during elevated TSLP latency
labeled as “externally” congested
“Externally”congested
“self”congested
w w w .caida.or
Validation of the Method: Step 2
15
0 5
10 15 20 25 30
02/18 02/25 03/04 03/11
d/l M
bps
10 20 30 40 50 60 70
02/18 02/25 03/04 03/11TSLP
late
ncy
(far s
ide)
75%+ accuracy in detecting external congestion, 100% accuracy for self-induced congestion
w w w .caida.or
Validation of the Method: Step 3
• We use Measurement Lab’s NDT test data for real-world validation
• Cogent interconnect issue in late 2013/early 2014
- NDT tests to Cogent servers saw significant drops in throughput during peak hours
- Several major U.S. ISPs were affected, except Cox
- The problem was identified as congested interconnects
16
w w w .caida.or
Using the M-lab Data
17
January 2014
April 2014
0 5 10 15 20
Hour of day (local)
10
20
30
40
Mbp
s
ComcastCox
TimeWarnerVerizon
0 5 10 15 20
Hour of day (local)
10
20
30
40
Mbp
s
ComcastCox
TimeWarnerVerizon
w w w .caida.or
Using the M-lab Data
17
January 2014
Drop in peak-hour throughput for for Comcast, TWC, Verizon
April 2014
0 5 10 15 20
Hour of day (local)
10
20
30
40
Mbp
s
ComcastCox
TimeWarnerVerizon
0 5 10 15 20
Hour of day (local)
10
20
30
40
Mbp
s
ComcastCox
TimeWarnerVerizon
w w w .caida.or
Using the M-lab Data
17
January 2014
Cox not affected
Drop in peak-hour throughput for for Comcast, TWC, Verizon
April 2014
0 5 10 15 20
Hour of day (local)
10
20
30
40
Mbp
s
ComcastCox
TimeWarnerVerizon
0 5 10 15 20
Hour of day (local)
10
20
30
40
Mbp
s
ComcastCox
TimeWarnerVerizon
w w w .caida.or
Using the M-lab Data
17
January 2014
Cox not affected
Drop in peak-hour throughput for for Comcast, TWC, Verizon
April 2014
Interconnection dispute resolved; no diurnal effect
0 5 10 15 20
Hour of day (local)
10
20
30
40
Mbp
s
ComcastCox
TimeWarnerVerizon
0 5 10 15 20
Hour of day (local)
10
20
30
40
Mbp
s
ComcastCox
TimeWarnerVerizon
w w w .caida.or
Using the M-lab Data
18
Peak hour tests inJan/Feb 2014 are likely “externally” congested
Off-peak tests in Mar/Apr 2014 are likely
“self” congested
0 5 10 15 20
Hour of day (local)
10
20
30
40
Mbp
s
ComcastCox
TimeWarnerVerizon
0 5 10 15 20
Hour of day (local)
10
20
30
40
Mbp
s
ComcastCox
TimeWarnerVerizon
w w w .caida.or
But didn’t you just say it’s hard to infer congestion using throughput tests??
19
w w w .caida.or
But didn’t you just say it’s hard to infer congestion using throughput tests??
• Yes :)
19
w w w .caida.or
But didn’t you just say it’s hard to infer congestion using throughput tests??
• Yes :)
• For that reason, our labeling is broad and coarse. All tests labeled “external” may not be traversing congested interconnects
19
w w w .caida.or
But didn’t you just say it’s hard to infer congestion using throughput tests??
• Yes :)
• For that reason, our labeling is broad and coarse. All tests labeled “external” may not be traversing congested interconnects
• We do not expect the technique to identify all peak hour tests as externally congested, and vice versa
- Looking for qualitative differences
19
w w w .caida.or
But didn’t you just say it’s hard to infer congestion using throughput tests??
• Yes :)
• For that reason, our labeling is broad and coarse. All tests labeled “external” may not be traversing congested interconnects
• We do not expect the technique to identify all peak hour tests as externally congested, and vice versa
- Looking for qualitative differences
• The general observations about congestion were verified by other sources, e.g., CAIDA’s TSLP measurements
19
w w w .caida.or
Applying the Model to M-lab data
20
Comcast
TimeWarnerVerizon Cox
Comcast
TimeWarnerVerizon Cox
Comcast
TimeWarnerVerizon Cox
0.2
0.4
0.6
0.8
1.0
%se
lf-in
duce
dco
nges
tion
Cogent (LAX) Cogent (LGA) Level3 (ATL)
Jan-Feb Mar-Apr
w w w .caida.or
Applying the Model to M-lab data
20
Comcast
TimeWarnerVerizon Cox
Comcast
TimeWarnerVerizon Cox
Comcast
TimeWarnerVerizon Cox
0.2
0.4
0.6
0.8
1.0
%se
lf-in
duce
dco
nges
tion
Cogent (LAX) Cogent (LGA) Level3 (ATL)
Jan-Feb Mar-Apr
Much lower incidences of self-induced congestion for Cogent in Jan/Feb 2014 as
compared to Mar/Apr
w w w .caida.or
Applying the Model to M-lab data
21
Comcast
TimeWarnerVerizon Cox
Comcast
TimeWarnerVerizon Cox
Comcast
TimeWarnerVerizon Cox
0.2
0.4
0.6
0.8
1.0
%se
lf-in
duce
dco
nges
tion
Cogent (LAX) Cogent (LGA) Level3 (ATL)
Jan-Feb Mar-Apr
Level3 does not show significant differences, was not affected by
interconnection disputes
w w w .caida.or
Applying the Model to M-lab data
22
Comcast
TimeWarnerVerizon Cox
Comcast
TimeWarnerVerizon Cox
Comcast
TimeWarnerVerizon Cox
0.2
0.4
0.6
0.8
1.0
%se
lf-in
duce
dco
nges
tion
Cogent (LAX) Cogent (LGA) Level3 (ATL)
Jan-Feb Mar-Apr
Cox does not show significant differences, was not affected by interconnection disputes
w w w .caida.or
Looking at Throughput
• What throughput should we observe for “self ” and “external” congested flows?
• With congested interconnects affecting many flows, both “self ” and “external” should see similar throughput
• Without congested interconnects affecting many flows, “self ” congested throughput should follow access link speeds, generally higher than “externally” congested
23
w w w .caida.or
Looking at Throughput
• Avg. throughput of self-induced congestion flows significantly higher than externally congested in Mar-Apr (no interconnection disputes)
24
Comcast TimeWarner Verizon Cox0
5
10
15
20
Thro
ughp
ut(M
bps)
Jan-Feb SelfJan-Feb External
Mar-Apr SelfMar-Apr External
w w w .caida.or
Looking at Throughput
• Avg. throughput of self-induced congestion flows significantly higher than externally congested in Mar-Apr (no interconnection disputes)
24
Comcast TimeWarner Verizon Cox0
5
10
15
20
Thro
ughp
ut(M
bps)
Jan-Feb SelfJan-Feb External
Mar-Apr SelfMar-Apr External
Both “self”and
“external”get similarthroughput
w w w .caida.or
Looking at Throughput
• Avg. throughput of self-induced congestion flows significantly higher than externally congested in Mar-Apr (no interconnection disputes)
25
Comcast TimeWarner Verizon Cox0
5
10
15
20
Thro
ughp
ut(M
bps)
Jan-Feb SelfJan-Feb External
Mar-Apr SelfMar-Apr External
w w w .caida.or
Looking at Throughput
• Avg. throughput of self-induced congestion flows significantly higher than externally congested in Mar-Apr (no interconnection disputes)
25
Comcast TimeWarner Verizon Cox0
5
10
15
20
Thro
ughp
ut(M
bps)
Jan-Feb SelfJan-Feb External
Mar-Apr SelfMar-Apr External
“Self” gethigher
throughputthan
“external”
w w w .caida.or
Takeaways
• It is possible to distinguish two kinds of congestion: self-induced vs. externally congested
• The difference is important to identify the solution
- Upgrade service plan? Or talk to ISP?
- Also for regulatory purposes
• Simple, accurate technique using RTT during TCP slow start dynamics
- Can be easily computed using packet captures or other tools such as Web100 (future work)
26
w w w .caida.or
Limitations
• Relies on buffering effect
- May not work on TCP variants that minimize buffer occupancy, e.g., BBR
• Only uses slow start dynamics
- Might be confounded by flows that perform one way during slow start but differently afterward
• Real-world validation relies on coarsely labeled data
- It would be great to validate on more real-world data!
27
w w w .caida.or
Thanks!Questions?
28