1 April 20, 2023
Root Cause Analysis of TCP Throughput:
Methodology, Techniques, and Applications
Matti Siekkinen
Ph.D. Defense
October 30, 2006
Institut Eurecom
Sophia Antipolis, France
2 April 20, 2023
Outline
• Introduction and Motivation
  • Root cause analysis of TCP throughput: what and why?
• Part 1: Methodology
  • InTraBase: Integrated Traffic Analysis Based on Object Relational DBMS
• Part 2: Root cause analysis techniques
  • Taxonomy of TCP rate limitation causes
  • Our approach to infer limitation causes
• Part 3: Case study on Performance Analysis of ADSL Clients
• Conclusions
  • Contributions
  • Future work
3 April 20, 2023
The Internet: over the last 5 years…
Traffic volumes and number of users have skyrocketed
• Access link capacities have multiplied
• Dominance shifted from Web+FTP to peer-to-peer applications
• TCP is still the dominant transport protocol
  • Carries over 90% of traffic
4 April 20, 2023
The Internet: questions raised
• ISPs would like to know how clients are doing
• What are the performance limitations that Internet applications are facing?
  • Why does a client with 4 Mbit/s ADSL access obtain a total download rate of only a few KB/s with eDonkey?
  • Why do I see no improvement in throughput after upgrading my link?
• The Internet does not provide answers directly
  • The network is dumb!
• Need techniques for traffic measurement and analysis
5 April 20, 2023
Root Cause Analysis of TCP Throughput
• What?
  • Analysis and inference of the reasons that prevent a given TCP connection from achieving a higher throughput
  • These reasons are called limitation causes
• Why TCP?
  • TCP typically carries over 90% of all traffic
6 April 20, 2023
Background
• TCP Rate Analysis Tool (T-RAT) by Zhang et al. (SIGCOMM 2002)
  • Pioneering research work
  • Groundbreaking insights: it is not all congestion!
  • Opened up many questions
• We implemented and tested it
  • Results are way off too often
  • Fundamental assumptions do not hold
• T-RAT analyzes unidirectional traffic
  • Passively collected measurements
  • Usable in more cases (asymmetric paths)
  • The source of the problems
7 April 20, 2023
Our approach
• We analyze only passive traffic measurements
  • Capture and store all TCP/IP headers, analyze later off-line
• Observe traffic at a single measurement point
  • Applicable in diverse situations
  • E.g. at the edge of an ISP’s network
    • Know all about clients’ downloads and uploads
• Bidirectional packet traces
• Connection level analysis
8 April 20, 2023
Challenges (1/3)
• Single measurement point anywhere along the path
  • Cannot/don’t want to control it
  • Complicates estimation of parameters (RTT and cwnd)
[Figure: packet exchange between hosts A and B seen from the measurement point. A: RTT ~ d1, piece of cake… B: RTT ~ d3 + d4. How to get d4? (Did ack2 trigger data2?)]
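One case where the trigger relationship is unambiguous is the three-way handshake: the SYN-ACK answers the SYN and the first ACK answers the SYN-ACK, so a probe anywhere on the path can add the two partial delays. The Python sketch below illustrates that idea only. It is an assumption about one of several possible estimators, not necessarily one used in this work, and the function name and timestamps are hypothetical.

```python
def handshake_rtt(syn_ts: float, synack_ts: float, ack_ts: float) -> float:
    """Estimate RTT at a mid-path probe from three handshake timestamps (seconds)."""
    if not syn_ts <= synack_ts <= ack_ts:
        raise ValueError("handshake timestamps must be in capture order")
    # Probe -> server -> probe: delay on the server side of the probe.
    server_side = synack_ts - syn_ts
    # Probe -> client -> probe: delay on the client side of the probe.
    client_side = ack_ts - synack_ts
    return server_side + client_side


# Hypothetical timestamps: SYN seen at 0 ms, SYN-ACK at 80 ms, ACK at 100 ms.
rtt = handshake_rtt(0.000, 0.080, 0.100)
print(f"estimated RTT: {rtt * 1000:.0f} ms")
```

Later in the connection the same sum is harder to obtain, because the probe has to decide which ACK triggered which data packet.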
9 April 20, 2023
Challenges (2/3)
• A lot of data to analyze
  • Potentially millions of connections per trace
• Deep analysis
  • For each connection of each trace
    • Compute a lot of metrics
    • Divide connections into pieces
      • Analyse separately and compute more metrics
• Need to keep track of everything
10 April 20, 2023
Challenges (3/3)
• Find the right metrics to characterize all limitations
  • Not too many
  • Need to gather a lot of experience
• Get it right!
  • Several methods for computing a particular metric
    • Choose the “best” for the situation
    • Try to maximize correctness of results
    • E.g. 5 ways to estimate RTTs
  • Careful validations
    • Benchmark with a lot of reference traces
    • Cross-validate metrics
12 April 20, 2023
Why did we need InTraBase?
• First try: ad-hoc scripts and specialized software tools (tcptrace et al.)
• Problems:
  1. Management
     • Data, metadata, and tools
     • Got lost with files containing data and ad-hoc scripts
     • A lot of metrics to compute and combine
  2. Cumbersome analysis process
     • Iterative analysis
     • Data loses semantics and structure
  3. Scalability
     • Cannot analyze large enough data sets
[Figure: iterative analysis cycle: Filter, Process, Combine, Store, Interpret]
13 April 20, 2023
Our InTraBase approach
[Figure: InTraBase architecture. Raw base data files (application logs, Web100, tcpdump captures from the network link, covering the application, TCP, and IP layers) are preprocessed and loaded into the database system, which holds base data, metadata, functions, and results and is accessed through queries.]
• Store traffic measurements in files as base data
• Upload base data into the DB and process it within the DB
  • Issue SQL queries
  • Object-relational DBMS: create functions for advanced processing
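The workflow above (load base data, then compute metrics with SQL inside the database) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: SQLite from the Python standard library stands in for the object-relational DBMS, and the `packets` table layout and the sample rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the object-relational DBMS.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE packets (cnxid INTEGER, reverse INTEGER, ts REAL, nbytes INTEGER)"
)

# Hypothetical base data: (connection id, direction flag, timestamp in s, bytes).
base_data = [
    (1, 0, 0.0, 1000), (1, 0, 0.5, 1000), (1, 0, 1.0, 1000),
    (2, 0, 0.0, 1500), (2, 0, 2.0, 1500),
]
conn.executemany("INSERT INTO packets VALUES (?, ?, ?, ?)", base_data)

# Per-connection throughput in bit/s, computed inside the database.
query = """
    SELECT cnxid, reverse,
           SUM(nbytes) * 8.0 / (MAX(ts) - MIN(ts)) AS throughput
    FROM packets
    GROUP BY cnxid, reverse
    HAVING MAX(ts) > MIN(ts)
    ORDER BY throughput DESC
"""
throughputs = conn.execute(query).fetchall()
for cnxid, reverse, tput in throughputs:
    print(cnxid, reverse, tput)
```

Keeping the result of such a query in its own table is one way to obtain reusable intermediate results without rereading the raw trace files.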
14 April 20, 2023
Benefits from a DBMS-based Approach
• Organize and manage data, related metadata, analysis results and tools
• Data becomes structured and has semantics
  • Processing and updating data is easier
  • Tools “understand” the data: higher-level programming
• Searching is more efficient (indexes)
• Store reusable intermediate results
• It is easier to combine different data sources
  • E.g. across OSI layers
15 April 20, 2023
SELECT plot_ts_hist('SELECT * FROM iat(t2.cnxid,t2.reverse,"packets")', 'histogram.pdf')
FROM (SELECT cnxid, reverse
      FROM cnxs, (SELECT max(throughput) FROM cnxs) AS t1
      WHERE cnxs.throughput = t1.max) AS t2;
Histogram of the packet inter-arrival times of the fastest connection
[Figure: data flow of the query. The connections table (connection id, bytes, packets, tput, …) and the packets table (connection id, timestamp, start #seq, end #seq, flags, …) feed iat(…), whose output plot_ts_hist() renders into histogram.pdf.]
17 April 20, 2023
Scope
• Study long-lived TCP connections
  • Short connections are another topic
    • Dominated by slow start?
• Assume FIFO scheduling
  • Necessary for link capacity estimations with packet dispersion techniques
  • Reasonable assumption for most traffic
  • May not hold for cable modem and 802.11 access networks
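To see why FIFO scheduling matters, recall the basic packet-pair idea: two packets sent back to back leave a FIFO bottleneck separated by the time the link needs to serialize one packet, so capacity is roughly packet size divided by that dispersion. The Python sketch below shows this arithmetic only. The dispersion values are hypothetical, and taking the median is a simple assumption for damping outliers, not the estimator used in this work.

```python
from statistics import median


def capacity_estimates(packet_bytes, dispersions_s):
    """Per-pair capacity estimates in bit/s: packet size divided by dispersion."""
    return [packet_bytes * 8 / d for d in dispersions_s if d > 0]


# Hypothetical dispersions (seconds) of back-to-back 1500-byte packets.
dispersions = [0.012, 0.012, 0.013, 0.012, 0.030]
estimates = capacity_estimates(1500, dispersions)

# The median damps outliers caused by cross traffic between the two packets.
capacity = median(estimates)
print(f"estimated capacity: {capacity / 1e6:.2f} Mbit/s")
```

With a non-FIFO scheduler the spacing between two packets no longer reflects the serialization time alone, which is why the estimate can break down on cable modem and 802.11 links.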
18 April 20, 2023
Limitation Causes for TCP Throughput
• Application
• Transport layer
  • TCP receiver
    • Receiver window limitation
  • TCP protocol
    • Slow start…
• Network layer
  • Bottleneck link
19 April 20, 2023
• Application that sends larger bursts separated by idle periods
  • BitTorrent, HTTP/1.1 (persistent)
[Figure: connection timeline alternating between transfer periods and periods with only keep-alive messages]
20 April 20, 2023
Limitation Causes: Application
• The application does not even attempt to use all network resources
• TCP connections are partitioned into two kinds of periods:
  • Bulk Transfer Period (BTP): the application constantly provides data to transfer
    • Never runs out of data in buffer B1
  • Application Limited Period (ALP): opposite of BTP
    • TCP has to wait for data because B1 is empty
[Figure: sender and receiver stacks (application over TCP) connected by the network; B1 is the buffer between the sending application and TCP]
21 April 20, 2023
Limitation Causes: TCP Receiver
• Receiver advertised window limits the rate
  • Max amount of outstanding bytes = min(cwnd, rwnd)
  • Sender is idle waiting for ACKs to arrive
• Flow control
  • Sender application overflows receiving application
  • Buffer B2 is full
• Configuration problem (unintentional)
  • Default receiver advertised window is set too low
  • Window scaling is not enabled
[Figure: sender and receiver stacks (application over TCP) connected by the network; B2 is the buffer on the receiving side]
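Because at most min(cwnd, rwnd) bytes can be outstanding per round trip, the advertised window puts a ceiling of roughly rwnd/RTT on throughput. The Python sketch below works through one example. The RTT and cwnd values are hypothetical, while 65535 bytes is the largest window the 16-bit TCP header field can advertise when window scaling is not enabled.

```python
def receiver_limited_rate(cwnd_bytes, rwnd_bytes, rtt_s):
    """Upper bound on throughput in bit/s: at most min(cwnd, rwnd) bytes per RTT."""
    outstanding = min(cwnd_bytes, rwnd_bytes)
    return outstanding * 8 / rtt_s


# Without window scaling the advertised window cannot exceed 65535 bytes.
MAX_UNSCALED_RWND = 65535

# Hypothetical path: 100 ms RTT, congestion window much larger than rwnd.
bound = receiver_limited_rate(
    cwnd_bytes=1_000_000, rwnd_bytes=MAX_UNSCALED_RWND, rtt_s=0.1
)
print(f"throughput bound: {bound / 1e6:.2f} Mbit/s")
```

On this hypothetical path the bound is a little over 5 Mbit/s, so any faster access link cannot be filled by a single connection unless the window is enlarged.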
22 April 20, 2023
Limitation Causes: Network
• Limitation is due to congestion at a bottleneck link
  • Shared bottleneck: the connection obtains only a fraction of its capacity
  • Non-shared bottleneck: the connection obtains all of its capacity
23 April 20, 2023
Our Approach to Root Cause Analysis
• Divide & Conquer
  1. Partition connections into BTPs and ALPs
     • Filter out application impact
  2. Analyze the bulk transfer periods for limitation by
     • TCP receiver
     • TCP protocol
     • Network
• Methods are based on metrics computed from packet headers
24 April 20, 2023
Why filter out application effect?
• Many TCP/IP-level traffic studies do not account for the application effect
  • RTTs, burstiness…
  • They try to study network properties but end up measuring the application effect instead!
25 April 20, 2023
Distinguishing BTPs from ALPs: Isolate & Merge algorithm
Phase 1: Isolate
• Fact: TCP always tries to send MSS-size packets
• Consequence: small packets (size < MSS) and idle time indicate application limitation
  • Buffer between application and TCP is empty
[Figure: packet timeline with MSS packets and packets smaller than MSS; an idle time > RTT and a period with a large fraction of small packets are each marked as an ALP]
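The Isolate idea can be sketched in Python as follows. This is a simplified illustration, not the algorithm as implemented in this work: it cuts the packet sequence at idle gaps longer than one RTT and labels a period as an ALP when more than a hypothetical `small_fraction` of its packets are smaller than the MSS. The `Packet` type, the parameter names, and the sample trace are assumptions, and a real implementation would need to treat the last, possibly short, packet of a transfer separately.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Packet:
    ts: float   # capture timestamp in seconds
    size: int   # TCP payload size in bytes


def isolate(packets: List[Packet], mss: int, rtt: float,
            small_fraction: float = 0.5) -> List[Tuple[str, float, float, int]]:
    """Return (label, start, end, bytes) periods for one direction of a connection."""
    if not packets:
        return []
    # Cut the packet sequence wherever the sender stayed idle for more than one RTT.
    groups = [[packets[0]]]
    for prev, pkt in zip(packets, packets[1:]):
        if pkt.ts - prev.ts > rtt:
            groups.append([pkt])
        else:
            groups[-1].append(pkt)
    periods = []
    for group in groups:
        # A large share of sub-MSS packets suggests the application ran out of data.
        small = sum(1 for p in group if p.size < mss)
        label = "ALP" if small / len(group) > small_fraction else "BTP"
        periods.append((label, group[0].ts, group[-1].ts, sum(p.size for p in group)))
    return periods


# Hypothetical trace: ten MSS-sized packets, a long pause, then three small packets.
trace = [Packet(0.01 * i, 1460) for i in range(10)]
trace += [Packet(2.0 + 0.01 * i, 100) for i in range(3)]
periods = isolate(trace, mss=1460, rtt=0.1)
for label, start, end, nbytes in periods:
    print(label, start, end, nbytes)
```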
26 April 20, 2023
Distinguishing BTPs from ALPs: Isolate & Merge algorithm
Phase 2: Merge
• Why?
  • After Isolate, BTPs may be separated by very short ALPs
  • Analyze impact of the application
    • How much do ALPs decrease overall throughput?
• How?
  • Merge subsequent transfer periods separated by an ALP to create a new BTP
  • Mergers controlled with the drop parameter
  • Iterate until all possible mergers are performed
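A minimal Python sketch of the Merge phase is given below. The slide does not define how the drop parameter is applied, so the rule used here is an assumption: two BTPs separated by an ALP are merged when the throughput of the merged period is at least (1 - drop) times the higher throughput of the two BTPs. The `Period` type and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Period:
    label: str     # "BTP" or "ALP"
    start: float   # seconds
    end: float     # seconds
    nbytes: int

    @property
    def tput(self) -> float:
        """Throughput in bytes/s; zero-length periods count as zero throughput."""
        duration = self.end - self.start
        return self.nbytes / duration if duration > 0 else 0.0


def merge(periods: List[Period], drop: float) -> List[Period]:
    """Merge BTP-ALP-BTP triples while the assumed throughput-drop rule allows it."""
    periods = list(periods)
    merged = True
    while merged:  # iterate until no further merger is possible
        merged = False
        for i in range(len(periods) - 2):
            a, gap, b = periods[i:i + 3]
            if (a.label, gap.label, b.label) != ("BTP", "ALP", "BTP"):
                continue
            candidate = Period("BTP", a.start, b.end,
                               a.nbytes + gap.nbytes + b.nbytes)
            # Assumed rule: tolerate at most a `drop` fraction of throughput loss.
            if candidate.tput >= (1 - drop) * max(a.tput, b.tput):
                periods[i:i + 3] = [candidate]
                merged = True
                break
    return periods


# Hypothetical periods: two 1 MB/s BTPs separated by a 100 ms ALP.
example = [
    Period("BTP", 0.0, 1.0, 1_000_000),
    Period("ALP", 1.0, 1.1, 0),
    Period("BTP", 1.1, 2.1, 1_000_000),
]
print(len(merge(example, drop=0.10)))
print(len(merge(example, drop=0.01)))
```

Under this assumed rule, a larger drop tolerates longer ALPs inside a BTP, so sweeping the parameter shows how much the application pauses cost in throughput.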
27 April 20, 2023
BTP Analysis
1. Compute limitation scores for each BTP
   • 4 quantitative scores in [0,1]
   • We use retransmission rates, inter-arrival time patterns, path capacity, RTT etc.
2. Perform classification of BTPs into limitation causes
   • Map (combination of) limitation scores into a cause
   • Threshold-based scheme
28 April 20, 2023
Classification scheme
4 thresholds need to be set
[Figure: classification scheme based on the b-score, the dispersion score, the retransmission score, and the receiver window limitation score]
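A threshold-based mapping from four scores to a cause can be sketched as below. The slide text does not preserve the actual decision scheme, so everything specific here is hypothetical: the order of the tests, the direction of each comparison, the threshold values, and the reading of a low dispersion score as "throughput close to path capacity". The sketch only shows the general shape of such a classifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical values; in practice they are calibrated on reference traces.
    rwnd: float = 0.5
    b: float = 0.5
    dispersion: float = 0.2
    retransmission: float = 0.01


def classify(rwnd_score: float, b_score: float, disp_score: float,
             retx_score: float, th: Thresholds = Thresholds()) -> str:
    """Map four scores in [0, 1] to a cause; the decision order is illustrative."""
    for score in (rwnd_score, b_score, disp_score, retx_score):
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
    if rwnd_score > th.rwnd and b_score > th.b:
        return "receiver window"
    if disp_score < th.dispersion:
        return "unshared bottleneck"
    if retx_score > th.retransmission:
        return "shared bottleneck"
    return "unknown"


print(classify(0.9, 0.9, 0.8, 0.0))
print(classify(0.1, 0.1, 0.05, 0.0))
```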
29 April 20, 2023
Classification: calibrating the thresholds
• Difficult task: Diversity vs. Control
  • Reference data needs to be representative & diverse enough
    • No simulations
  • Need to control experiments in some way to get what we want
• Reference data with partially controlled experiments
  • Try to generate transfers limited by a certain cause
  • FTP downloads from Fedora Core mirror sites
    • 232 sites covering all continents
  • Artificial bottleneck links with rshaper
    • Network limitation
  • Nistnet to add delay
    • Receiver limitation (Wr/RTT < bw)
  • Control the number of simultaneous downloads
    • Unshared vs. shared bottleneck
[Figure: downloads over the Internet from mirror sites in Australia, Japan, Finland, and the USA to Eurecom, passing through rshaper and Nistnet]
30 April 20, 2023
Classification: calibrating the thresholds (example)
• Bottleneck set at 1 Mbit/s, 1 download at a time
[Figure: plot annotated “set th1 here”]
32 April 20, 2023
Motivation
• Stress test for our techniques
  • Do we learn useful things?
• Knowing throughput limitations (= performance) is useful
  • ISPs want satisfied clients
  • Need to know what’s going on before things can be improved
• Installed InTraBase at France Telecom to study traffic at their ADSL access network
  • Root cause analysis techniques implemented within InTraBase
33 April 20, 2023
Measurement Setup
• 24 hours of traffic on March 10, 2006
  • 290 GB of TCP traffic
    • 64% downstream, 36% upstream
• Observed packets from ~3000 clients, analyzed only 1335
  • Excluded clients did not generate enough traffic for RCA
[Figure: network diagram with the Internet, the collect network, and the access network; annotation “Two pcap probes here”]
34 April 20, 2023
Warming up…
• Connections
  • Size distribution highly skewed
  • Use only 1% of them for RCA
    • Represent > 85% of all traffic
• Clients
  • Heavy-hitters: 15% of clients generate 85-90% of traffic (up & down)
  • Low access link utilization
    • Why?
35 April 20, 2023
Results of Limitation Analysis
• Striking result
  • Application limits performance of over 80% of clients
  • What’s going on?
36 April 20, 2023
Application analysis: Application limited traffic
• Quite stable and symmetric volumes
  • Over 80% of all traffic
• eDonkey and “other” dominate
[Figure: application limited traffic volumes by application; legend: P2P, other, eDonkey]
37 April 20, 2023
Application analysis: Saturated access link
• No recognized P2P
• Asymmetric port 80/8080 downstream
  • Real Web traffic?
38 April 20, 2023
Connecting the evidence…
• Most clients’ performance limited by applications
• Very low link utilizations for application limited traffic
• Most of application limited traffic seems to be P2P
• Peers often have asymmetric uplink and downlink capacities
• P2P applications/users enforce upload rate limits
• Most clients’ download performance seems to suffer from P2P clients drastically limiting their upload rates
[Figure: a downloading client with low utilization receives data over the Internet from uploading clients with low capacity + rate limiter]
40 April 20, 2023
Conclusions: Claims and contributions
1. (Part 1) DBMSs provide powerful infrastructure for analysis of passive traffic measurements
   • Performance is good.
2. (Part 2) We can infer root causes for TCP throughput using bidirectional packet traces at a single measurement point located anywhere on the TCP/IP path.
3. (Part 2) Today’s Internet applications interact in diverse ways with TCP
   • Bias/error in TCP/IP path analysis
   • Filter out their effects first
4. (Part 3) TCP root cause analysis techniques with DBMS-based analysis enable:
   • performance evaluation of applications,
   • evaluation of network utilization, and
   • identification of TCP configuration problems.
41 April 20, 2023
The case is not yet closed…
• Short connections
  • Challenge previous “old” results with RCA
  • What about persistent connections?
• Wireless traffic
  • Non-FIFO scheduling
  • Link-layer issues
• Extended case study on ADSL clients
  • We saw a day, what about a week?
  • Trends, consistency
42 April 20, 2023
Thank you!
Questions?