IIT BOMBAY NETWORK MEASUREMENTS
MONITORING THE PERFORMANCE OF BACKHAUL CAMPUS NETWORK
Submitted by: Manveer Singh Chawla
Guided by: Prof. Purushottam Kulkarni
OVERVIEW
• Motivation
• Problem statement
• Related Work
• IIT Bombay Network Background
• Our Solution
  • Architecture
  • Implementation
• Experimental Evaluation
  • Network measurement data
  • Proxy log analysis
• Future Work
• Thesis Contribution
MOTIVATION
Consider the following scenarios:
• A user writes a mail and clicks send, but sending fails.
• A user is talking with a friend on gtalk and it disconnects.
• A user is browsing the web, but the browsing speed is very slow.
What will a novice user do?
• No structured approach:
  • Starts fiddling with network settings
  • Reboots the machine
• Result?
  • Wastes a lot of time
  • May not even find the cause
MOTIVATION CNTD.
Multiple points of failure:
• User's machine
  • Incorrect network settings
  • Failure of ethernet card/cable
• LAN
  • Switch
  • Router
  • DNS
  • Proxy
• WAN
  • Web server
  • Network congestion
The user has no control over LAN / WAN failures.
PROBLEM DEFINITION
1. Build a measurement tool that monitors the status of elements in the network backbone, so that in case of a network failure it can detect and diagnose the cause. These elements include the subnet routers, switches, DNS servers and the network proxy.
2. Carry out a measurement study of the network proxy to examine how response time, traffic pattern and object size vary across the day.
RELATED WORK
• Jigsaw
  • Merges traces to passively measure queuing delays and throughput
  • We summarize a trace to determine the status of nodes
• WiFiProfiler
  • Fault diagnosis in a wireless setting for the user machine
  • Performs distributed analysis
  • Ours is centralized processing of a wired network
• Network measurement tools
  • Pathchar: bandwidth, queue size, packet drop rate
  • Traceroute: RTT, topology
IIT BOMBAY NETWORK
MAP
SERVICES
• Proxy: netmon
  • Web caching
  • Authentication
  • Content filtering
• Firewall
  • NATing
  • Packet filtering
  • Internal and external
• DNS
  • DNS server for campus
  • DNS servers in a few subnets
• Monitoring
  • Traffic statistics
WORKING OF PROXY
MEASUREMENT CHALLENGES
• Permission from the Computer Centre
• Large volume of data
• Unaware and amateur users
• Specific h/w required
• What to measure in such a large network
• Use existing infrastructure
• Old h/w: unpredictable failures
• WAN: the firewall makes failures difficult to diagnose
OUR SOLUTION
ARCHITECTURE
SERVER NODE
• Send logs to diagnostic-node after collection
Figure labels: ICMP PORT_UNREACHABLE; query reply from server; bad request on HTTP GET request
CLIENT NODE
• Send logs to diagnostic-node after collection
DIAGNOSTIC NODE
DIAGNOSTIC NODE CNTD.
Determining the status of proxy (netmon)
Flowchart labels: Failure seen; Is it seen by all? (Yes / No); Machine down with failure; Machine overloaded
DIAGNOSTIC NODE CNTD.
Determining the status of DNS servers
Flowchart labels: Is it not reachable for all? (Yes / No); Machine down; Machine overloaded; Send back-to-back queries; Internal answered, external not answered; Problem in hierarchy; Other cases
DIAGNOSTIC NODE CNTD.
• Offline mode: statistics for a specified time period
• Online mode: statistics for the last 10 minutes
• Remote query mode: query the status of a node at a specified time
EXPERIMENTAL EVALUATION
SETUP
• Server node at 8 locations around the campus
• Client node at 3 locations around the campus
• Collected data from 26th March to 15th June
• No data for 25th May to 2nd June
• Measurements for the following nodes:

IP Address      Name
192.0.50.1      h8router-interface1
10.12.250.1     h8router-interface2
10.12.250.2     h12switch
10.2.250.1      h3router-interface1
10.105.250.1    cserouter-interface1
10.129.1.1      kresit-dns
10.200.1.11     iitbombay-dns
10.105.1.7      cse-dns
10.107.1.250    ccrouter-interface1
192.0.20.2      ccrouter-interface2
192.0.40.2      ccrouter-interface3
192.0.50.2      ccrouter-interface4
10.129.250.1    ccrouter-interface5
10.129.1.250    ccrouter-interface6
10.165.250.1    ccrouter-interface7
netmon.iitb     netmon
DNS SERVICE TIME DISTRIBUTION
DNS SERVICE TIME DISTRIBUTION: OBSERVATIONS
• Median response time is very low for all servers
• Average is significantly greater than the median
  • The distribution is heavy tailed
• kresit-dns has a much higher average and 90th percentile
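A mean far above the median is the usual signature of a heavy tail. The sketch below illustrates this with made-up service times (hypothetical values, not the measured data); the `percentile` helper uses the nearest-rank definition, which is an assumption about how the 90th percentile might be computed.

```python
import statistics

def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, (len(ordered) * p + 99) // 100)  # integer ceiling of n*p/100
    return ordered[rank - 1]

# Hypothetical DNS service times in milliseconds: mostly fast, a few very slow.
service_times_ms = [2, 3, 3, 4, 4, 5, 5, 6, 2500, 4000]

mean_ms = statistics.mean(service_times_ms)      # pulled up by the two slow replies
median_ms = statistics.median(service_times_ms)  # unaffected by the tail
p90_ms = percentile(service_times_ms, 90)
print(mean_ms, median_ms, p90_ms)
```

Two slow replies out of ten are enough to move the mean by two orders of magnitude while the median stays at a few milliseconds.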
OUTAGE DISTRIBUTIONS
• Most of the outages are short.
• Median is <= 2 minutes and the 90th percentile is <= 10 minutes for almost all nodes.
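Outage lengths like these can be derived from a time-ordered series of up/down samples for a node. The following is a minimal sketch under assumed inputs: the `(minute, is_up)` sample format and the rule for open-ended outages are illustrative choices, not the tool's actual log format.

```python
import statistics

def outage_lengths(samples):
    """Return outage durations in minutes from time-ordered (minute, is_up) samples.

    An outage starts at the first 'down' sample after an 'up' sample and ends
    at the next 'up' sample. An outage still open at the end of the trace is
    dropped because its length is unknown.
    """
    lengths = []
    down_since = None
    for minute, is_up in samples:
        if not is_up and down_since is None:
            down_since = minute
        elif is_up and down_since is not None:
            lengths.append(minute - down_since)
            down_since = None
    return lengths

# Hypothetical status samples taken every 2 minutes.
samples = [(0, True), (2, False), (4, True), (6, True), (8, False),
           (10, False), (12, False), (14, True), (16, False)]
lengths = outage_lengths(samples)
median_length = statistics.median(lengths)
print(lengths, median_length)
```

The resolution of the result is bounded by the probing interval, so an outage shorter than one interval can be missed entirely.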
PERCENTAGE DOWNTIME ACROSS DAYS
• On most days, downtime is < 2 % for most of the nodes.
• There is no clear pattern across days.
COMBINED DOWNTIME
• netmon ~ 0.24 %
• The percentage of time at least one interface is not working is close to the percentage of time all interfaces are not working
  • Either the machine goes down
  • Or the measurements are not taking place at the same time
    • Time to check the status of a machine is variable

Element           At least one down (%)   All not working (%)
Hostel 8 Router   2.144                   1.980
Hostel 3 Router   0.686                   0.686
CC Router         0.657                   0.565
DNS Servers       0.414                   0.406
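The two columns of the table can be computed from per-round interface status. This is a sketch only; the round structure (one dict of interface name to up/down per measurement round) and the interface names are assumptions made for illustration.

```python
def combined_downtime(rounds):
    """rounds: list of dicts mapping interface name -> True if it was up in that round.

    Returns (percent of rounds with at least one interface down,
             percent of rounds with every interface down).
    """
    total = len(rounds)
    any_down = sum(1 for r in rounds if not all(r.values()))
    all_down = sum(1 for r in rounds if not any(r.values()))
    return 100.0 * any_down / total, 100.0 * all_down / total

# Hypothetical measurement rounds for a router with two interfaces.
rounds = [
    {"if1": True, "if2": True},
    {"if1": True, "if2": True},
    {"if1": False, "if2": True},   # only one interface down
    {"if1": False, "if2": False},  # whole machine likely down
    {"if1": True, "if2": True},
]
at_least_one, all_not_working = combined_downtime(rounds)
print(at_least_one, all_not_working)
```

When the two percentages are nearly equal, as in the table, most failures affect every interface at once, which points to the machine rather than a single link.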
RESULTS SUMMARY
• Router failure > DNS failure > netmon failure
• Median node outage <= 2 min
• Small number of outages each day
• No pattern across days
• Average DNS service time ~ 300 ms
• netmon failure is less frequent than generally perceived
  • Dependence on other services: LDAP, DNS
• A lot of machinery in the network is old
PROXY LOG ANALYSIS
MOTIVATION
• Per-day logs are huge, over 6 GB
• Storing logs for long-term historical analysis is a problem
  • Over 2 TB for a year!
• What is the traffic distribution?
• What is the object size distribution?
• What is the response time distribution?
• Is there some trend across days?
• What strategy can be used to select logs for long-term historical analysis?
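The yearly storage figure follows directly from the daily log size. A short worked check, assuming decimal units (1 TB = 1000 GB) and a 365-day year:

```python
GB_PER_DAY = 6        # approximate size of one day's proxy log, from the slide
DAYS_PER_YEAR = 365

gb_per_year = GB_PER_DAY * DAYS_PER_YEAR   # total gigabytes in a year
tb_per_year = gb_per_year / 1000           # decimal terabytes (assumed unit)
print(gb_per_year, tb_per_year)
```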
PROXY LOG ANALYSIS
• Log file has the following format:
  Month Date Time Proxy_Server squid_process_id epoch_timestamp process_time_ms source_ip tcp_status/http_status_code object_size request_type URL user_id hierarchy_code/server_ip object_type/object_sub_type
• Stored in a MySQL database
• Processed logs for a week, from May 14, 2009 to May 20, 2009
• Size of the log file ~ 6 GB
• Number of requests in a day ~ 22 million
• Bytes downloaded ~ 401.6 GB
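A line in the format listed above can be split into typed fields before loading it into a database. The sketch below assumes the fields are whitespace separated and that no field contains spaces; the sample line is invented for illustration and is not taken from the real logs.

```python
FIELDS = [
    "month", "date", "time", "proxy_server", "squid_process_id",
    "epoch_timestamp", "process_time_ms", "source_ip", "status",
    "object_size", "request_type", "url", "user_id", "hierarchy",
    "content_type",
]

def parse_line(line):
    """Split one log line into a dict, converting numeric fields."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    rec = dict(zip(FIELDS, parts))
    # The three slash-joined fields are split into their two halves.
    tcp_status, _, http_status = rec.pop("status").partition("/")
    hierarchy_code, _, server_ip = rec.pop("hierarchy").partition("/")
    object_type, _, object_sub_type = rec.pop("content_type").partition("/")
    rec.update(
        epoch_timestamp=float(rec["epoch_timestamp"]),
        process_time_ms=int(rec["process_time_ms"]),
        object_size=int(rec["object_size"]),
        tcp_status=tcp_status, http_status_code=http_status,
        hierarchy_code=hierarchy_code, server_ip=server_ip,
        object_type=object_type, object_sub_type=object_sub_type,
    )
    return rec

# Hypothetical log line, made up to match the field order on the slide.
sample = ("May 14 10:15:02 netmon 1234 1242276302.123 480 10.129.41.189 "
          "TCP_MISS/200 1536 GET http://example.com/index.html user1 "
          "DIRECT/93.184.216.34 text/html")
entry = parse_line(sample)
print(entry["object_type"], entry["process_time_ms"], entry["object_size"])
```

Grouping the parsed records by `object_type` or by the subnet of `source_ip` would give the per-category and per-location breakdowns.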
TRAFFIC DISTRIBUTION ON OBJECT TYPE: REQUESTS
• Percentage distribution remains the same across days
• Multimedia traffic is the least, ~ 0.2 %
• Text traffic is the maximum, ~ 40 %
TRAFFIC DISTRIBUTION ON OBJECT TYPE: DOWNLOADED BYTES
• Percentage distribution remains the same across days
• Multimedia traffic is the maximum, ~ 38 %
TRAFFIC DISTRIBUTION ON LOCATION: REQUESTS
• Percentage distribution remains the same across weekdays
• Increase in hostel traffic on weekends
• Decrease in academic traffic on weekends
TRAFFIC DISTRIBUTION ON LOCATION: DOWNLOADED BYTES
• Percentage distribution for downloaded bytes follows the number of requests
• Object type distribution remains the same across days, so the majority of users behave similarly in different locations
TRAFFIC DISTRIBUTION: SUMMARY

Category   Application (%)   Image (%)   Text (%)   Multimedia (%)   Other (%)
Requests   11.02             35.43       42.76      0.18             10.61
Bytes      30.52             12.05       14.94      38.28            4.20

Category   Admin (%)   Acad (%)   Hostel (%)   Resnet (%)
Requests   3.50        28.16      61.90        6.58
Bytes      2.83        25.59      64.73        6.85
NUMBER OF ARRIVALS PER SECOND
• Lower activity from 2 a.m. to 11 a.m. (LAN curtailment)
• Higher activity at 3 p.m., 7 p.m. and 11 p.m.
• Average ~ 250, standard deviation ~ 135
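An arrival curve of this kind can be built by bucketing request timestamps into one-second bins. The sketch below uses made-up timestamps; the function name and input format are assumptions. It also checks that the reported average is consistent with roughly 22 million requests spread over a day.

```python
from collections import Counter
import statistics

def arrivals_per_second(epoch_timestamps):
    """Count requests in each one-second bin between the first and last arrival."""
    counts = Counter(int(t) for t in epoch_timestamps)
    first, last = min(counts), max(counts)
    # Seconds with no arrivals must be included as zeros.
    return [counts[s] for s in range(first, last + 1)]

# Hypothetical request timestamps (seconds since the epoch, shortened).
timestamps = [100.1, 100.5, 100.9, 101.2, 103.0, 103.7]
series = arrivals_per_second(timestamps)
mean_rate = statistics.mean(series)
spread = statistics.pstdev(series)

# Consistency check against the daily request count quoted in the deck.
daily_average = 22_000_000 / 86_400
print(series, mean_rate, spread, daily_average)
```

Leaving out the empty seconds would overstate the mean and understate the standard deviation, which matters for the quiet early-morning hours.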
NUMBER OF REQUESTS CONCURRENTLY SERVED
• Average ~ 2000, standard deviation ~ 859
• Follows the arrival curve
MEAN RESPONSE TIME AT TIME OF DAY
• Response time remains almost constant throughout the day
• A peak at around 4 a.m.
• Average ~ 9.8 seconds
MEDIAN RESPONSE TIME AT TIME OF DAY
• Median response time remains constant throughout the day, at 480 ms
• The median curve is a better estimate of the typical response time on a day
• Neither the median nor the mean response time follows the arrival curve or the number of requests concurrently served
CUMULATIVE RESPONSE TIME DISTRIBUTION
• For multimedia the curve becomes linear
• For the remaining categories it is heavy tailed
• Median response times: application ~ 472 ms, text ~ 563 ms, image ~ 172 ms, multimedia ~ 10175 ms and other ~ 672 ms
CUMULATIVE OBJECT SIZE DISTRIBUTION
• For multimedia, object sizes are more evenly distributed
• The remaining categories have 90 % of objects < 10 KB
• Median object sizes: application ~ 1.5 KB, text ~ 0.8 KB, image ~ 1.7 KB, multimedia ~ 903 KB and other ~ 0.46 KB
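Statements such as "90 % of objects < 10 KB" are read off an empirical cumulative distribution. A minimal sketch of how such a curve can be evaluated, using invented object sizes rather than the measured ones:

```python
from bisect import bisect_right

def empirical_cdf(values):
    """Return a function giving the fraction of samples less than or equal to x."""
    ordered = sorted(values)
    n = len(ordered)

    def cdf(x):
        return bisect_right(ordered, x) / n

    return cdf

# Hypothetical object sizes in KB for one category: many small, one large.
sizes_kb = [0.4, 0.8, 1.5, 1.7, 2.0, 3.0, 5.0, 8.0, 9.5, 900.0]
cdf = empirical_cdf(sizes_kb)
print(cdf(10), cdf(0.1), cdf(1000))
```

Sorting once and using binary search keeps each lookup cheap, which helps when a day of logs holds millions of objects.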
RESULTS SUMMARY
• Multimedia traffic is the major part of WAN traffic
• Percentage traffic distribution
  • Similar across object types on different days
  • Similar in different areas, except on weekends
• Thus any log file can be selected as a representative of the week
  • A larger log file for more data
  • One for the weekend and one for weekdays
FUTURE WORK
• Characterization of request processing time at the proxy
• Explore the other causes of failure, including the LDAP service
• Explore failures on the ISP side, from a point outside the network
• Study the traffic within the LAN
THESIS CONTRIBUTIONS
• Studied the tools and methodologies used for network measurement
• Surveyed and documented the campus network of IIT Bombay
  • Architecture
  • Services
  • Failures
• Developed a tool to detect some of the failures
  • Can be easily extended to detect others
• Experimental evaluation of the tool by setting up a testbed
• Measurement analysis of proxy logs
BIBLIOGRAPHY
[1] Computer Center, IIT Bombay. http://www.cc.iitb.ac.in
[2] dnscache. http://cr.yp.to/djbdns/dnscache.html
[3] Iperf. http://dast.nlanr.net/Projects/Iperf/
[4] iptables. http://www.netfilter.org/projects/iptables/index.html
[5] Jpcap: a Java library for capturing and sending network packets. http://netresearch.ics.uci.edu/kfujii/jpcap/doc/
[6] Squid logs. http://wiki.squid-cache.org/SquidFaq/SquidLogs
[7] Traceroute. http://sourceforge.net/projects/traceroute
[8] Ultra monkey. http://www.ultramonkey.org/
[9] Wikimedia. http://www.squid-cache.org/Library/wikimedia.dyn
[10] Kostas G. Anagnostakis, Michael Greenwald, and Raphael Ryger. cing: Measuring network-internal delays using only existing infrastructure. In Proceedings of IEEE Infocom, April 2003.
[11] Ranveer Chandra, Venkata N. Padmanabhan, and Ming Zhang. WiFiProfiler: Cooperative Diagnosis in Wireless LANs. In Proceedings of the 4th international conference on Mobile systems, applications and services, June 2006.
[12] Yu-Chung Cheng, John Bellardo, Peter Benko, Alex C. Snoeren, Geoffrey M. Voelker, and Stefan Savage. Jigsaw: Solving the Puzzle of Enterprise 802.11 Analysis. In Proceedings of the 2006 conference on Applications, technologies, architectures, and protocols for computer communications, September 2006.
[13] Ramesh Govindan and Hongsuda Tangmunarunkit. Heuristics for Internet map discovery. In Proceedings of the Nineteenth Annual Joint Conference of the IEEE Computer and Communications Societies, 2000.
[14] Bradley Huffaker, Marina Fomenkov, David Moore, and kc claffy. Macroscopic analyses of the infrastructure: measurement and visualization of Internet connectivity and performance. In Proceedings of Passive and Active Measurements, 2001.
[15] Van Jacobson. pathchar - a tool to infer characteristics of Internet paths, 1997.
[16] Alex Rousskov and Valery Soloviev. A performance study of the Squid proxy on HTTP/1.0. World-Wide Web Journal, Special Edition on WWW Characterization and Performance Evaluation, 1999.
[17] Stefan Savage. Sting: a TCP-based Network Measurement Tool. In Proceedings of the Second Conference on USENIX Symposium on Internet Technologies and Systems, 1999.
[18] Subhabrata Sen and Jia Wang. Analyzing peer-to-peer traffic across large networks. In Proceedings of the 2006 ACM CoNEXT conference, 2006.
[19] Nirav S. Uchat. IIT Bombay web traffic characterization.
[20] Ameya P. Usgaonkar. Network Performance Analysis by Mining Multi-Variate Time Series Data, January 2001.
Extra Slides
RELATED WORK
Passive Measurement
• WiFiProfiler
  • Collaborative diagnosis, information from neighbors
  • Blame assignment algorithm to predict the actual cause
• Jigsaw
  • Collects and merges traces from multiple vantage points
  • Creates a single unified view of the network
    • Large-scale synchronization
    • Frame unification
  • Measures
    • Queuing delays experienced by users
    • Throughput: compares observed vs expected (using RTT, path loss)
    • Effect of mobility techniques: scanning, DHCP, initial association
RELATED WORK CNTD
• Squid log analysis by Rousskov et al.
  • Logs from seven proxies, 18 days of logs
  • Applied a patch to Squid to measure: proxy connect time, client connect time, server reply time, proxy reply time, swap-in time and swap-out time
  • Studied traffic distribution, response time at proxy, number of requests at proxy, disk traffic intensity, disk utilization and disk response time, all against TCP_STATUS, i.e. HITS and MISS
  • Shortcomings:
    • No long-term historical analysis
    • No comparison of direct traffic with proxied traffic
• Active measurements
  • Pathchar: bandwidth, queue size, packet drop rate
  • Traceroute: RTT, topology
RELATED WORK CNTD
Active Measurement
• PathChar
  • Measures: bandwidth, queue size, packet drop rate
  • Uses the TTL field in the IP header
  • Sends a series of probes with varying packet size
  • Neglecting queuing delay, S_error/B and t_processing, the model reduces to RTT = S_packet/B
  • Packet loss: number of error messages received
  • Statistic for node n = statistic till the nth node - statistic till the (n-1)th node
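The differencing rule and the simplified model RTT = S_packet/B can be combined to estimate per-link bandwidth: the slope of RTT against packet size up to hop n is the sum of 1/B over the links so far, so subtracting consecutive slopes isolates one link. The sketch below illustrates that idea on synthetic, noise-free data; it is not pathchar's actual implementation, and the function names are hypothetical.

```python
def slope(points):
    """Least-squares slope of RTT (seconds) against packet size (bytes)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in points)
    den = sum((x - mean_x) ** 2 for x, _ in points)
    return num / den

def per_hop(cumulative):
    """Statistic for hop n = value up to hop n minus value up to hop n-1."""
    return [cur - prev for prev, cur in zip([0.0] + cumulative[:-1], cumulative)]

def link_bandwidths(rtt_by_hop):
    """rtt_by_hop[n] holds (packet_size, min_rtt) pairs measured up to hop n+1."""
    slopes = [slope(points) for points in rtt_by_hop]
    return [1.0 / s for s in per_hop(slopes)]

# Synthetic two-hop path: links of 1e6 and 5e5 bytes/s plus fixed latencies.
sizes = [64, 512, 1024, 1400]
hop1 = [(s, 0.001 + s / 1e6) for s in sizes]
hop2 = [(s, 0.003 + s / 1e6 + s / 5e5) for s in sizes]
bandwidths = link_bandwidths([hop1, hop2])
print(bandwidths)
```

On real paths the minimum RTT over many probes per size would be used, so that samples delayed by queuing do not distort the slope.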
APPLICATION LAYER FAILURES
• Web access failures
  • Service Unavailable
  • Connection timed out failure
  • Connection refused
  • Connection reset
  • Gateway Timeout
  • No data received
  • Connection closed at an intermediate byte
• DNS access failures
  • Connection timed out
  • Blank answer field
• Router access failures
  • No route to host
  • No response received
IMPLEMENTATION
Client Module
• Snoops on incoming packets using the jpcap library
• Node reachability
  • If any packets are received -> subnet switch reachable
  • If IP packets from another subnet are received -> subnet router reachable
  • If IP packets from the DNS server are received -> DNS server reachable
  • If IP packets from netmon are received -> netmon reachable
• Traffic characteristics
  • Size of packet
  • Delay using the inter-arrival time of two packets
• Threads to synchronize measurements and information sending
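The passive reachability rules can be written as a small function over the source addresses seen in one observation window. This is a sketch in Python rather than the Java/jpcap code the deck describes; the subnet prefix and the proxy address are hypothetical values chosen for the example.

```python
import ipaddress

def infer_reachability(source_ips, local_subnet, dns_ip, proxy_ip):
    """Apply the client-module rules to packet sources seen in one window.

    False only means no evidence was seen in this window; it does not
    prove that the element is down.
    """
    subnet = ipaddress.ip_network(local_subnet)
    sources = {ipaddress.ip_address(ip) for ip in source_ips}
    return {
        "switch": bool(sources),                               # any packet at all
        "router": any(ip not in subnet for ip in sources),     # packet from another subnet
        "dns": ipaddress.ip_address(dns_ip) in sources,
        "proxy": ipaddress.ip_address(proxy_ip) in sources,
    }

# Assumed addresses: the /24 prefix and proxy IP are illustrative only.
LOCAL = "10.129.41.0/24"
DNS = "10.129.1.1"
PROXY = "10.0.0.10"

only_local = infer_reachability(["10.129.41.5"], LOCAL, DNS, PROXY)
with_dns = infer_reachability(["10.129.41.5", "10.129.1.1"], LOCAL, DNS, PROXY)
print(only_local, with_dns)
```

Because the method is passive, a quiet network gives no information, which is one reason to pair it with the active probes of the server module.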
IMPLEMENTATION CNTD
Server Module
• Check for a plugged-in ethernet cable: status of interface using ifconfig
• Status of switch/router: hop-limited IP packets, using traceroute
• Status of DNS server: query to the DNS server using dig
• Status of proxy: an HTTP GET request at port 80 using wget
• Web download: using wget
• Uses the Java runtime library to run these utilities
• Synchronizes using SNTP, implemented in Java
IMPLEMENTATION CNTD
Communication Module
• Used for sending logs and querying diagnostic-nodes
• Implemented using the Java Net package
  • Receiver listens on a port
  • Sender connects and sends the logs/query
  • Our own protocol to send and receive messages
Logging Module
• Used by diagnostic, server and client nodes
• Stores logs in a directory hierarchy: ip/yyyy/mm/dd
• Unsent logs are stored to be sent in future
• New threads are created to write logs, to prevent blocking
• Implemented using Java threads and the Java IO package
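The ip/yyyy/mm/dd layout of the logging module can be derived from a node address and a timestamp. A minimal sketch, in Python rather than the Java used by the tool; the root directory name is an assumption.

```python
from datetime import datetime
from pathlib import PurePosixPath

def log_directory(root, node_ip, when):
    """Build the ip/yyyy/mm/dd directory for a node's logs on a given day."""
    return PurePosixPath(root) / node_ip / f"{when:%Y}" / f"{when:%m}" / f"{when:%d}"

# "logs" is a hypothetical root; the IP is kresit-dns from the node table.
path = log_directory("logs", "10.129.1.1", datetime(2009, 5, 14, 10, 15))
print(path)
```

Zero-padded month and day keep directory listings in chronological order, which simplifies selecting a date range in offline mode.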
IMPLEMENTATION CNTD
Diagnostic Module
• Uses the logs of server and client nodes
• Continuous mode
  • Analyzes statistics every 10 minutes
  • Statistics generated: node outages, percentage status distribution, last uptime status of nodes, DNS service time statistics
• Offline mode
  • User specifies the start and end time of measurement
  • Statistics generated: node outages, percentage status distribution, last uptime status of nodes, DNS service time statistics, node status at a given time
• Remote query mode
  • User can query the node status at a given time
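A remote query for "node status at a given time" amounts to finding the latest logged observation at or before that time. The sketch below assumes a time-sorted list of `(timestamp, status)` pairs, which is an illustrative representation and not the tool's actual log structure.

```python
from bisect import bisect_right

def status_at(log, query_time):
    """Return the most recent status at or before query_time, or None if none exists.

    log: list of (timestamp, status) pairs sorted by timestamp.
    """
    times = [t for t, _ in log]
    i = bisect_right(times, query_time)
    return log[i - 1][1] if i else None

# Hypothetical status log for one node (timestamps in seconds).
node_log = [(0, "up"), (600, "down"), (1200, "up")]
print(status_at(node_log, 900), status_at(node_log, 1200), status_at(node_log, -1))
```

Returning `None` for a time before the first observation keeps "no data" distinct from "down", which matters for the gap in the collected data.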
PERCENTAGE DOWN-TIME ON A DAY: DNS SERVERS
• On most days, percentage downtime is < 1 % for all servers
• No pattern in down-time across days
PERCENTAGE DOWN-TIME ON A DAY: NETMON
• On most days, percentage downtime is < 0.2 %
• No pattern in down-time across days
PERCENTAGE DOWN-TIME ON A DAY: HOSTEL 8 ROUTER
• On most days, percentage downtime is < 2 %
• No pattern in down-time across days
PERCENTAGE DOWN-TIME ON A DAY: CC ROUTER
• On most days, percentage downtime is < 1 %
• No pattern in down-time across days
OUTAGE LENGTH DISTRIBUTION: CC ROUTER
• Most of the outages are of smaller length
OUTAGE LENGTH DISTRIBUTION: DNS
• Most of the outages are of small length
• Smaller number of outages
OUTAGE LENGTH DISTRIBUTION: NETMON
• Most of the outages are of length < 3 min
OUTAGE LENGTH DISTRIBUTION: HOSTEL 8 ROUTER
• Most of the outages are of smaller length
OUTAGE LENGTHS
STATUS DISTRIBUTION
PROXY RESPONSE TIME VS USER RESPONSE TIME
EXPERIMENT: PROXY FAILURE
• Setup
  • wget to fetch Berkeley and netmon (http://netmon.iitb.ac.in)
  • Repeated every 6 minutes
  • From 2:42 on 22nd September to 1:06 on 25th September, from kresit (10.129.41.189)
  • A 400 Bad Request response, denoted by 1, indicates the proxy is up
  • -1 for a connection refused error
  • -3 for a 503 server error
• Result
  • netmon: 0.7 % connection refused errors
  • Berkeley: 8.7 % connection refused errors, 0.28 % 503 errors
  • Intersection of failures implies
    • Machine not running, or
    • Port is closed
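With each probe recorded as one of the codes above, the reported percentages and the intersection of failures follow from simple counting. The sketch below uses an invented four-probe trace per target; the series format and function names are assumptions made for illustration.

```python
UP, REFUSED, SERVER_ERROR = 1, -1, -3  # codes as defined in the setup

def percent_with_code(series, code):
    """Percentage of probes in a (time, code) series that returned the given code."""
    return 100.0 * sum(1 for _, c in series if c == code) / len(series)

def common_refusals(series_a, series_b):
    """Probe times at which both targets returned 'connection refused'."""
    refused_a = {t for t, c in series_a if c == REFUSED}
    refused_b = {t for t, c in series_b if c == REFUSED}
    return sorted(refused_a & refused_b)

# Hypothetical traces, one probe every 6 minutes (time in minutes).
netmon = [(0, UP), (6, UP), (12, REFUSED), (18, UP)]
berkeley = [(0, UP), (6, REFUSED), (12, REFUSED), (18, SERVER_ERROR)]

netmon_refused = percent_with_code(netmon, REFUSED)
berkeley_refused = percent_with_code(berkeley, REFUSED)
berkeley_503 = percent_with_code(berkeley, SERVER_ERROR)
both_refused = common_refusals(netmon, berkeley)
print(netmon_refused, berkeley_refused, berkeley_503, both_refused)
```

Refusals that appear in both traces at the same time point to the proxy itself, since the two fetches share only the path through it.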
EXPERIMENT: PROXY FAILURE CNTD
EXPERIMENT: DNS FAILURE
• Setup
  • dig to send back-to-back probes to dns.iitb.ac.in
  • Sent periodically, once every 2 minutes
  • Conducted from 22:06 on 17th September to 13:36 on 18th September, from kresit (10.129.41.189)
  • One query for an internal domain and the other for an external domain
  • Both domains randomly generated
  • 1 -> answer field present, 0 -> answer field not present
• Result
  • External queries failed 2.36 % of the time
  • Internal queries never failed
EXPERIMENT: DNS FAILURE CNTD