IIT BOMBAY NETWORK MEASUREMENTS
MONITORING THE PERFORMANCE OF BACKHAUL CAMPUS NETWORK

Submitted by: Manveer Singh Chawla

Guided by: Prof. Purushottam Kulkarni

OVERVIEW

• Motivation
• Problem statement
• Related Work
• IIT Bombay Network Background
• Our Solution
  - Architecture
  - Implementation
• Experimental Evaluation
  - Network measurement data
  - Proxy log analysis
• Future Work
• Thesis Contribution

MOTIVATION

Consider the following scenarios:
• A user writes a mail and clicks send, but sending fails
• A user is talking with a friend on gtalk and it disconnects
• A user is browsing the web, but the browsing speed is very slow

What will a novice user do?
• No structured approach:
  - Starts fiddling around with network settings
  - Reboots the machine
• Result?
  - Wastes a lot of time
  - May not even find the cause

MOTIVATION CNTD.

Multiple points of failure:
• User's machine
  - Incorrect network settings
  - Failure of ethernet card/cable
• LAN
  - Switch
  - Router
  - DNS
  - Proxy
• WAN
  - Web server
  - Network congestion

No user control over LAN / WAN failures

PROBLEM DEFINITION

1. Build a measurement tool which monitors the status of elements in the network backbone, such that in case of network failure it is able to detect and diagnose the cause of failure. These elements include the subnet routers, switches, DNS servers and network proxy.

2. A measurement study of the network proxy to study the response time variation, traffic pattern and object size variation across the day.

RELATED WORK

• Jigsaw
  - Merges traces to passively measure queuing delays and throughput
  - We summarize a trace to determine the status of nodes
• WiFiProfiler
  - Fault diagnosis in a wireless setting for the user machine
  - Performs distributed analysis
  - Ours is centralized processing of a wired network
• Network measurement tools
  - Pathchar: bandwidth, queue size, packet drop rate
  - Traceroute: RTT, topology

IIT BOMBAY NETWORK

SERVICES

• Proxy: netmon
  - Web caching
  - Authentication
  - Content filtering
• Firewall
  - NATing
  - Packet filtering
  - Internal and external
• DNS
  - DNS server for campus
  - DNS servers in a few subnets
• Monitoring
  - Traffic statistics

WORKING OF PROXY

MEASUREMENT CHALLENGES

• Permission from Computer Centre
• Large volume of data
• Unaware and amateur users
• Specific h/w required
• What to measure in such a large network
• Use existing infrastructure
• Old h/w: unpredictable failures
• WAN: firewall makes failures difficult to diagnose

OUR SOLUTION

ARCHITECTURE

SERVER NODE

• Send logs to diagnostic-node after collection

[Figure labels: ICMP PORT_UNREACHABLE; Query reply from server; Bad request on HTTP GET request]

CLIENT NODE

• Send logs to diagnostic-node after collection

DIAGNOSTIC NODE

DIAGNOSTIC NODE CNTD.

[Figure: Determining the status of proxy (netmon). Flowchart labels: Failure seen; Is it seen by all? (Yes / No); Machine down with failure; Machine overloaded]

[Figure: Determining the status of DNS servers. Flowchart labels: Is it not reachable for all? (Yes / No); Machine down; Send back-to-back queries; internal answered, external not answered; other cases; Problem in hierarchy; Machine overloaded]
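The two flowcharts survive only as labels, so the branch wiring in the sketch below is an assumption: a failure seen by every server node is read as "machine down", a failure seen by only some as "machine overloaded", and for DNS the back-to-back queries separate "problem in hierarchy" from "machine overloaded". The function names, the dictionary inputs and the "up" result are hypothetical and are not taken from the tool.

```python
def proxy_status(failure_seen_by):
    """Classify netmon from per-node observations.

    failure_seen_by maps a server-node name to True if that node saw the
    failure. The Yes/No wiring is an assumed reading of the flowchart.
    """
    observations = list(failure_seen_by.values())
    if not any(observations):
        return "up"  # hypothetical label: no node saw a failure
    if all(observations):
        return "machine down with failure"
    return "machine overloaded"


def dns_status(unreachable_for, internal_answered, external_answered):
    """Classify a DNS server.

    unreachable_for maps a server-node name to True if the DNS server was
    not reachable from that node. The two booleans are the outcome of the
    back-to-back internal and external queries.
    """
    observations = list(unreachable_for.values())
    if not any(observations):
        return "up"  # hypothetical label
    if all(observations):
        return "machine down"
    # Reachable from some nodes: fall through to the back-to-back queries.
    if internal_answered and not external_answered:
        return "problem in hierarchy"
    return "machine overloaded"  # "other cases" branch


print(proxy_status({"h8": True, "cse": True, "kresit": True}))
print(dns_status({"h8": True, "cse": False}, True, False))
```

Keeping the classification separate from the log collection means the same functions could serve the online, offline and remote query modes.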


• Offline mode: statistics for a specified time period
• Online mode: statistics for the last 10 minutes
• Remote query mode: query the status of a node at a specified time

EXPERIMENTAL EVALUATION

SETUP

• Server node at 8 locations around the campus
• Client node at 3 locations around the campus
• Collected data from 26th March to 15th June
  - No data for 25th May to 2nd June
• Measurements for the following nodes:

IP Address      Name
192.0.50.1      h8router-interface1
10.12.250.1     h8router-interface2
10.12.250.2     h12switch
10.2.250.1      h3router-interface1
10.105.250.1    cserouter-interface1
10.129.1.1      kresit-dns
10.200.1.11     iitbombay-dns
10.105.1.7      cse-dns
10.107.1.250    ccrouter-interface1
192.0.20.2      ccrouter-interface2
10.129.250.1    ccrouter-interface5
10.129.1.250    ccrouter-interface6
10.165.250.1    ccrouter-interface7
netmon.iitb     netmon

DNS SERVICE TIME DISTRIBUTION

DNS SERVICE TIME DISTRIBUTION: OBSERVATIONS

• Median response time is very low for all servers
• Average is significantly greater than the median
  - heavy tailed
• kresit-dns has a much higher average and 90th percentile
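To make the mean-versus-median observation concrete, here is a minimal sketch on synthetic numbers. The service times are invented for illustration and are not measurements from the campus DNS servers; the `summarize` helper is hypothetical.

```python
import statistics

# Synthetic DNS service times in milliseconds, invented for illustration.
# Most queries are fast; two slow ones form the tail.
service_times_ms = [4, 5, 5, 6, 6, 7, 8, 9, 950, 2400]


def summarize(samples):
    """Return the median, mean and maximum of a list of service times."""
    ordered = sorted(samples)
    return {
        "median": statistics.median(ordered),
        "mean": statistics.mean(ordered),
        "max": ordered[-1],
    }


summary = summarize(service_times_ms)
print(summary)
```

A few slow queries barely move the median but dominate the mean, which is the pattern the observations describe.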

OUTAGE DISTRIBUTIONS

• Most of the outages are short.
• Median is <= 2 minutes and 90th percentile is <= 10 minutes for almost all nodes.
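Outage statistics like these can be derived from a time-ordered list of status samples. The sketch below is one possible way to do it, not the tool's implementation: the `(minute, is_up)` sample format, the function names and the nearest-rank percentile are all assumptions.

```python
import math
from statistics import median


def outage_lengths(samples):
    """Return outage durations from time-ordered (minute, is_up) samples.

    An outage runs from the first 'down' sample to the next 'up' sample.
    An outage still open at the end of the log is ignored.
    """
    lengths = []
    start = None
    for minute, is_up in samples:
        if not is_up and start is None:
            start = minute
        elif is_up and start is not None:
            lengths.append(minute - start)
            start = None
    return lengths


def percentile(values, p):
    """Nearest-rank percentile (p in 1..100) of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]


def percent_downtime(samples):
    """Share of the observed window spent in closed outages, in percent."""
    span = samples[-1][0] - samples[0][0]
    return 100.0 * sum(outage_lengths(samples)) / span


samples = [(0, True), (2, False), (4, False), (6, True), (8, False), (10, True)]
lengths = outage_lengths(samples)
print(lengths, median(lengths), percentile(lengths, 90), percent_downtime(samples))
```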

PERCENTAGE DOWNTIME ACROSS DAYS

• On most days, downtime is < 2 % for most of the nodes.
• There is not much pattern across days.

COMBINED DOWNTIME

• netmon ~ 0.24 %
• The percentage of time at least one interface is not working is close to the percentage of time all are not working
  - Either the machine goes down
  - Or the measurements are not taking place at the same time
  - Time to check the status of a machine is variable

Element            At least one down (%)   All not working (%)
Hostel 8 Router    2.144                   1.980
Hostel 3 Router    0.686                   0.686
CC Router          0.657                   0.565
DNS Servers        0.414                   0.406

RESULTS SUMMARY

• Router failure > DNS failure > netmon failure
• Median node outage <= 2 min
• Small number of outages each day
• No pattern across days
• Average DNS service time ~ 300 ms
• netmon failure is less frequent than generally perceived
• Dependence on other services: LDAP, DNS
• A lot of machinery in the network is old

PROXY LOG ANALYSIS

MOTIVATION

• Per-day logs are huge, over 6 GB
• Storing logs to perform long historical analysis is a problem
  - Over 2 TB for a year
• What is the traffic distribution?
• What is the object size distribution?
• What is the response time distribution?
• Is there some trend across days?
• What strategy can be used to select logs for long-term historical analysis?

PROXY LOG ANALYSIS

• Log file has the following format:
  Month Date Time Proxy_Server squid_process_id epoch_timestamp process_time_ms source_ip tcp_status/http_status_code object_size request_type URL user_id hierarchy_code/server_ip object_type/object_sub_type
• Stored in a MySQL database
• Processed logs for a week, from May 14, 2009 to May 20, 2009
• Size of the log file ~ 6 GB
• Number of requests in a day ~ 22 million
• Bytes downloaded ~ 401.6 GB
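A line in the format above can be split into named fields before loading it into a database. The following is a minimal sketch that assumes the fields are whitespace-separated and that no field contains a space. The sample line is invented to match the stated field order and is not taken from the netmon logs; `LogEntry` and `parse_line` are hypothetical names.

```python
from collections import namedtuple

LogEntry = namedtuple("LogEntry", [
    "month", "date", "time", "proxy_server", "squid_process_id",
    "epoch_timestamp", "process_time_ms", "source_ip",
    "tcp_status", "http_status_code", "object_size", "request_type",
    "url", "user_id", "hierarchy_code", "server_ip",
    "object_type", "object_sub_type",
])


def parse_line(line):
    """Split one log line into a LogEntry (assumes 15 space-separated tokens)."""
    parts = line.split()
    if len(parts) != 15:
        raise ValueError("expected 15 fields, got %d" % len(parts))
    (month, date, time, proxy, pid, epoch, process_ms, source_ip,
     status, size, method, url, user, hierarchy, obj_type) = parts
    # Three of the tokens are slash-separated pairs.
    tcp_status, http_status = status.split("/", 1)
    hierarchy_code, server_ip = hierarchy.split("/", 1)
    object_type, object_sub_type = obj_type.split("/", 1)
    return LogEntry(month, int(date), time, proxy, pid, float(epoch),
                    int(process_ms), source_ip, tcp_status, int(http_status),
                    int(size), method, url, user, hierarchy_code, server_ip,
                    object_type, object_sub_type)


# Invented sample line, for illustration only.
sample = ("May 14 10:15:02 netmon squid[2211]: 1242276302.118 480 "
          "10.129.41.189 TCP_MISS/200 1536 GET http://example.org/index.html "
          "user1 DIRECT/192.0.2.10 text/html")
entry = parse_line(sample)
print(entry.object_type, entry.object_size, entry.process_time_ms)
```

Splitting the paired fields up front makes per-category queries (by object type or TCP status) straightforward.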

TRAFFIC DISTRIBUTION ON OBJECT TYPE: REQUESTS

• Percentage distribution remains the same across days
• Multimedia traffic is the least, ~ 0.2 %
• Text traffic is the maximum, ~ 40 %

TRAFFIC DISTRIBUTION ON OBJECT TYPE: DOWNLOADED BYTES

• Percentage distribution remains the same across days
• Multimedia traffic is the maximum, ~ 38 %

TRAFFIC DISTRIBUTION ON LOCATION: REQUESTS

• Percentage distribution remains the same across weekdays
• Increase in hostel traffic on weekends
• Decrease in academic traffic on weekends

TRAFFIC DISTRIBUTION ON LOCATION: DOWNLOADED BYTES

• Percentage distribution for downloaded bytes follows the number of requests
• Object type distribution remains the same across days, thus the majority of users have similar behavior in different locations

TRAFFIC DISTRIBUTION: SUMMARY

Category   Application (in %)   Image (in %)   Text (in %)   Multimedia (in %)   Other (in %)
Requests   11.02                35.43          42.76         0.18                10.61
Bytes      30.52                12.05          14.94         38.28               4.20

Category   Admin (in %)   Acad (in %)   Hostel (in %)   Resnet (in %)
Requests   3.50           28.16         61.90           6.58
Bytes      2.83           25.59         64.73           6.85
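The object-type table also implies how large a typical object of each type is relative to the overall mean. Dividing a category's share of bytes by its share of requests gives its mean object size as a multiple of the overall mean object size. The sketch below applies that ratio to the percentages in the table; the function name is hypothetical.

```python
# Percentages copied from the object-type summary table.
requests_pct = {"Application": 11.02, "Image": 35.43, "Text": 42.76,
                "Multimedia": 0.18, "Other": 10.61}
bytes_pct = {"Application": 30.52, "Image": 12.05, "Text": 14.94,
             "Multimedia": 38.28, "Other": 4.20}


def relative_mean_size(category):
    """Mean object size of a category as a multiple of the overall mean.

    (bytes_c / bytes_total) / (requests_c / requests_total)
    equals (bytes_c / requests_c) / (bytes_total / requests_total).
    """
    return bytes_pct[category] / requests_pct[category]


for name in requests_pct:
    print(name, round(relative_mean_size(name), 2))
```

On these figures multimedia objects come out roughly two hundred times larger than the average object, which is how 0.18 % of requests can account for 38.28 % of bytes.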

NUMBER OF ARRIVALS PER SECOND

• Less activity from 2 a.m. to 11 a.m. (LAN curtailment)
• Higher activity points at 3 p.m., 7 p.m., and 11 p.m.
• Average ~ 250, standard deviation ~ 135

NUMBER OF REQUESTS CONCURRENTLY SERVED

• Average ~ 2000, standard deviation ~ 859
• Follows the arrival curve

MEAN RESPONSE TIME AT TIME OF DAY

• Response time remains almost constant throughout the day
• A peak at around 4 a.m.
• Average ~ 9.8 seconds
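The reported averages can be cross-checked against each other with simple arithmetic. The sketch below is a rough sanity check, not part of the original analysis. It first converts about 22 million requests per day into an arrival rate, then applies Little's law (mean number in system = arrival rate x mean time in system), a standard queueing result that the slides do not invoke. The function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def arrival_rate_per_second(requests_per_day):
    """Average arrivals per second implied by a daily request count."""
    return requests_per_day / SECONDS_PER_DAY


def implied_mean_response_time_s(arrival_rate, mean_concurrent):
    """Little's law rearranged: W = L / lambda."""
    return mean_concurrent / arrival_rate


rate = arrival_rate_per_second(22_000_000)
print(round(rate, 1))                            # arrivals per second
print(implied_mean_response_time_s(250, 2000))   # seconds
```

The daily total implies about 255 arrivals per second, close to the reported average of about 250. With about 2000 requests served concurrently, Little's law implies a mean response time of about 8 seconds, the same order as the reported 9.8 seconds.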

MEDIAN RESPONSE TIME AT TIME OF DAY

• Median response time remains constant throughout the day, 480 ms for the day
• The median curve is a better estimate of the average value on a day
• Neither the median nor the mean response time follows the concurrently-served-requests curve or the arrival curve

CUMULATIVE RESPONSE TIME DISTRIBUTION

• For multimedia the curve becomes linear
• For the remaining categories it is heavy tailed
• Median response times: application ~ 472 ms, text ~ 563 ms, image ~ 172 ms, multimedia ~ 10175 ms and other ~ 672 ms

CUMULATIVE OBJECT SIZE DISTRIBUTION

• For multimedia, object sizes are more evenly distributed
• The remaining categories have 90 % of objects < 10 KB
• Median object sizes: application ~ 1.5 KB, text ~ 0.8 KB, image ~ 1.7 KB, multimedia ~ 903 KB and other ~ 0.46 KB

RESULTS SUMMARY

• Multimedia traffic is the major part of WAN traffic
• Percentage traffic distribution
  - Similar across object types on all days
  - Similar in different areas, except on weekends
• Thus any log file can be selected as a representative of the week
  - Larger log file for more data
  - One for the weekend and one for weekdays

FUTURE WORK

• Characterization of request processing time at the proxy
• Explore the other causes of failure, including the LDAP service
• Explore failures from the side of the ISP, from a point outside the network
• Study the traffic within the LAN

THESIS CONTRIBUTIONS

• Studied the tools and methodologies used for network measurement
• Surveyed and documented the campus network of IIT Bombay
  - Architecture
  - Services
  - Failures
• Developed a tool to detect some of the failures
  - Can be easily extended to detect others
• Experimental evaluation of the tool by setting up a testbed
• Measurement analysis of proxy logs

BIBLIOGRAPHY

[1] Computer Center, IIT Bombay. http://www.cc.iitb.ac.in
[2] dnscache. http://cr.yp.to/djbdns/dnscache.html
[3] Iperf. http://dast.nlanr.net/Projects/Iperf/
[4] iptables. http://www.netfilter.org/projects/iptables/index.html
[5] Jpcap: a Java library for capturing and sending network packets. http://netresearch.ics.uci.edu/kfujii/jpcap/doc/
[6] Squid logs. http://wiki.squid-cache.org/SquidFaq/SquidLogs
[7] Traceroute. http://sourceforge.net/projects/traceroute
[8] Ultra monkey. http://www.ultramonkey.org/
[9] Wikimedia. http://www.squid-cache.org/Library/wikimedia.dyn
[10] Kostas G. Anagnostakis, Michael Greenwald, and Raphael Ryger. cing: Measuring network-internal delays using only existing infrastructure. In Proceedings of IEEE Infocom, April 2003.
[11] Ranveer Chandra, Venkata N. Padmanabhan, and Ming Zhang. WiFiProfiler: Cooperative Diagnosis in Wireless LANs. In Proceedings of the 4th International Conference on Mobile Systems, Applications and Services, June 2006.
[12] Yu-Chung Cheng, John Bellardo, Peter Benko, Alex C. Snoeren, Geoffrey M. Voelker, and Stefan Savage. Jigsaw: Solving the Puzzle of Enterprise 802.11 Analysis. In Proceedings of the 2006 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, September 2006.
[13] Ramesh Govindan and Hongsuda Tangmunarunkit. Heuristics for Internet map discovery. In Proceedings of the Nineteenth Annual Joint Conference of the IEEE Computer and Communications Societies, 2000.
[14] Bradley Huffaker, Marina Fomenkov, David Moore, and kc claffy. Macroscopic analyses of the infrastructure: measurement and visualization of Internet connectivity and performance. In Proceedings of Passive and Active Measurements, 2001.
[15] Van Jacobson. pathchar - a tool to infer characteristics of Internet paths, 1997.
[16] Alex Rousskov and Valery Soloviev. A performance study of the Squid proxy on HTTP/1.0. World-Wide Web Journal, Special Edition on WWW Characterization and Performance Evaluation, 1999.
[17] Stefan Savage. Sting: a TCP-based Network Measurement Tool. In Proceedings of the Second Conference on USENIX Symposium on Internet Technologies and Systems, 1999.
[18] Subhabrata Sen and Jia Wang. Analyzing peer-to-peer traffic across large networks. In Proceedings of the 2006 ACM CoNEXT Conference, 2006.
[19] Nirav S. Uchat. IIT Bombay web traffic characterization.
[20] Ameya P. Usgaonkar. Network Performance Analysis by Mining Multi-Variate Time Series Data, January 2001.

Extra Slides

RELATED WORK

Passive Measurement
• WiFiProfiler
  - collaborative diagnosis, information from neighbors
  - blame assignment algorithm to predict the actual cause
• Jigsaw
  - collect and merge traces from multiple vantage points
  - create a single unified view of the network
    - large-scale synchronization
    - frame unification
  - Measures
    - queuing delays experienced by users
    - throughput: compare observed vs expected (using RTT, path loss)
    - effect of mobility techniques: scanning, DHCP, initial association

RELATED WORK CNTD

• Squid log analysis by Rousskov et al.
  - Logs from seven proxies, 18 days of logs
  - Applied a patch to Squid to measure: proxy connect time, client connect time, server reply time, proxy reply time, swap-in time and swap-out time
  - Studied traffic distribution, response time at proxy, number of requests at proxy, disk traffic intensity, disk utilization, disk response time: all against TCP_STATUS, i.e. HIT and MISS
  - Shortcomings:
    - No long-term historical analysis
    - No comparison of direct traffic with proxied traffic
• Active measurements
  - Pathchar: bandwidth, queue size, packet drop rate
  - Traceroute: RTT, topology

RELATED WORK CNTD

Active Measurement
• PathChar
  - measures: bandwidth, queue size, packet drop rate
  - uses the TTL field in the IP header
  - series of probes with varying packet size
  - Neglecting queuing delay, S_error/B and t_processing, the model reduces to RTT = S_packet/B
  - Packet loss: number of error messages received
  - Statistic for node n = statistic till the nth node - statistic till the (n-1)th node
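The simplified model and the differencing rule above can be combined into a small worked example. This is a sketch under stated assumptions, not pathchar's actual algorithm. It assumes the RTT to node n grows linearly with probe size, with a slope equal to the sum of 1/B over the links up to n, so subtracting the slope to node n-1 isolates link n. The least-squares fit and all names are illustrative, and the data is synthetic.

```python
def fitted_slope(sizes, rtts):
    """Least-squares slope of RTT (seconds) against probe size (bytes)."""
    n = len(sizes)
    mean_size = sum(sizes) / n
    mean_rtt = sum(rtts) / n
    num = sum((s - mean_size) * (r - mean_rtt) for s, r in zip(sizes, rtts))
    den = sum((s - mean_size) ** 2 for s in sizes)
    return num / den


def per_link_bandwidth(sizes, rtts_by_node):
    """Estimate each link's bandwidth in bytes per second.

    rtts_by_node[i] holds the RTTs to node i+1, one per probe size.
    Statistic for node n = statistic till n - statistic till n-1.
    """
    bandwidths = []
    previous = 0.0
    for rtts in rtts_by_node:
        cumulative = fitted_slope(sizes, rtts)
        bandwidths.append(1.0 / (cumulative - previous))
        previous = cumulative
    return bandwidths


# Synthetic two-hop path: link 1 at 1e6 B/s, link 2 at 5e5 B/s.
sizes = [100, 500, 1000, 1500]
node1 = [0.001 + s / 1e6 for s in sizes]
node2 = [0.002 + s / 1e6 + s / 5e5 for s in sizes]
print(per_link_bandwidth(sizes, [node1, node2]))
```

Fitting a slope rather than dividing a single RTT by a single size means any constant per-hop latency cancels out of the estimate.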

APPLICATION LAYER FAILURES

• Web Access Failures
  - Service Unavailable
  - Connection timed out failure
  - Connection refused
  - Connection reset
  - Gateway Timeout
  - No data received
  - Connection closed at an intermediate byte
• DNS Access Failures
  - Connection timed out
  - Blank answer field
• Router Access Failures
  - No Route To Host
  - No response received

IMPLEMENTATION

Client Module
• Snoop on incoming packets using the jpcap library
• Node reachability
  - If any packets received -> subnet switch reachable
  - If IP packets from another subnet received -> subnet router reachable
  - If IP packets from DNS server received -> DNS server reachable
  - If IP packets from netmon received -> netmon reachable
• Traffic characteristics
  - Size of packet
  - Delay using inter-arrival time of two packets
• Threads to synchronize measurements and information sending
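The reachability rules of the client module can be written as a small pure function over the source addresses of captured packets. The original tool captures packets with jpcap in Java, so this Python version is only an illustration of the rules. The function name, the parameters and the proxy address used in the example are hypothetical, and the sketch considers IP packets only, whereas the first rule in the slide counts any received packet.

```python
import ipaddress


def reachable_elements(source_ips, local_subnet, dns_ip, proxy_ip):
    """Apply the client-module rules to the source IPs of received packets."""
    subnet = ipaddress.ip_network(local_subnet)
    seen = [ipaddress.ip_address(ip) for ip in source_ips]
    return {
        # Any packet at all means the subnet switch forwarded something.
        "subnet switch": len(seen) > 0,
        # A packet from outside the local subnet must have crossed the router.
        "subnet router": any(ip not in subnet for ip in seen),
        "dns server": ipaddress.ip_address(dns_ip) in seen,
        "netmon": ipaddress.ip_address(proxy_ip) in seen,
    }


# 10.200.1.11 is iitbombay-dns from the node table; 10.0.0.1 is a
# placeholder proxy address used only for this example.
status = reachable_elements(
    ["10.129.41.7", "10.200.1.11"],
    local_subnet="10.129.41.0/24",
    dns_ip="10.200.1.11",
    proxy_ip="10.0.0.1",
)
print(status)
```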

IMPLEMENTATION CNTD

Server Module
• Check plugged-in ethernet cable: status of interface using ifconfig
• Status of switch/router: hop-limited IP packets, using traceroute
• Status of DNS server: query to the DNS server using dig
• Status of proxy: an HTTP GET request at port 80 using wget
• Web download: using wget
• Uses the Java runtime library to run these utilities
• Synchronize using the SNTP protocol: implemented in Java

IMPLEMENTATION CNTD

Communication Module
• Used for sending logs and querying diagnostic-nodes
• Implemented using the Java Net package
  - Receiver listens on a port
  - Sender connects and sends the logs/query
  - Our protocol to send and receive messages

Logging Module
• Used by diagnostic, server and client nodes
• Stores logs in a directory hierarchy: ip/yyyy/mm/dd
• Unsent logs are stored to be sent in future
• New threads are created to make logs -> prevents blocking
• Implemented using Java threads and the Java IO package

IMPLEMENTATION CNTD

Diagnostic Module
• Uses the logs of server and client nodes
• Continuous mode
  - Analyzes statistics every 10 minutes
  - Statistics generated: node outages, percentage status distribution, last uptime status of nodes, DNS service time statistics
• Offline mode
  - User specifies the start and end time of measurement
  - Statistics generated: node outages, percentage status distribution, last uptime status of nodes, DNS service time statistics, node status at a given time
• Remote query mode
  - User can query about node status at a given time

PERCENTAGE DOWN-TIME ON A DAY: DNS SERVERS

• On most days, percentage downtime is < 1 % for all servers
• No pattern in down-time across days

PERCENTAGE DOWN-TIME ON A DAY: NETMON

• On most days, percentage downtime is < 0.2 %
• No pattern in down-time across days

PERCENTAGE DOWN-TIME ON A DAY: HOSTEL 8 ROUTER

• On most days, percentage downtime is < 2 %
• No pattern in down-time across days

PERCENTAGE DOWN-TIME ON A DAY: CC ROUTER

• On most days, percentage downtime is < 1 %
• No pattern in down-time across days

OUTAGE LENGTH DISTRIBUTION: CC ROUTER

• Most of the outages are of smaller length

OUTAGE LENGTH DISTRIBUTION: DNS

• Most of the outages are of small length
• Smaller number of outages

OUTAGE LENGTH DISTRIBUTION: NETMON

• Most of the outages are of length < 3 min

OUTAGE LENGTH DISTRIBUTION: HOSTEL 8 ROUTER

• Most of the outages are of smaller length

OUTAGE LENGTHS

STATUS DISTRIBUTION

PROXY RESPONSE TIME VS USER RESPONSE TIME

EXPERIMENT: PROXY FAILURE

• Setup
  - wget to fetch Berkeley and netmon (http://netmon.iitb.ac.in)
  - Repeated at every 6-minute interval
  - From 2:42 on 22nd September to 1:06 on 25th September, from kresit (10.129.41.189)
  - A 400 bad request response, denoted by 1, indicates the proxy is up
  - -1 for connection refused error
  - -3 for 503 server error
• Result
  - netmon: 0.7 % connection refused errors
  - Berkeley: 8.7 % connection refused errors, 0.28 % 503 errors
  - Intersection of failures implies
    - Machine not running, or
    - Port is closed
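The outcome encoding used in this experiment (1, -1, -3) can be sketched as follows. The experiment itself used wget; this sketch swaps in Python's standard `http.client` module. The function names and the `None` result for outcomes the slide does not list are assumptions.

```python
import http.client

PROXY_UP = 1             # 400 bad request: proxy answered the direct GET
CONNECTION_REFUSED = -1
SERVER_ERROR_503 = -3


def classify(http_status=None, error=None):
    """Map one probe outcome to the codes used on the slide."""
    if isinstance(error, ConnectionRefusedError):
        return CONNECTION_REFUSED
    if http_status == 503:
        return SERVER_ERROR_503
    if http_status == 400:
        return PROXY_UP
    return None  # outcome not covered by the slide's encoding


def probe(host, port=80, timeout=10):
    """Send one HTTP GET and classify the result."""
    try:
        conn = http.client.HTTPConnection(host, port, timeout=timeout)
        try:
            conn.request("GET", "/")
            return classify(http_status=conn.getresponse().status)
        finally:
            conn.close()
    except OSError as exc:            # includes ConnectionRefusedError
        return classify(error=exc)
    except http.client.HTTPException:
        return None


print(classify(http_status=400), classify(error=ConnectionRefusedError()))
```

Keeping `classify` separate from `probe` lets the encoding be checked without network access.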

EXPERIMENT: PROXY FAILURE CNTD

EXPERIMENT: DNS FAILURE

• Setup
  - dig to send back-to-back probes to dns.iitb.ac.in
  - Sent periodically, once every 2 minutes
  - Conducted from 22:06 on 17th September to 13:36 on 18th September, from kresit (10.129.41.189)
  - One query for an internal domain and the other for an external domain
  - Both domains randomly generated
  - 1 -> answer field present, 0 -> answer field not present
• Result
  - External queries failed 2.36 % of the time
  - Internal queries never failed
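The probing schedule and the failure percentage can be sketched as below. The year in the timestamps is a placeholder, since the slides give only day and month. Including both endpoints of the window, the function names and the 1/0 result list are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def probe_times(start, end, interval):
    """Yield probe timestamps from start to end inclusive."""
    current = start
    while current <= end:
        yield current
        current += interval


def failure_percentage(results):
    """Percentage of probes recorded as 0 (answer field not present)."""
    return 100.0 * results.count(0) / len(results)


# Placeholder year; only the day, month and time come from the slide.
start = datetime(2009, 9, 17, 22, 6)
end = datetime(2009, 9, 18, 13, 36)
slots = list(probe_times(start, end, timedelta(minutes=2)))
print(len(slots))
print(failure_percentage([1, 1, 0, 1]))
```

A window of this length holds 466 probe slots when both endpoints are counted, so a single failed external query moves the reported percentage by roughly 0.2 points.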

EXPERIMENT: DNS FAILURE CNTD
