end-to-end monitoring of high performance network paths
End-to-end Monitoring of High Performance Network Paths. Les Cottrell, Connie Logg, Jerrod Williams (SLAC), for the ESCC meeting, Columbus, Ohio, July 2004. www.slac.stanford.edu/grp/scs/net/talk03/escc-jul04.ppt
1
End-to-end Monitoring of High Performance Network Paths
Les Cottrell, Connie Logg, Jerrod Williams
SLAC, for the ESCC meeting, Columbus, Ohio, July 2004
www.slac.stanford.edu/grp/scs/net/talk03/escc-jul04.ppt
Partially funded by DOE/MICS Field Work Proposal on Internet End-to-end Performance Monitoring (IEPM), also supported by IUPAP
2
Need
• Data intensive science (e.g. HENP) needs to share data at high speeds
• Needs high-performance, reliable e2e paths and the ability to use them
• End users need long and short term estimates of network and application performance for:
  – Planning, setting expectations & troubleshooting
• You can’t manage what you can’t measure
3
IEPM-BW
• Toolkit:
  – Enables regular, E2E measurements with user selectable:
    • Tools: iperf (single & multi-stream), bbftp, bbcp, GridFTP, ping (RTT), traceroute
    • Periods (with randomization)
    • Remote hosts to monitor
  – Hierarchical to match the tiered approach of BaBar & LHC computation / collaboration infrastructures
  – Includes:
    • Auto-clean up of hung processes at both ends
    • Management tools to look for failures (unreachable hosts, failing tools etc.)
    • Web navigation of results
    • Visualization of data as time-series, histograms, scatter plots, tables
    • Access to data in machine readable form
    • Documentation on host etc. requirements, program logic manuals, methods
4
Requirements
  – Requires:
    • Monitoring toolkit installed on Linux monitoring host
      – Host provided & administered by monitoring site personnel
      – No need for root privileges
      – Appropriate iperf, bbftp etc. ports to be opened
      – SLAC can do initial install & configuration for monitoring host
        » 50 line configuration file for each remote host, tells where directories, applications are located, options for various tools etc. (mainly defaults)
    • Small toolkit installed at remote (monitored) hosts
    • Ssh access to an account at remote hosts
      – This is the biggest problem with deployment
5
Achievable throughput & file transfer
• IEPM-BW
  – High impact (iperf, bbftp, GridFTP …) measurements at 90±15 min intervals
[Figure: time series with selectable focal area; series: iperf, iperf1, bbftp, abing, min RTT, avg RTT; markers: forward and reverse route changes]
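The 90±15 minute spacing of the high-impact measurements can be sketched as below. This is a minimal illustration, not the IEPM-BW scheduler: the uniform distribution, the function names and the `seed` parameter are all assumptions made for the example.

```python
import random


def next_interval_minutes(nominal=90.0, jitter=15.0, rng=random):
    """Draw one interval uniformly from nominal ± jitter minutes (assumed distribution)."""
    return rng.uniform(nominal - jitter, nominal + jitter)


def schedule(start_minute, count, nominal=90.0, jitter=15.0, seed=None):
    """Return `count` measurement start times (in minutes) after `start_minute`."""
    rng = random.Random(seed)  # a seed makes the schedule reproducible for testing
    times = []
    t = start_minute
    for _ in range(count):
        t += next_interval_minutes(nominal, jitter, rng)
        times.append(t)
    return times
```

Randomizing each gap, rather than adding a fixed offset once, keeps measurements from staying synchronized with periodic events on the path.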
6
Visualization: traceroutes
• Compact table to see correlations between many routes
• Identify significant changes in routes
  – Differences in > 1 hop, NOT same first 3 octets, NOT same AS
• Report all traceroute pathologies:
  – ! annotations, ICMP checksum errs, non-responding interfaces, unreachable end host, stutters, multi-homed end host
• Note, we observe:
  – Most route changes (>98%) do not result in significant performance changes
  – Many performance changes (~50±20%) are NOT due to route changes
    • Applications, host congestion, level 2 changes etc.
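One possible reading of the "significant change" rule above is sketched here. The interpretation is an assumption: a hop counts as changed only if the two addresses share neither the first three octets nor the AS, hops are compared position by position, and a route change is significant when more than one hop has changed. The function names and the `(ip, asn)` tuple layout are hypothetical.

```python
from itertools import zip_longest


def same_first_three_octets(ip_a, ip_b):
    """True when two dotted-quad IPv4 addresses share their first three octets."""
    return ip_a.split(".")[:3] == ip_b.split(".")[:3]


def hop_changed(hop_a, hop_b):
    """Each hop is an (ip, asn) tuple; asn may be None when unknown."""
    ip_a, as_a = hop_a
    ip_b, as_b = hop_b
    if ip_a == ip_b or same_first_three_octets(ip_a, ip_b):
        return False
    if as_a is not None and as_a == as_b:
        return False
    return True


def significant_change(route_a, route_b, max_minor_hops=1):
    """True when more than `max_minor_hops` hops differ between two routes."""
    changed = 0
    for a, b in zip_longest(route_a, route_b):
        # A hop present in only one route counts as a change.
        if a is None or b is None or hop_changed(a, b):
            changed += 1
    return changed > max_minor_hops
```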
7
Route table Example
• Compact so can see many routes at once
[Figure: route table screenshot with callouts]
  – History navigation
  – Multiple route changes (due to GEANT), later restored to original route
  – Available bandwidth
  – Raw traceroute logs for debugging
  – Textual summary of traceroutes for email to ISP
  – Description of route numbers with date last seen
  – User readable (web table) routes for this host for this day
  – Route # at start of day, gives idea of route stability
  – Mouseover for hops & RTT
8
Another example
[Figure: route table screenshot with callouts]
  – TCP probe type
  – Host not pingable
  – Intermediate router does not respond
  – ICMP checksum error
  – Level change
  – Get AS information for routes
9
Topology
• Choose times and hosts and submit request
• Nodes colored by ISP
• Mouseover shows node names
• Click on node to see subroutes
• Click on end node to see its path back
• Also can get raw traceroutes with AS’
[Figure: topology map; node labels: SLAC, ESnet, GEANT, JAnet, CESnet, IN2P3, CLRC, DL; callouts: alternate route; axis: hour of day]
10
IEPM-BW HENP Deployment June 2004
• Measurements from SLAC & FNAL
  – BaBar, CMS, D0, CDF +
• 60-70 remote hosts in 12 countries
• Toolkits needed in monitor & remote hosts
Range of bandwidths: 500 Kbps to 1 Gbps
11
Working on:
• Provide more options for security for remote hosts
• Web services API access to data
• Provide & integrate low network utilization tool:
  – ~ 25% of Abilene traffic is net measurement
• Automate detection of anomalous step changes in performance
• Evaluate using QOS or HSTCP-LP to reduce impact of iperf traffic
  – Evidence that it causes packet loss (ESnet/FNAL/SLAC)
12
Simplify remote security
• Currently use ssh to start, kill servers, check things etc.
• Instead run servers all the time at remote host
  – Check & restart with cron job
  – Also kill hung processes with cron jobs
  – More work for remote admin
  – More difficult to check why things are not working
• NASA very hard to get account (requires training etc.), so this will be a work-around
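A cron-driven "check & restart" step could look roughly like the sketch below. It is an illustration under stated assumptions, not the IEPM-BW tooling: the check is a plain TCP connect to the server port, the function names are hypothetical, and the restart command is whatever the remote admin chooses (for an iperf server this might be `["iperf", "-s"]` on its default TCP port 5001).

```python
import socket
import subprocess


def port_is_open(host, port, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def ensure_server(port, start_cmd, host="127.0.0.1", launcher=subprocess.Popen):
    """Start the server if nothing is listening; return True if a start was attempted.

    `launcher` is injectable so the logic can be exercised without starting a process.
    """
    if port_is_open(host, port):
        return False
    launcher(start_cmd)
    return True
```

A crontab entry would then run a small script calling `ensure_server(5001, ["iperf", "-s"])` every few minutes. A connect check only shows that the port answers, so a hung server that still accepts connections would need a separate check.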
13
Data Access
• Interactive web accessible
  – Most data can be downloaded in space or comma separated etc. (accessible via link or to program (e.g. using lynx to access URL))
  – However non standard
• Web services (GGF NMWG definitions)
  • Working (with Warren Matthews/GATech/I2) on defining / providing access to traceroutes for AMP & IEPM-LITE
  • MonALISA is accessing data via Web services

Characteristic                              Toolname
path.bandwidth.achievable.TCP               iperf
path.bandwidth.achievable.TCP.multiStream   iperf, bbftp, bbcp, GridFTP
path.bandwidth.capacity                     ABwE
path.bandwidth.utilization                  ABwE
14
Low impact bandwidth measurement
• Goals:
  – Make a measurement in < 1 second rather than tens of seconds
  – Inject little network traffic
  – Provide reasonable agreement with more intense methods (e.g. iperf)
• Enables:
  – Measurements of low performance links (e.g. to developing countries)
  – Helps avoid need for scheduling
  – More frequent measurements (minutes vs. hours)
  – Lower impact, more friendly
15
Low impact Bandwidth
• Use 20 packet pairs to roughly estimate dynamic bw Capacity & Xtraffic, then Available = Capacity – Xtraffic
  – Capacity ~ min pair separation; Xtraffic ~ packet pair dispersion
[Figure: ABwE SLAC to Caltech, Mar 19, 2004; series: dynamic bandwidth capacity (DBC), cross-traffic, available bandwidth = DBC – X-traffic, iperf]
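The packet-pair idea can be illustrated with a deliberately simplified model. This is not the ABwE/abing implementation: it assumes capacity is the packet size divided by the smallest observed pair separation, and that the fraction of time pairs were stretched beyond that minimum approximates the share of capacity taken by cross-traffic. The function name and the 1500-byte default are assumptions.

```python
def estimate_bandwidth(separations_s, packet_bytes=1500):
    """Estimate (capacity, cross_traffic, available) in bits/s from packet-pair
    arrival separations given in seconds (simplified, assumed model)."""
    if not separations_s or min(separations_s) <= 0:
        raise ValueError("need positive pair separations")
    bits = packet_bytes * 8
    min_sep = min(separations_s)
    capacity = bits / min_sep  # narrowest link sets the minimum separation
    extra = sum(s - min_sep for s in separations_s)  # stretching attributed to cross-traffic
    cross_traffic = capacity * extra / sum(separations_s)
    return capacity, cross_traffic, capacity - cross_traffic
```

For 20 pairs that all arrive 12 microseconds apart the model reports roughly 1 Gbit/s capacity and no cross-traffic; if half the pairs arrive 24 microseconds apart, a third of the capacity is attributed to cross-traffic.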
16
Anomalous Event Detection
• Too many graphs to scan by hand, need to automate
  – SLAC to Caltech link performance dropped by factor 5 for ~ month before noticed, fixed within 4 hours of reporting
• Looking for long-term step down changes in bandwidth
• Use modified “plateau” algorithm from NLANR
  – Divide data into history & trigger buffer
  – If y < mh – β * σh then trigger, else history (mh, σh = mean and standard deviation of the history buffer, β = sensitivity)
    • When trigger buffer fills: if mt < δ * mh, then have an event (mt = mean of the trigger buffer, δ = threshold factor)
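The history/trigger buffer scheme can be sketched as follows. This is a simplified illustration, not the authors' modified plateau code. Assumptions made here: the history buffer must fill before any testing starts, a normal sample clears the trigger buffer, an event is declared when the relative drop (mh – mt)/mh exceeds the threshold, and after a full trigger buffer its samples are folded into the history. The class and parameter names are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


class PlateauDetector:
    def __init__(self, history_len=600, trigger_len=60, sensitivity=2.0, threshold=0.4):
        self.history = deque(maxlen=history_len)
        self.trigger = []
        self.trigger_len = trigger_len
        self.sensitivity = sensitivity
        self.threshold = threshold

    def add(self, y):
        """Feed one bandwidth sample; return True when a step-down event is detected."""
        if len(self.history) < self.history.maxlen:
            self.history.append(y)  # warm-up: fill the history buffer first
            return False
        mh = mean(self.history)
        sh = pstdev(self.history)
        if y >= mh - self.sensitivity * sh:
            self.history.append(y)  # normal sample: joins history
            self.trigger.clear()    # assumed: a normal sample ends the candidate step
            return False
        self.trigger.append(y)
        if len(self.trigger) < self.trigger_len:
            return False
        mt = mean(self.trigger)
        event = mh > 0 and (mh - mt) / mh > self.threshold
        self.history.extend(self.trigger)  # the new level becomes part of the baseline
        self.trigger.clear()
        return event
```

With one sample per minute, `trigger_len` directly sets how long a drop must persist before it is reported.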
17
Anomalous Event Detection
• Length of trigger buffer determines how long a step down must last before being interesting, we use 1 to 3 hours
  – E.g. 20 mins saw 9 events, 40 mins saw 3, 60 mins none
• Works well unless strong (>40%) diurnal changes
  – Next step incorporate diurnal checks
[Figure: bandwidth (Mbits/s, 0 to 1000) vs. time, 4/9/04 0:00 to 4/12/04 0:00; series: EWMA(Abw), EWMA(Xtr), EWMA(Cap), event; second axis 10% to 100%; parameters: history buffer length l = 1800 mins, trigger buffer length = 20 mins, sensitivity = 2]
Events caused by application on Caltech host (not network related)
18
Putting it together
[Figure: bandwidth from SLAC to Supernet.org, June 2, 2004; bandwidth in Mbits/s (0 to 1000) vs. time, 6/2/04 0:00 to 6/3/04 0:00; series: Xtr, Abw, Cap; lines at mh and mh – 2σh; route changes marked; network labels: SLAC, ESnet, CENIC, Abilene, SOX, Supernet]
mh = 954 Mbits/s, mt = 753 Mbits/s
(mh – mt) / sqrt((σh**2 + σt**2) / 2) = 2.4
sensitivity = 2; threshold 40%
l (history buffer length) = 600
trigger buffer length = 60
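The statistic quoted on this slide can be computed as in the sketch below. The slide gives mh and mt but not the two standard deviations, so the value 83.75 used in the usage note is purely hypothetical, chosen only so that the result matches the quoted 2.4. The function names are also assumptions.

```python
import math


def step_significance(mh, mt, sigma_h, sigma_t):
    """(mh - mt) divided by the root-mean-square of the two standard deviations."""
    return (mh - mt) / math.sqrt((sigma_h ** 2 + sigma_t ** 2) / 2)


def relative_drop(mh, mt):
    """Fractional drop of the trigger-buffer mean relative to the history mean."""
    return (mh - mt) / mh
```

For the slide's means, `relative_drop(954, 753)` is about 0.21, and `step_significance(954, 753, 83.75, 83.75)` gives 2.4 with the hypothetical deviations.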
19
Future plans
• Looking for funding…
• Integrate it all
• Improve distribution and management tools
• Add monitoring sites e.g. HENP tier 0 & 1 sites such as CERN, BNL, IN2P3, DESY …; ESnet, StarLight, Caltech …
• Add extra functionality:
  – Improved event detection
    • include diurnals, multivariate
  – Filter alerts
  – Upon detecting anomaly gather relevant information (network, host etc.) including on-demand measurements (e.g. NDT) and prepare web page & email
  – Improved web services access
20
Thanks: Development
• Jiri Navratil (Prague) – bandwidth estimation (ABwE)
• Paola Grosso (SLAC) & Warren Matthews (GATech) - web services
• Maxim Grigoriev (FNAL) – event detection, IEPM visualization, major monitoring site
• Ruchi Gupta (Stanford) – event visualization
• Prof Arshad Ali & Fahad Khalid (NIIT, Pakistan) – data collection after event
• Rich Carlson (I2), NDT
21
Thanks: on-going
• Foreign:
  – Andrew Daviel (TRIUMF), Simon Leinen (SWITCH), Olivier Martin (CERN), Sven Ubik (CESnet), Kars Ohrenberg (DESY), Bruno Hoeft (FZK), Dominique (IN2P3), Fabrizio Coccetti (INFN), Cristina Bulfon (INFN), Yukio Karita (KEK), Takashi Ichihara (RIKEN), Yoshinori Kitasuji (APAN), Antony Antony (NIKHEF), Arshad Ali (NIIT), Serge Belov (BINP), Robin Tasker (DL & RAL), Yee Ting Lee (UCL), Richard Hughes-Jones (Manchester)
• US
  – Shawn McKee (Michigan), Tom Hacker (Michigan), Eric Boyd (I2), Stanislav Shalunov (SOX), George Uhl (GSFC), Brian Tierney (LBNL), John Hicks (Indiana), John Estabrook (UIUC), Maxim Grigoriev (FNAL), Joe Izen (UT Dallas), Chris Griffin (U Florida), Tom Dunigan (ORNL), Dantong Yu (BNL), Suresh Singh (Caltech), Chip Watsom (JLab), Robert Lukens (JLab), Shane Canon (NERSC), Kevin Walsh (SDSC), David Lapsley (MIT/Haystack/ISI-E)
22
More information
• IEPM-BW home page
  – http://www-iepm.slac.stanford.edu/bw/
• Comparison of Internet E2E Measurement infrastructures
  – http://www-iepm.slac.stanford.edu/grp/scs/net/proposals/infra-mon.html
• ABwE lightweight bandwidth estimation
  – http://www-iepm.slac.stanford.edu/abing/
• Anomalous Event Detection
  – www.slac.stanford.edu/grp/scs/net/papers/sigcomm2004/nts26-logg.pdf
• IEPM Web Services
  – http://www-iepm.slac.stanford.edu/tools/web_services/
23
Extra Slides
24
Web Services
• See http://www-iepm.slac.stanford.edu/tools/web_services/
• Working for: RTT, loss, capacity, available bandwidth, achievable throughput
• No schema defined for traceroute (hop-list)
• PingER
  – Definition WSDL
  – http://www-iepm.slac.stanford.edu/tools/soap/wsdl/PINGER_profile.wsdl
    • path.delay.roundTrip ms (min/avg/max + RTTs),
    • path.loss.roundTrip
    • IPDV (ms),
    <definitions name="PINGER" targetNamespace="http://www-iepm.slac.stanford.edu/tools/soap/wsdl/PINGER_profile.wsdl">
    <message name="GetPathDelayRoundTripInput">
    <part name="startTime" type="xsd:string"/>
    <part name="endTime" type="xsd:string"/>
    <part name="destination" type="xsd:string"/>
    </message>
    • Also dups, out of order, IPDV, TCP thru estimate
    • Require to provide packet size, units, timestamp, sce, dst
  – path.bandwidth.available, path.bandwidth.utilized, path.bandwidth.capacity
• Mainly for recent data, need to make real time data accessible
• Used by MonALISA so need coordination to change definitions
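To make the WSDL fragment above more concrete, the sketch below builds a SOAP 1.1 request body carrying the three string parts of `GetPathDelayRoundTripInput`. It only constructs XML and sends nothing. The operation element name `GetPathDelayRoundTrip`, the use of the WSDL target namespace for it, and the time string format in any call are assumptions; the real service definition should be taken from the WSDL itself.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
PINGER_NS = "http://www-iepm.slac.stanford.edu/tools/soap/wsdl/PINGER_profile.wsdl"


def build_request(start_time, end_time, destination, operation="GetPathDelayRoundTrip"):
    """Return a SOAP envelope (as text) with startTime, endTime and destination parts.

    `operation` is a hypothetical element name, not confirmed by the slides.
    """
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    op = ET.SubElement(body, f"{{{PINGER_NS}}}{operation}")
    for name, value in (
        ("startTime", start_time),
        ("endTime", end_time),
        ("destination", destination),
    ):
        ET.SubElement(op, name).text = value  # all three parts are xsd:string
    return ET.tostring(envelope, encoding="unicode")
```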
25
Perl access to PingER
26
PingER WSDL
27
Output from script
28
Perl AMP traceroute
29
AMP traceroute output
30
Intermediate term access
• Provide access to analyzed data in tables via .tsv format download from web pages.
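A downloaded .tsv table can be read with the standard library, as in this minimal sketch. The column names used in any call are hypothetical, since the slides do not show the table layout, and cells that are not numeric are simply skipped.

```python
import csv
import io


def parse_tsv(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def column_as_floats(rows, name):
    """Collect one column as floats, skipping missing or non-numeric cells."""
    values = []
    for row in rows:
        try:
            values.append(float(row[name]))
        except (KeyError, TypeError, ValueError):
            continue
    return values
```

The text itself could be obtained with `urllib.request.urlopen(url).read().decode()` for whichever table URL the web pages expose.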
31
Bulk Data
• For long term detailed data, we tar and zip the data on demand. Mainly for PingER data.
32
[Figure: ABwE and iperf, 28 days bandwidth history from SLAC to Caltech; also RTT, bbftp and iperf 1 stream]
28 days bandwidth history. During this time we can see several different situations caused by different routing from SLAC to Caltech
  – Drop to 100 Mbits/s by routing (BGP) errors
  – Drop to 622 Mbits/s path
  – Back to new CENIC path
  – New CENIC path 1000 Mbits/s
  – Reverse routing changes
  – Forward routing changes
Scatter plot graphs of iperf versus ABw on different paths (range 20–800 Mbits/s) showing agreement of two methods (28 days history)
33
Changes in network topology (BGP) can result in dramatic changes in performance
[Figure: snapshot of traceroute summary table (remote host vs. hour); samples of traceroute trees generated from the table]
[Figure: ABwE measurement one/minute for 24 hours, Thurs Oct 9 9:00am to Fri Oct 10 9:01am; Mbits/s; series: dynamic BW capacity (DBC), cross-traffic (XT), available BW = (DBC – XT)]
  – Drop in performance (from original path: SLAC-CENIC-Caltech to SLAC-ESnet-LosNettos (100 Mbps)-Caltech)
  – Back to original path
  – Changes detected by IEPM-Iperf and ABwE
  – ESnet-LosNettos segment in the path (100 Mbits/s)
Notes:
1. Caltech misrouted via Los-Nettos 100 Mbps commercial net 14:00-17:00
2. ESnet/GEANT working on routes from 2:00 to 14:00
3. A previous occurrence went un-noticed for 2 months
4. Next step is to auto detect and notify