Presentation by Michael Smathers, Usman Jafarey
CS395/495 IMRE, April 24, 2006
PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services
Detecting Path Anomalies
• Characterizing path misbehavior requires a large volume of traffic data, which wide-area services can provide:
– Peer-to-peer (P2P) systems
– Content distribution networks (CDN)
• Solution: combine passive monitoring of wide-area networks with active probes to quantify and characterize anomalies.
Traditional Detection…
• Traceroute only maps the forward path; without cooperation from the destination, it is difficult to infer whether a problem lies on the forward or the reverse path.
• BGP/OSPF propagate failure information. Traceroute may stop at a hop that is not the source of the failure.
• High variance in failure duration makes it difficult to respond in time.
• Few sites have enough coverage to identify all paths affected by a failure.
Advantages of this approach
• More accurate, complete view of failures, thanks to the geographical diversity of nodes
• Minimal overhead: active probing is initiated only after passive monitoring detects an anomaly
• High rate of failure detection, thanks to large volumes of traffic
PlanetLab Test Bed
• Passively monitoring traffic on PlanetLab since February 2004 to detect anomalous behaviour
– Coordinate active probes between PlanetLab sites to confirm and characterize each anomaly and measure its scope
• ~90,000 anomalies confirmed each month with PlanetSeer.
Components
• Wide-area service network: CoDeeN
– 7-12K clients/day
– 100-200 GB/day
– 5-7 million requests/day
– 120 nodes in North America (350 world-wide)
• Passive Monitoring Daemons (MonD) run on all CoDeeN nodes to detect anomalous TCP traffic behaviour.
• Active Probing Daemons (ProbeD) run on all PlanetLab nodes, including CoDeeN nodes, awaiting requests from MonDs.
Operation
1. MonD detects an anomaly and sends a request to its local ProbeD.
2. The local ProbeD contacts ProbeDs on other nodes to coordinate a planet-wide probe.
3. ProbeDs are organized in groups for distributed probing.
MonD - Operation
• Uses PlanetLab's tcpdump to observe all incoming and outgoing TCP packets.
• Uses this information to generate path-level and flow-level statistics, which are used to identify possible anomalies in real time.
• Two indicators of anomalies:
– Change in the TTL (Time To Live) field
– Multiple consecutive timeouts
• Current threshold: 4 timeouts
– If MonD is on the receiving side, ACKs are not reaching the sender. We can assume the forward path is at fault.
– If MonD is the sender, we cannot determine from timeouts which path contains the problem.
MonD - Timeout Detection
When MonD is the sender, it maintains two variables for each flow:
• SendSeqNo, the sequence number of the most recently sent packet.
• SendRtxCount, the number of times that packet has been retransmitted.
For each outgoing packet, MonD compares its sequence number, CurrentSeqNo, with SendSeqNo:
• CurrentSeqNo > SendSeqNo: the flow is making progress. Clear SendRtxCount and set SendSeqNo to CurrentSeqNo.
• CurrentSeqNo < SendSeqNo: fast retransmit. Set SendSeqNo to CurrentSeqNo.
• CurrentSeqNo = SendSeqNo: timeout. Increment SendRtxCount. If SendRtxCount exceeds the threshold, MonD notifies ProbeD of a possible anomaly.
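The sender-side bookkeeping can be sketched as follows. This is a minimal illustration under stated assumptions, not PlanetSeer's code: the class `SenderFlowState` and the method `on_packet_sent` are hypothetical names, SendRtxCount is left untouched on a fast retransmit because the slide does not say otherwise, and "exceeds the threshold" is read as "reaches 4 timeouts".

```python
from dataclasses import dataclass

TIMEOUT_THRESHOLD = 4  # "Current threshold: 4 timeouts" (MonD - Operation slide)


@dataclass
class SenderFlowState:
    """Per-flow state kept when MonD is the sender (hypothetical class)."""
    send_seq_no: int = -1    # sequence number of the most recently sent packet
    send_rtx_count: int = 0  # times that packet has been retransmitted

    def on_packet_sent(self, current_seq_no: int) -> bool:
        """Update state for one outgoing packet.

        Returns True when ProbeD should be notified of a possible anomaly.
        """
        if current_seq_no > self.send_seq_no:
            # Flow is making progress: clear the count and advance.
            self.send_rtx_count = 0
            self.send_seq_no = current_seq_no
            return False
        if current_seq_no < self.send_seq_no:
            # Fast retransmit; count left unchanged (assumption).
            self.send_seq_no = current_seq_no
            return False
        # Same sequence number again: counted as a timeout.
        self.send_rtx_count += 1
        # Assumption: notify once the count reaches the threshold.
        return self.send_rtx_count >= TIMEOUT_THRESHOLD
```

One such object would be kept per flow, with `on_packet_sent` called for every outgoing TCP segment that MonD observes.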
MonD - cont’d…
When MonD is the receiver, it maintains the largest sequence number seen per flow. If the current packet has the same sequence number, it increments a counter.
When the counter hits the threshold, MonD notifies ProbeD that the sender is not seeing ACKs.
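A minimal sketch of the receiver-side check, under assumptions the slides do not state: `ReceiverMonitor` and `on_packet_received` are hypothetical names, the threshold is taken to be the same 4 used for timeouts, and the counter is reset whenever a larger sequence number arrives.

```python
from collections import defaultdict

DUPLICATE_THRESHOLD = 4  # assumed equal to the sender-side timeout threshold


class ReceiverMonitor:
    """Tracks repeated sequence numbers per flow when MonD is the receiver."""

    def __init__(self):
        self.largest_seq = {}              # flow id -> largest sequence number seen
        self.dup_count = defaultdict(int)  # flow id -> repeats of that number

    def on_packet_received(self, flow_id, seq_no) -> bool:
        """Return True when ProbeD should be told the sender is not seeing ACKs."""
        largest = self.largest_seq.get(flow_id)
        if largest is None or seq_no > largest:
            # New data: remember it and reset the counter (assumption).
            self.largest_seq[flow_id] = seq_no
            self.dup_count[flow_id] = 0
            return False
        if seq_no == largest:
            # Same sequence number again: the sender is retransmitting.
            self.dup_count[flow_id] += 1
            return self.dup_count[flow_id] >= DUPLICATE_THRESHOLD
        return False  # older segment; ignored in this sketch
```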
ProbeD
Three probing operations:
1. Baseline probes: run when a new IP is added to the MonD path table.
2. Forward probes: traceroutes invoked at multiple geographically distributed nodes when MonD detects an anomaly. Rate limited: ProbeD will not forward-probe the same destination more than once in 10 minutes.
3. Reprobes: if an anomaly is confirmed by the forward probes, the initial ProbeD sends reprobes to determine the duration and effects of the anomaly. Reprobes are sent 0.5, 1.5, 3.5 and 7.5 hours after the anomaly detection time and are compared with the original baseline and forward probes.
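The forward-probe rate limit and the reprobe schedule can be illustrated with a short sketch. The 10-minute interval and the reprobe offsets come from the slide; the names `ForwardProbeLimiter`, `allow` and `reprobe_times` are hypothetical, and keying the limiter on a destination string is an assumption.

```python
from datetime import datetime, timedelta

FORWARD_PROBE_INTERVAL = timedelta(minutes=10)
REPROBE_OFFSETS_HOURS = (0.5, 1.5, 3.5, 7.5)  # hours after detection


class ForwardProbeLimiter:
    """Allows at most one forward probe per destination per interval."""

    def __init__(self):
        self._last_probe = {}  # destination -> time of last allowed probe

    def allow(self, destination: str, now: datetime) -> bool:
        last = self._last_probe.get(destination)
        if last is not None and now - last < FORWARD_PROBE_INTERVAL:
            return False  # probed too recently
        self._last_probe[destination] = now
        return True


def reprobe_times(detected_at: datetime) -> list:
    """Absolute times at which reprobes would be sent."""
    return [detected_at + timedelta(hours=h) for h in REPROBE_OFFSETS_HOURS]
```

The gaps between successive reprobes are 1, 2 and 4 hours, so the schedule backs off by doubling the interval each time.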
ProbeD - Operation
• 353 ProbeDs running on 145 PlanetLab sites.
• Distributed across North/South America, Europe, Asia and elsewhere.
• Membership information is kept for ProbeDs to avoid unnecessary communication with dead nodes.
• 30 ProbeD node groups, based on geographic diversity.
• ProbeD receives a request from its local MonD, then:
– forwards the request to one ProbeD from each group
– the ProbeDs perform the probe and send results to the requester
– the originator collects the data
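A minimal sketch of the fan-out step, choosing one live ProbeD per group. The function name `pick_probers`, the data shapes, and the use of a random choice within each group are assumptions; the slides only say that one ProbeD from each group receives the request and that membership information is used to avoid dead nodes.

```python
import random


def pick_probers(groups, alive, rng=random):
    """Pick one live ProbeD from each group.

    groups: dict mapping group name -> list of node names
    alive:  set of node names currently believed to be up
    Groups with no live member are skipped.
    """
    chosen = []
    for members in groups.values():
        candidates = [node for node in members if node in alive]
        if candidates:
            # Random selection is an assumption, not stated in the slides.
            chosen.append(rng.choice(candidates))
    return chosen
```

With 30 groups, a single anomaly report would trigger at most 30 remote traceroutes toward the destination, one per group.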
ProbeD - Dataset
• 887,521 unique client IPs from 9,232 ASes.
• Probes traversed 10,090 ASes (over half the ASes on the Internet).
• 2,259,558 possible anomalies
• 271,898 confirmed
Repairing Traceroute Data
• Unusable hops, identified by * in place of a name, are removed. Relative hop count is maintained.
• Missing hops are found by comparing traceroutes that share a destination.
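The first repair step can be sketched in a few lines. The function name `drop_unusable_hops` and the list-of-strings input format are assumptions; the idea shown is only that `*` entries are dropped while each remaining router keeps its original hop number.

```python
def drop_unusable_hops(raw_hops):
    """Return (hop_number, router) pairs, skipping '*' entries.

    Hop numbers start at 1 and are those of the original traceroute,
    so the relative hop count between routers is preserved.
    """
    return [(i, hop) for i, hop in enumerate(raw_hops, start=1) if hop != "*"]
```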
Anomaly Detection
An anomaly is confirmed if any of the following conditions is met:
• There is a loop in the traceroute
• The local traceroute disagrees with the baseline
• The local traceroute doesn't reach the destination but other traceroutes do
• The traceroute returns ICMP destination unreachable
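The four confirmation conditions can be expressed as a simple check. This is a sketch under assumptions: `ProbeResult` and `confirmation_reasons` are hypothetical names, loop detection is taken as a precomputed flag, and "disagrees with baseline" is modelled as the hop lists being unequal.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """Outcome of one traceroute (hypothetical structure)."""
    hops: list                    # router names in order
    reached_destination: bool
    icmp_unreachable: bool = False
    has_loop: bool = False        # assumed to be computed elsewhere


def confirmation_reasons(local, baseline_hops, remote_results):
    """Return the list of conditions met; the anomaly is confirmed if non-empty."""
    reasons = []
    if local.has_loop:
        reasons.append("loop")
    if local.hops != baseline_hops:
        reasons.append("differs from baseline")
    if not local.reached_destination and any(
        r.reached_destination for r in remote_results
    ):
        reasons.append("only remote probes reach destination")
    if local.icmp_unreachable:
        reasons.append("ICMP destination unreachable")
    return reasons
```

Returning the matched conditions rather than a single boolean keeps the information needed later to classify the anomaly.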
Routing Loops
• Detected if the same sequence is observed at least 3 times in a traceroute.
• Persistent loops: traceroute stays in the loop until max hops.
• Temporary loops: the loop resolves before max hops.
• Reprobes determine the duration of a persistent loop.
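One way to apply the "same sequence at least 3 times" rule is sketched below. Several choices here are assumptions rather than details from the slides: the repeats are required to be consecutive, a loop must involve at least two routers, the default `max_hops` of 30 is the common traceroute default, and the name `find_loop` is hypothetical.

```python
MIN_REPEATS = 3  # "same sequence observed at least 3 times"


def find_loop(hops, max_hops=30):
    """Return (cycle, kind) for the first loop found, or None.

    kind is "persistent" if the loop runs to the end of a traceroute
    that used all max_hops, otherwise "temporary".
    """
    n = len(hops)
    for start in range(n):
        # A cycle of this period must fit MIN_REPEATS times in the remainder.
        for period in range(2, (n - start) // MIN_REPEATS + 1):
            cycle = hops[start:start + period]
            end = start + period
            repeats = 1
            while hops[end:end + period] == cycle:
                repeats += 1
                end += period
            if repeats >= MIN_REPEATS:
                # Absorb a trailing partial repetition of the cycle.
                while end < n and hops[end] == hops[end - period]:
                    end += 1
                persistent = end == n and n >= max_hops
                return cycle, "persistent" if persistent else "temporary"
    return None
```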
Measuring Scope
• Number of routers/ASes involved in a loop.
• Loop length: number of routers involved
• Temporary loops have longer lengths than persistent loops
• Persistent loops generally involve a single AS
• Loops mapped by tiers of the ASes involved
Loop Effects
• Temporary loops overload routers
• Persistent loops cause loss of connectivity
• Degrade latency
Reference Paths
• Distinguish between forward/reverse anomalies
• Scope of anomaly: hops between anomaly & end host
• Classify as either path change or path outage
• Evaluating Reference Paths
– Hazards: destination behind firewall, intermediate router filtering
– Firewall heuristics: choosing an appropriate distance n between host & anomaly
  • 0 < RevHop(dst) - RevHop(Sx) < n
Non-Loop Anomalies
• Comparing reference path (R) with local path (L)
– Path change: L reaches last hop of R
– Path outage: L cuts out before R
– Path outage + Path change: L diverges from R, arrives at R’s last hop
• Breakdown of all anomalies observed:
– Path Change: 48%
– Forward Outage: 10%
– Other: 24%
– Temporary: 18%
Path Changes
• Define scope: # hops on R that could change next hop value
• Remote traceroutes from various locations, find intercept path
– Intercept path narrows scope
• Find relative location of anomaly, i.e. near host
– Find distance of path change by averaging the distances of all paths in scope
Path Outage
• Distinguish between forward, reverse paths
• Forward path:
– Route change on forward path, in addition to outage
– ICMP dest. unreachable
– Reported as timeout on forward path by MonD
• 35% of anomalies found to be Fwd Timeout (inferred by MonD)
– Indistinguishable without passive/active probes
Path Change Detection - AS
Bypassing Anomalies
• How many failures can be bypassed?
– For all clients with a reference path, 62,815 reachability failures
– Of these, PlanetSeer nodes were able to reach the destination in 27,263 cases (43% of failures)
– Same results achieved using 15 vantage points as with all 30
• Bypass ratio: ratio of the minimum RTT of any bypass path to the RTT of the baseline path
– Improves latency in 23% of new paths
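A small worked example of the two quantities on this slide. The function name `bypass_ratio` and the sample RTT values are hypothetical; the failure counts are those quoted above. The ratio is taken as minimum bypass RTT divided by baseline RTT, so a value below 1 means the best bypass path has lower latency than the baseline.

```python
def bypass_ratio(bypass_rtts_ms, baseline_rtt_ms):
    """Minimum RTT over all bypass paths divided by the baseline RTT."""
    return min(bypass_rtts_ms) / baseline_rtt_ms


# Illustrative RTTs only: best bypass is 80 ms against a 100 ms baseline.
example_ratio = bypass_ratio([120.0, 80.0], 100.0)

# Share of reachability failures that some vantage point could bypass.
bypassable_pct = 27263 / 62815 * 100
```

Here `bypassable_pct` rounds to the 43% figure quoted on the slide.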
Related Work
• BGP: misconfiguration classification
– Locate origin via time, prefix, view
• Traceroute: path symmetry; 49% asymmetric, 91% persist for more than several hours
• Ping/Traceroute hybrids
Conclusions
• Passive Monitoring
– Enables much faster detection of anomalies
– Better resolution, temporary anomaly detection
• Failure distribution (AS topology)
– Tier 1 most stable, Tier 3 least stable
• Loop Behaviour
– Temporary loops have much longer lengths
– Most span 4 routers
• Path Change resolution
– 63% of outages occur within 3 hops of end host
– Over half confined to 2 ASes, 50% confined within 3 hops
• Alternate path discovery
– Largely unsuccessful; most outages near network edge lack any redundancy