TRANSCRIPT
Making Network Tomography Practical
Renata Teixeira
Laboratoire LIP6, CNRS and UPMC Paris Universitas
Internet monitoring is essential
For network operators
– Monitor service-level agreements
– Troubleshoot failures
– Diagnose anomalous behavior
For users or content/application providers
– Verify network performance
Challenge 1: Nobody controls end-to-end path
Network operators only have data about one AS
End-hosts can only monitor end-to-end paths
[Figure: an end-to-end path crossing AS1, AS2, AS3, and AS4]
Challenge 2: Available data is not direct
Network operators
Is my network performance good?
– Only have per-link counts or active probes
Is there a problem? Where?
– There may be no alarm
Users, applications
Is my provider’s performance good?
– Only have end-to-end delay and loss
Network tomography to the rescue
Inference of unknown network properties from measurable ones
Sophisticated inference algorithms
– Given a model and available measurements
– Apply statistical inference to estimate properties
• Maximum likelihood estimator, Bayesian inference
Unfortunately, limited practical deployment
– Measuring the required inputs is difficult
This tutorial
Monitoring techniques to make network tomography practical
Outline
Examples of network tomography problems
Case study: fault diagnosis
– Fault detection: continuous path monitoring
– Fault identification: binary tomography
• Correlated path reachability
• Topology measurements
Open issues
Network tomography problems
Estimation of a network’s traffic matrix
– Given total traffic on network links
– What is the traffic between a network’s entry and exit points?
Inference of link performance
– Given end-to-end probes
– What is the loss rate or delay of a link?
Inference of network topology
– Given end-to-end loss measurements
– What is the logical network topology?
Inference of link performance
What are the properties of network links?
– Loss rate
– Delay
– Bandwidth
– Connectivity
Given end-to-end measurements
– No access to routers
[Figure: end-to-end probes across routers A–F spanning AS 1 and AS 2]
Multicast-based Inference of Network-internal Characteristics
Measurements
– Multicast probes
– Traces collected at receivers
Inference
– Exploit correlation in traces to estimate link properties
Introduced by the MINC project
[Figure: a probe sender multicasting to probe collectors]
Inferring link loss rates
Assumptions
– Known, logical-tree topology
– Losses are independent
– Multicast probes
Methodology
– Maximum likelihood estimates for the link success probabilities αk
[Figure: two-receiver tree from monitor m to targets t1 and t2, with link success probabilities α1, α2, α3 and their estimates α̂1, α̂2, α̂3]
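For this two-receiver tree, the estimator has a simple form that exploits the correlation between receivers: if γ1 = P(t1 receives) = α1·α2, γ2 = P(t2 receives) = α1·α3, and γ12 = P(both receive) = α1·α2·α3, then γ1·γ2/γ12 = α1. A simulation sketch (topology and loss probabilities are illustrative, not from the slides):

```python
import random

def minc_estimate(outcomes):
    """Estimate link success probabilities for the two-receiver tree.

    outcomes: list of (r1, r2) booleans, whether each receiver got a probe.
    Returns estimates (a1, a2, a3) for the shared link and the two
    receiver links, using a1 = g1*g2/g12.
    """
    n = len(outcomes)
    g1 = sum(r1 for r1, _ in outcomes) / n            # ~ a1*a2
    g2 = sum(r2 for _, r2 in outcomes) / n            # ~ a1*a3
    g12 = sum(r1 and r2 for r1, r2 in outcomes) / n   # ~ a1*a2*a3
    a1 = g1 * g2 / g12   # (a1*a2)(a1*a3)/(a1*a2*a3) = a1
    return a1, g1 / a1, g2 / a1

# Simulate multicast probes with independent per-link losses.
random.seed(0)
a1, a2, a3 = 0.95, 0.9, 0.8
outcomes = []
for _ in range(200000):
    shared = random.random() < a1   # probe survives the shared link
    outcomes.append((shared and random.random() < a2,
                     shared and random.random() < a3))
est = minc_estimate(outcomes)
print(est)  # close to (0.95, 0.9, 0.8)
```

The key point is that neither γ1 nor γ2 alone separates the shared link from the receiver link; only the correlation across receivers does.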
Binary tomography
Labels links as good or bad
– Loss rate estimation requires tight correlation
– Instead, separate good from bad performance
– If a link is bad, all paths that cross the link are bad
[Figure: the same tree from m to t1 and t2, with each link labeled good or bad]
Single-source tree
“Smallest Consistent Failure Set” algorithm
– Assumes a single-source tree and known topology
– Finds the smallest set of links that explains the bad paths
• Given that bad links are uncommon
• A bad link is the root of a maximal bad subtree
[Figure: tree from m to t1 and t2; the inferred bad link is the root of the maximal bad subtree]
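The rule above can be sketched directly: walk down from the source, and whenever an entire subtree is unreachable, blame the single link at its root rather than anything deeper. This is a minimal sketch (tree and node names are illustrative):

```python
def scfs(children, path_status):
    """Smallest Consistent Failure Set on a single-source tree (a sketch).

    children: dict node -> list of child nodes ('root' is the source).
    path_status: dict leaf -> True if the path root->leaf is good.
    Returns the set of links (parent, child) inferred as bad; each is the
    root link of a maximal bad subtree.
    """
    bad_links = set()

    def subtree_good(node):
        # A leaf is good iff its path measured good; an internal node is
        # good iff some leaf below it is still reachable.
        if not children.get(node):
            return path_status[node]
        return any(subtree_good(c) for c in children[node])

    def walk(node):
        for c in children.get(node, []):
            if subtree_good(c):
                walk(c)                    # the failure, if any, is deeper
            else:
                bad_links.add((node, c))   # root of a maximal bad subtree

    walk('root')
    return bad_links

# Tree: root -> a -> {t1, t2}; root -> t3. Both t1 and t2 are bad, so the
# single link (root, a) explains all bad paths.
tree = {'root': ['a', 't3'], 'a': ['t1', 't2']}
status = {'t1': False, 't2': False, 't3': True}
print(scfs(tree, status))  # {('root', 'a')}
```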
Binary tomography with multiple sources and targets
Problem becomes NP-hard
– Minimum hitting set problem
• Hitting set of a link = paths that traverse the link
Iterative greedy heuristic
– Given the set of links on bad paths
– Iteratively choose the link that explains the maximum number of bad paths
Promising for fault identification
[Figure: two monitors m1, m2 probing targets t1 and t2 across a shared network]
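The greedy heuristic above is straightforward to sketch: repeatedly pick the link that lies on the most still-unexplained bad paths. Link names below are illustrative:

```python
def greedy_fault_localization(bad_paths):
    """Greedy heuristic for the minimum-hitting-set formulation (a sketch).

    bad_paths: list of sets of links; each set is the links of one bad path.
    Repeatedly picks the link that appears on the most still-unexplained
    bad paths, until every bad path is explained by some chosen link.
    """
    unexplained = [set(p) for p in bad_paths]
    suspects = []
    while unexplained:
        # Count how many unexplained bad paths each link would explain.
        counts = {}
        for path in unexplained:
            for link in path:
                counts[link] = counts.get(link, 0) + 1
        best = max(counts, key=lambda l: counts[l])
        suspects.append(best)
        unexplained = [p for p in unexplained if best not in p]
    return suspects

# Two bad paths share link 'B-C'; the heuristic blames that single link.
bad = [{'m1-A', 'A-B', 'B-C', 'C-t1'}, {'m2-B', 'B-C', 'C-t2'}]
print(greedy_fault_localization(bad))  # ['B-C']
```

The greedy choice gives the classic logarithmic approximation guarantee for set cover / hitting set, which is why it works well in practice despite NP-hardness.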
Practical issues
Topology is often unknown
– Need to measure accurate topology
Limited deployment of multicast
– Need to extract correlation from unicast probes
– Even using probes from different monitors
Control of targets is not always practical
– Need one-way performance from round-trip probes
Links can fail for some paths, but not all
– Need to extend tomography algorithms
Outline
Examples of network tomography problems
Case study: fault diagnosis
– Fault detection: continuous path monitoring
– Fault identification: binary tomography
• Correlated path reachability
• Topology measurements
Open issues
Steps of fault diagnosis
[Figure: an end-to-end path crossing AS1–AS4]
Detection: continuous path monitoring
Identification: binary tomography
FAULT DETECTION
Detection techniques
Active probing: ping
– Send probe and collect response
– No control of targets
Passive analysis of users’ traffic
– tcpdump: tap all incoming and outgoing packets
– Monitoring of TCP connections
Detection with ping
If the monitor receives a reply
– Then, the path is good
If no reply before timeout
– Then, the path is bad
[Figure: monitor m sends an ICMP echo request to target t; t returns an ICMP echo reply]
Persistent failure or measurement noise?
Many reasons to lose a probe or reply
– Timeout may be too short
– Rate limiting at routers
– Some end-hosts don’t respond to ICMP requests
– Transient congestion
– Routing change
Need to confirm that the failure is persistent
– Otherwise, may trigger false alarms
Failure confirmation
Upon detection of a failure, trigger extra probes
Goal: minimize detection errors
– Sending more probes
– Waiting longer between probes
Tradeoff: detection error vs. detection time
[Figure: a burst of packet losses on a path over time, and the resulting detection error]
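The tradeoff can be made concrete with a back-of-the-envelope model: if a healthy path loses each confirmation probe independently with some probability, the false-alarm probability shrinks geometrically with the number of probes, while detection time grows linearly. The numbers below are illustrative; the independence assumption is exactly why the slide suggests spacing probes apart, to escape loss bursts:

```python
def false_alarm_probability(loss_rate, n_probes, spacing_s):
    """Detection-error vs. detection-time tradeoff for failure confirmation.

    A path that is actually up is (wrongly) declared down only if all
    n_probes confirmation probes are lost. Assuming independent losses at
    loss_rate, more probes shrink the false-alarm probability but
    lengthen detection time.
    """
    return loss_rate ** n_probes, n_probes * spacing_s

for n in (1, 2, 4):
    p_fa, t = false_alarm_probability(0.05, n, 2.0)
    print(f"{n} probes: false alarm {p_fa:.2e}, detection time {t:.0f}s")
```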
Passive detection
tcpdump captures all packets
Track the status of each TCP connection
– RTTs, timeouts, retransmissions
Multiple timeouts indicate the path is bad
– If current seq. number > last seq. number seen
• Path is good
– If current seq. number = last seq. number seen
• A timeout has occurred
• After four timeouts, declare the path bad
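The sequence-number rule above can be sketched as a small state machine per TCP connection (a sketch; it consumes sequence numbers already extracted from a capture, e.g. by tcpdump):

```python
class PassiveDetector:
    """Passive failure detection from observed TCP sequence numbers.

    Follows the rule on the slide: a repeated (retransmitted) sequence
    number counts as a timeout; a new, higher sequence number means the
    path is making progress; after four timeouts, declare the path bad.
    """
    def __init__(self, max_timeouts=4):
        self.last_seq = -1
        self.timeouts = 0
        self.max_timeouts = max_timeouts

    def observe(self, seq):
        if seq > self.last_seq:        # new data: path is good
            self.last_seq = seq
            self.timeouts = 0
        elif seq == self.last_seq:     # retransmission after a timeout
            self.timeouts += 1
        return self.timeouts >= self.max_timeouts  # True => path is bad

d = PassiveDetector()
print([d.observe(s) for s in (100, 200, 200, 200, 200, 200)])
# [False, False, False, False, False, True]
```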
Passive vs. active detection
Passive
+ No need to inject traffic
+ Detects all failures that affect users’ traffic
+ Responses from targets that don’t respond to ping
‒ Not always possible to tap users’ traffic
‒ Only detects failures on paths with traffic
Active
+ No need to tap users’ traffic
+ Detects failures on any desired path
‒ Probing overhead
– Cover a large number of paths
– Detect failures fast
Active monitoring: reducing probing overhead
Goal: detect failures of any of the interfaces in the target network with minimum probing overhead
[Figure: monitors M1, M2 probing target hosts T1–T3 across a target network of routers A–D]
Simple solution: Coverage problem
Instead of probing all paths, select the minimum set of paths that covers all interfaces in the subscriber’s network
Coverage problem is NP-hard
– Solution: greedy set-cover heuristic
[Figure: monitors M1, M2 and targets T1–T3; a subset of paths covers all interfaces of routers A–D]
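The greedy set-cover heuristic picks, at each step, the path that covers the most still-uncovered interfaces. A minimal sketch (path and interface names are illustrative):

```python
def greedy_path_cover(paths):
    """Greedy set-cover heuristic for path selection (a sketch).

    paths: dict path name -> set of interfaces the path traverses.
    Repeatedly picks the path covering the most still-uncovered
    interfaces, until every interface is covered.
    """
    uncovered = set().union(*paths.values())
    chosen = []
    while uncovered:
        best = max(paths, key=lambda p: len(paths[p] & uncovered))
        chosen.append(best)
        uncovered -= paths[best]
    return chosen

# Illustrative topology: two of the three paths suffice to cover
# every interface, so only those two need to be probed.
paths = {
    'M1->T1': {'A.1', 'B.1', 'B.2'},
    'M1->T2': {'A.1', 'D.1'},
    'M2->T3': {'C.1', 'D.1', 'D.2'},
}
print(greedy_path_cover(paths))  # ['M1->T1', 'M2->T3']
```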
Coverage solution doesn’t detect all types of failures
Detects fail-stop failures
– Failures that affect all packets that traverse the faulty interface
• E.g., interface or router crashes, fiber cuts, bugs
But not path-specific failures
– Failures that affect only a subset of the paths that cross the faulty interface
• E.g., router misconfigurations
New formulation of the failure detection problem
Select the frequency to probe each path
– Lower-frequency per-path probing can achieve high-frequency probing of each interface
[Figure: paths probed once every 9 minutes share an interface that is effectively probed once every 3 minutes]
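The idea can be checked with a small calculation: an interface is probed whenever any path crossing it is probed, so (with probes spread out in time) its rate is the sum of the rates of those paths. The topology and numbers below are illustrative:

```python
from fractions import Fraction

def interface_probe_rates(paths, path_period_min):
    """Per-interface probing rate from per-path probing periods (a sketch).

    paths: dict path -> list of interfaces it traverses.
    path_period_min: dict path -> minutes between probes of that path.
    An interface's rate is the sum of the rates of the paths crossing it.
    """
    rates = {}
    for path, interfaces in paths.items():
        for iface in interfaces:
            rates[iface] = rates.get(iface, Fraction(0)) + \
                Fraction(1, path_period_min[path])
    return rates

# Three paths, each probed once every 9 minutes, all cross interface 'A.1',
# so 'A.1' is probed once every 3 minutes on average.
paths = {'p1': ['A.1', 'B.1'], 'p2': ['A.1', 'C.1'], 'p3': ['A.1', 'D.1']}
periods = {'p1': 9, 'p2': 9, 'p3': 9}
rates = interface_probe_rates(paths, periods)
print(rates['A.1'])  # 1/3 probe per minute, i.e., one every 3 minutes
```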
Is failure in forward or reverse path?
Paths can be asymmetric
– Load balancing
– Hot-potato routing
[Figure: the probe from m to t and the reply from t to m follow different paths]
Disambiguating one-way losses: Spoofing
Monitor requests the spoofer to send a probe
The probe carries the IP address of the monitor as its source
If the reply reaches the monitor, the reverse path is good
[Figure: the spoofer sends t a packet spoofed with source address of m; t’s reply traverses only the reverse path back to m]
Summary: Fault detection
Techniques to measure path reachability
– Active probing: ping + failure confirmation
– Passive analysis of TCP connections
Reducing overhead of active monitoring
– Select the set of paths to probe
– Trade-off: set of paths vs. probing frequency
No control of targets
– Only have round-trip measurements
– Spoofing differentiates forward/reverse failures
FAULT IDENTIFICATION: CORRELATED PATH REACHABILITY
Uncorrelated measurements lead to errors
Lack of synchronization leads to inconsistencies
– Probes cross links at different times
– Paths may change between probes
[Figure: probes from m to t1 and t2 at different times lead to a mistakenly inferred failure]
Sources of inconsistencies
In measurements from a single monitor
– Probing all targets can take time
In measurements from multiple monitors
– Hard to synchronize monitors so that all probes reach a link at the same time
– Impossible to generalize to all links
Inconsistent measurements with multiple monitors
[Figure: monitors m1…mK probing targets t1…tN; the path-reachability table mixes good and bad entries, giving inconsistent measurements]
Solution: Reprobe paths after failure
Consistency has a cost
– Delays fault identification
– Cannot identify short failures
[Figure: after a failure is detected, monitors m1…mK reprobe all paths to targets t1…tN; the new path-reachability table is consistent]
Summary: Correlated measurements
Correlation is essential to tomography
– Lack of correlation leads to false alarms
Correlation is hard with unicast probes
– Probing multiple targets takes time
– Multiple monitors cannot probe a link simultaneously
Solution: probe paths again after fault detection
– Trade-off: consistency vs. detection speed
FAULT IDENTIFICATION: ACCURATE TOPOLOGY
Measuring router topology
With access to routers (or “from inside”)
– Topology of one network
– Routing monitors (OSPF or IS-IS)
No access to routers (or “from outside”)
– Multi-AS topology or from end-hosts
– Monitors issue active probes: traceroute
Topology from inside
Routing protocols flood the state of each link
– Periodically refresh link state
– Report any changes: link down, link up, cost change
Monitor listens to link-state messages
– Acts as a regular router
• AT&T’s OSPFmon or Sprint’s PyRT for IS-IS
Combining link states gives the topology
– Easy to maintain; messages report any changes
Inferring a path from outside: traceroute
[Figure: probes with TTL = 1 and TTL = 2 elicit “TTL exceeded” replies from interfaces A.1 and B.1; the actual path m–A–B–t is inferred as the interface sequence A.1, B.1]
A traceroute path can be incomplete
Load balancing is widely used
– Traceroute only probes one path
Sometimes traceroute has no answer (stars)
– ICMP rate limiting
– Anonymous routers
Tunnelling (e.g., MPLS) may hide routers
– Routers inside the tunnel may not decrement the TTL
Traceroute under load balancing
[Figure: a load balancer L splits traffic from m to t across routers B and E; the probes with TTL = 2 and TTL = 3 take different branches, so the inferred path has missing nodes and links, plus a false link]
Errors happen even under per-flow load balancing
Traceroute uses the destination port as the probe identifier
Per-flow load balancers use the destination port as part of the flow identifier
– So successive probes belong to different flows and may take different branches
[Figure: probes with TTL = 2 (port 2) and TTL = 3 (port 3) are hashed to different branches at load balancer L]
Paris traceroute
Solves the problem with per-flow load balancing
– Probes to a destination belong to the same flow
Changes the location of the probe identifier
– Uses the UDP checksum
[Figure: probes with TTL = 2 and TTL = 3 keep the same destination port (port 1) and differ only in checksum, so both follow the same branch at load balancer L]
Topology from traceroutes
Inferred nodes = interfaces, not routers
Coverage depends on monitors and targets
– Misses links and routers
– Some links and routers appear multiple times
[Figure: actual topology with routers A–D vs. the inferred topology made of interfaces A.1, B.3, C.1, C.2, D.1]
Alias resolution: Map interfaces to routers
Direct probing
– Probe one interface; the response may come from another
– Responses from the same router have close IP identifiers and the same TTL
Record-route IP option
– Records up to nine IP addresses of routers along the path
[Figure: interfaces C.1 and C.2 in the inferred topology belong to the same router]
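The IP-identifier test for direct probing can be sketched as follows: many routers draw the IP-ID of the packets they originate from a single shared counter, so if two interfaces are on the same router, probe responses collected alternately from the two addresses should interleave into one increasing sequence with small gaps. This is a sketch; the threshold and ID values are illustrative:

```python
def likely_aliases(ids_a, ids_b, max_gap=10):
    """IP-ID test for alias resolution (a sketch; threshold is illustrative).

    ids_a, ids_b: IP identifiers from probe responses collected alternately
    from two interface addresses. If both interfaces share one router's
    IP-ID counter, the interleaved IDs form a single increasing sequence
    with small gaps.
    """
    interleaved = [x for pair in zip(ids_a, ids_b) for x in pair]
    gaps = [b - a for a, b in zip(interleaved, interleaved[1:])]
    return all(0 < g <= max_gap for g in gaps)

# Same router: IDs drawn from one shared counter interleave cleanly.
print(likely_aliases([1001, 1005, 1010], [1003, 1007, 1013]))  # True
# Different routers: independent counters do not interleave.
print(likely_aliases([1001, 1005, 1010], [52000, 52004, 52008]))  # False
```

A real tool would also compare the TTLs of the responses and repeat the test to rule out coincidences.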
Large-scale topology measurements
Probing a large topology takes time
– E.g., probing 1200 targets from PlanetLab nodes takes 5 minutes on average (using 30 threads)
– Probing more targets covers more links
– But getting a topology snapshot takes longer
Snapshot may be inaccurate
– Paths may change during the snapshot
Hard to keep the topology up to date
– To know that a path changed, need to re-probe
Faster topology snapshots
Probing redundancy
– Intra-monitor
– Inter-monitor
Doubletree
– Combines backward and forward probing to eliminate redundancy
[Figure: monitors m1, m2 and targets t1, t2; paths overlap near the monitors and near the targets]
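Doubletree's stop rule can be sketched on a single path: probe forward from a mid-path TTL until hitting an interface another monitor already reported toward the same destination (the global stop set), and backward until hitting an interface this monitor has already seen (the local stop set). This is a simplified sketch over an abstract hop list, not real packets:

```python
def doubletree_probe(path, start_ttl, local_stop, global_stop):
    """Simplified Doubletree probing of one path (a sketch).

    path: the true interface at each TTL (index 0 = first hop).
    Probes forward from start_ttl until an interface already in the
    global stop set, and backward from start_ttl - 1 until one in the
    local stop set. Returns the newly discovered interfaces.
    """
    discovered = []
    for ttl in range(start_ttl, len(path)):       # forward probing
        hop = path[ttl]
        if hop in global_stop:
            break
        discovered.append(hop)
        global_stop.add(hop)
    for ttl in range(start_ttl - 1, -1, -1):      # backward probing
        hop = path[ttl]
        if hop in local_stop:
            break
        discovered.append(hop)
        local_stop.add(hop)
    return discovered

path = ['A', 'B', 'C', 'D', 'E']
# 'D' was already reported by another monitor; 'A' is next to this monitor.
print(doubletree_probe(path, 2, local_stop={'A'}, global_stop={'D'}))
# ['C', 'B']  -- redundant probes toward 'D', 'E', and 'A' are skipped
```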
Summary of techniques to measure topology
Routing messages
– Complete and accurate
– But need access to routers
Combining traceroutes
– Anyone can use it; no privileged access to routers
– But false or missing links and nodes
Topologies for tomography: some uncertainties
– Multiple topologies close to the time of an event
– Multiple paths between a monitor and a target
Outline
Examples of network tomography problems
Case study: fault diagnosis
– Fault detection: continuous path monitoring
– Fault identification: binary tomography
• Correlated path reachability
• Topology measurements
Open issues
Open issues
Fault detection
– How to detect faults or performance degradations that impact end-users?
– What are the overhead and speed of large-scale deployments?
– Will spoofing work in large-scale deployments?
Fault identification
– How to keep the topology up to date for fast identification?
– Do we need new tomography techniques to cope with partial failures?
– Could inference be easier with cooperation from routers?
REFERENCES
Network tomography theory
Survey on network tomography
– R. Castro, M. Coates, G. Liang, R. Nowak, and B. Yu, “Network Tomography: Recent Developments”, Statistical Science, Vol. 19, No. 3, pp. 499–517, 2004.
Traffic matrix estimation
– Y. Vardi, “Network Tomography: Estimating Source-Destination Traffic Intensities from Link Data”, Journal of the American Statistical Association, Vol. 91, 1996.
Inference of link performance/connectivity
– MINC project: http://gaia.cs.umass.edu/minc/
– A. Adams et al., “The Use of End-to-end Multicast Measurements for Characterizing Internal Network Behavior”, IEEE Communications Magazine, May 2000.
Binary tomography
Single-source tree algorithm
– N. Duffield, “Network Tomography of Binary Network Performance Characteristics”, IEEE Transactions on Information Theory, 2006.
Applying tomography in one network
– R. R. Kompella, J. Yates, A. Greenberg, and A. C. Snoeren, “Detection and Localization of Network Blackholes”, IEEE INFOCOM, 2007.
Applying tomography in a multiple-network topology
– A. Dhamdhere, R. Teixeira, C. Dovrolis, and C. Diot, “NetDiagnoser: Troubleshooting network unreachabilities using end-to-end probes and routing data”, CoNEXT, 2007.
Topology from inside
IS-IS monitoring
– R. Mortier, “Python Routeing Toolkit (`PyRT')”, https://research.sprintlabs.com/pyrt/
OSPF monitoring
– A. Shaikh and A. Greenberg, “OSPF Monitoring: Architecture, Design and Deployment Experience”, NSDI, 2004.
Commercial products
– Packet Design: http://www.packetdesign.com/
Topology with traceroute
Tracing accurate paths under load balancing
– B. Augustin et al., “Avoiding traceroute anomalies with Paris traceroute”, IMC, 2006.
Reducing the overhead to trace the topology of a network, and alias resolution with direct probing
– N. Spring, R. Mahajan, and D. Wetherall, “Measuring ISP Topologies with Rocketfuel”, SIGCOMM, 2002.
Use of record route to obtain more accurate topologies
– R. Sherwood, A. Bender, and N. Spring, “DisCarte: A Disjunctive Internet Cartographer”, SIGCOMM, 2008.
Reducing the overhead to trace a multi-network topology
– B. Donnet, P. Raoult, T. Friedman, and M. Crovella, “Efficient Algorithms for Large-Scale Topology Discovery”, SIGMETRICS, 2005.
Reducing overhead of active fault detection
Selection of paths to probe
– H. Nguyen and P. Thiran, “Active measurement for multiple link failures diagnosis in IP networks”, PAM, 2004.
– Y. Bejerano and R. Rastogi, “Robust monitoring of link delays and faults in IP networks”, INFOCOM, 2003.
Selection of the frequency to probe paths
– H. X. Nguyen, R. Teixeira, P. Thiran, and C. Diot, “Minimizing Probing Cost for Detecting Interface Failures: Algorithms and Scalability Analysis”, INFOCOM, 2009.
Internet-wide fault detection systems
Detection with BGP monitoring plus continuous pings, spoofing to disambiguate one-way failures, and traceroute to locate faults
– E. Katz-Bassett, H. V. Madhyastha, J. P. John, A. Krishnamurthy, D. Wetherall, and T. Anderson, “Studying Black Holes in the Internet with Hubble”, NSDI, 2008.
Detection with passive monitoring of traffic of peer-to-peer systems or content distribution networks, plus traceroutes to locate faults
– M. Zhang, C. Zhang, V. Pai, L. Peterson, and R. Wang, “PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services”, OSDI, 2004.