TRANSCRIPT
Making Network Tomography Practical
Renata Teixeira
Laboratoire LIP6, CNRS and UPMC Paris Universitas
Internet monitoring is essential
For network operators
– Monitor service-level agreements
– Troubleshoot failures
– Diagnose anomalous behavior
For users or content/application providers
– Verify network performance
Challenge 1: Nobody controls end-to-end path
Network operators only have data about one AS
End-hosts can only monitor end-to-end paths
[Figure: an end-to-end path crossing AS1, AS2, AS3, and AS4]
Challenge 2: Available data is not direct
Network operators
Is my network performance good?
– Only have per-link counts or active probes
Is there a problem? Where?
– There may be no alarm
Users, applications
Is my provider’s performance good?
– Only have end-to-end delay and loss
Network tomography to the rescue
Inference of unknown network properties from measurable ones
Sophisticated inference algorithms
– Given a model and available measurements
– Apply statistical inference to estimate properties
• Maximum likelihood estimator, Bayesian inference
Unfortunately, limited practical deployment
– Measuring the required inputs is difficult
This tutorial
Monitoring techniques to make network tomography practical
Outline
Examples of network tomography problems
Case study: fault diagnosis
– Fault detection: continuous path monitoring
– Fault identification: binary tomography
• Correlated path reachability
• Topology measurements
Open issues
Network tomography problems
Estimation of a network’s traffic matrix
– Given total traffic on network links
– What is the traffic between a network’s entry and exit points?
Inference of link performance
– Given end-to-end probes
– What is the loss rate or delay of a link?
Inference of network topology
– Given end-to-end loss measurements
– What is the logical network topology?
Inference of link performance
What are the properties of network links?
– Loss rate
– Delay
– Bandwidth
– Connectivity
Given end-to-end measurements
– No access to routers
[Figure: end-to-end probes across routers A–F spanning AS 1 and AS 2]
Multicast-based Inference of Network-internal Characteristics
Measurements
– Multicast probes
– Traces collected at receivers
Inference
– Exploit correlation in traces to estimate link properties
Introduced by the MINC project
[Figure: a probe sender multicasting to probe collectors]
Inferring link loss rates
Assumptions
– Known, logical-tree topology
– Losses are independent
– Multicast probes
Methodology
– Maximum likelihood estimates for the link success probabilities αk
[Figure: two-receiver tree from monitor m to targets t1 and t2, with link success probabilities α1, α2, α3 and their estimates α̂1, α̂2, α̂3]
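For this two-receiver tree, the estimator has a simple form that exploits the correlation between receivers: if γ1 = P(t1 receives) = α1·α2, γ2 = P(t2 receives) = α1·α3, and γ12 = P(both receive) = α1·α2·α3, then γ1·γ2/γ12 = α1. A simulation sketch (topology and loss probabilities are illustrative, not from the slides):

```python
import random

def minc_estimate(outcomes):
    """Estimate link success probabilities for the two-receiver tree.

    outcomes: list of (r1, r2) booleans, whether each receiver got a probe.
    Returns estimates (a1, a2, a3) for the shared link and the two
    receiver links, using a1 = g1*g2/g12.
    """
    n = len(outcomes)
    g1 = sum(r1 for r1, _ in outcomes) / n            # ~ a1*a2
    g2 = sum(r2 for _, r2 in outcomes) / n            # ~ a1*a3
    g12 = sum(r1 and r2 for r1, r2 in outcomes) / n   # ~ a1*a2*a3
    a1 = g1 * g2 / g12   # (a1*a2)(a1*a3)/(a1*a2*a3) = a1
    return a1, g1 / a1, g2 / a1

# Simulate multicast probes with independent per-link losses.
random.seed(0)
a1, a2, a3 = 0.95, 0.9, 0.8
outcomes = []
for _ in range(200000):
    shared = random.random() < a1   # probe survives the shared link
    outcomes.append((shared and random.random() < a2,
                     shared and random.random() < a3))
est = minc_estimate(outcomes)
print(est)  # close to (0.95, 0.9, 0.8)
```

The key point is that neither γ1 nor γ2 alone separates the shared link from the receiver link; only the correlation across receivers does.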
Binary tomography
Labels links as good or bad
– Loss rate estimation requires tight correlation
– Instead, separate good from bad performance
– If a link is bad, all paths that cross the link are bad
[Figure: the same tree from m to t1 and t2, with each link labeled good or bad]
Single-source tree
“Smallest Consistent Failure Set” algorithm
– Assumes a single-source tree and known topology
– Finds the smallest set of links that explains the bad paths
• Given that bad links are uncommon
• A bad link is the root of a maximal bad subtree
[Figure: tree from m to t1 and t2; the inferred bad link is the root of the maximal bad subtree]
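The rule above can be sketched directly: walk down from the source, and whenever an entire subtree is unreachable, blame the single link at its root rather than anything deeper. This is a minimal sketch (tree and node names are illustrative):

```python
def scfs(children, path_status):
    """Smallest Consistent Failure Set on a single-source tree (a sketch).

    children: dict node -> list of child nodes ('root' is the source).
    path_status: dict leaf -> True if the path root->leaf is good.
    Returns the set of links (parent, child) inferred as bad; each is the
    root link of a maximal bad subtree.
    """
    bad_links = set()

    def subtree_good(node):
        # A leaf is good iff its path measured good; an internal node is
        # good iff some leaf below it is still reachable.
        if not children.get(node):
            return path_status[node]
        return any(subtree_good(c) for c in children[node])

    def walk(node):
        for c in children.get(node, []):
            if subtree_good(c):
                walk(c)                    # the failure, if any, is deeper
            else:
                bad_links.add((node, c))   # root of a maximal bad subtree

    walk('root')
    return bad_links

# Tree: root -> a -> {t1, t2}; root -> t3. Both t1 and t2 are bad, so the
# single link (root, a) explains all bad paths.
tree = {'root': ['a', 't3'], 'a': ['t1', 't2']}
status = {'t1': False, 't2': False, 't3': True}
print(scfs(tree, status))  # {('root', 'a')}
```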
Binary tomography with multiple sources and targets
Problem becomes NP-hard
– Minimum hitting set problem
• Hitting set of a link = paths that traverse the link
Iterative greedy heuristic
– Given the set of links on bad paths
– Iteratively choose the link that explains the maximum number of bad paths
Promising for fault identification
[Figure: two monitors m1, m2 probing targets t1 and t2 across a shared network]
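The greedy heuristic above is straightforward to sketch: repeatedly pick the link that lies on the most still-unexplained bad paths. Link names below are illustrative:

```python
def greedy_fault_localization(bad_paths):
    """Greedy heuristic for the minimum-hitting-set formulation (a sketch).

    bad_paths: list of sets of links; each set is the links of one bad path.
    Repeatedly picks the link that appears on the most still-unexplained
    bad paths, until every bad path is explained by some chosen link.
    """
    unexplained = [set(p) for p in bad_paths]
    suspects = []
    while unexplained:
        # Count how many unexplained bad paths each link would explain.
        counts = {}
        for path in unexplained:
            for link in path:
                counts[link] = counts.get(link, 0) + 1
        best = max(counts, key=lambda l: counts[l])
        suspects.append(best)
        unexplained = [p for p in unexplained if best not in p]
    return suspects

# Two bad paths share link 'B-C'; the heuristic blames that single link.
bad = [{'m1-A', 'A-B', 'B-C', 'C-t1'}, {'m2-B', 'B-C', 'C-t2'}]
print(greedy_fault_localization(bad))  # ['B-C']
```

The greedy choice gives the classic logarithmic approximation guarantee for set cover / hitting set, which is why it works well in practice despite NP-hardness.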
Practical issues
Topology is often unknown
– Need to measure accurate topology
Limited deployment of multicast
– Need to extract correlation from unicast probes
– Even using probes from different monitors
Control of targets is not always practical
– Need one-way performance from round-trip probes
Links can fail for some paths, but not all
– Need to extend tomography algorithms
Outline
Examples of network tomography problems
Case study: fault diagnosis
– Fault detection: continuous path monitoring
– Fault identification: binary tomography
• Correlated path reachability
• Topology measurements
Open issues
Steps of fault diagnosis
[Figure: an end-to-end path crossing AS1–AS4]
Detection: continuous path monitoring
Identification: binary tomography
FAULT DETECTION
Detection techniques
Active probing: ping
– Send probe and collect response
– No control of targets
Passive analysis of users’ traffic
– tcpdump: tap all incoming and outgoing packets
– Monitoring of TCP connections
Detection with ping
If the monitor receives a reply
– Then, the path is good
If no reply before timeout
– Then, the path is bad
[Figure: monitor m sends an ICMP echo request to target t; t returns an ICMP echo reply]
Persistent failure or measurement noise?
Many reasons to lose a probe or reply
– Timeout may be too short
– Rate limiting at routers
– Some end-hosts don’t respond to ICMP requests
– Transient congestion
– Routing change
Need to confirm that the failure is persistent
– Otherwise, may trigger false alarms
Failure confirmation
Upon detection of a failure, trigger extra probes
Goal: minimize detection errors
– Sending more probes
– Waiting longer between probes
Tradeoff: detection error vs. detection time
[Figure: a burst of packet losses on a path over time, and the resulting detection error]
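The tradeoff can be made concrete with a back-of-the-envelope model: if a healthy path loses each confirmation probe independently with some probability, the false-alarm probability shrinks geometrically with the number of probes, while detection time grows linearly. The numbers below are illustrative; the independence assumption is exactly why the slide suggests spacing probes apart, to escape loss bursts:

```python
def false_alarm_probability(loss_rate, n_probes, spacing_s):
    """Detection-error vs. detection-time tradeoff for failure confirmation.

    A path that is actually up is (wrongly) declared down only if all
    n_probes confirmation probes are lost. Assuming independent losses at
    loss_rate, more probes shrink the false-alarm probability but
    lengthen detection time.
    """
    return loss_rate ** n_probes, n_probes * spacing_s

for n in (1, 2, 4):
    p_fa, t = false_alarm_probability(0.05, n, 2.0)
    print(f"{n} probes: false alarm {p_fa:.2e}, detection time {t:.0f}s")
```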
Passive detection
tcpdump captures all packets
Track the status of each TCP connection
– RTTs, timeouts, retransmissions
Multiple timeouts indicate the path is bad
– If current seq. number > last seq. number seen
• Path is good
– If current seq. number = last seq. number seen
• A timeout has occurred
• After four timeouts, declare the path bad
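The sequence-number rule above can be sketched as a small state machine per TCP connection (a sketch; it consumes sequence numbers already extracted from a capture, e.g. by tcpdump):

```python
class PassiveDetector:
    """Passive failure detection from observed TCP sequence numbers.

    Follows the rule on the slide: a repeated (retransmitted) sequence
    number counts as a timeout; a new, higher sequence number means the
    path is making progress; after four timeouts, declare the path bad.
    """
    def __init__(self, max_timeouts=4):
        self.last_seq = -1
        self.timeouts = 0
        self.max_timeouts = max_timeouts

    def observe(self, seq):
        if seq > self.last_seq:        # new data: path is good
            self.last_seq = seq
            self.timeouts = 0
        elif seq == self.last_seq:     # retransmission after a timeout
            self.timeouts += 1
        return self.timeouts >= self.max_timeouts  # True => path is bad

d = PassiveDetector()
print([d.observe(s) for s in (100, 200, 200, 200, 200, 200)])
# [False, False, False, False, False, True]
```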
Passive vs. active detection
Passive
+ No need to inject traffic
+ Detects all failures that affect users’ traffic
+ Responses from targets that don’t respond to ping
‒ Not always possible to tap users’ traffic
‒ Only detects failures on paths with traffic
Active
+ No need to tap users’ traffic
+ Detects failures on any desired path
‒ Probing overhead
– Cover a large number of paths
– Detect failures fast
Active monitoring: reducing probing overhead
Goal: detect failures of any of the interfaces in the target network with minimum probing overhead
[Figure: monitors M1, M2 probing target hosts T1–T3 across a target network of routers A–D]
Simple solution: Coverage problem
Instead of probing all paths, select the minimum set of paths that covers all interfaces in the subscriber’s network
Coverage problem is NP-hard
– Solution: greedy set-cover heuristic
[Figure: monitors M1, M2 and targets T1–T3; a subset of paths covers all interfaces of routers A–D]
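The greedy set-cover heuristic picks, at each step, the path that covers the most still-uncovered interfaces. A minimal sketch (path and interface names are illustrative):

```python
def greedy_path_cover(paths):
    """Greedy set-cover heuristic for path selection (a sketch).

    paths: dict path name -> set of interfaces the path traverses.
    Repeatedly picks the path covering the most still-uncovered
    interfaces, until every interface is covered.
    """
    uncovered = set().union(*paths.values())
    chosen = []
    while uncovered:
        best = max(paths, key=lambda p: len(paths[p] & uncovered))
        chosen.append(best)
        uncovered -= paths[best]
    return chosen

# Illustrative topology: two of the three paths suffice to cover
# every interface, so only those two need to be probed.
paths = {
    'M1->T1': {'A.1', 'B.1', 'B.2'},
    'M1->T2': {'A.1', 'D.1'},
    'M2->T3': {'C.1', 'D.1', 'D.2'},
}
print(greedy_path_cover(paths))  # ['M1->T1', 'M2->T3']
```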
Coverage solution doesn’t detect all types of failures
Detects fail-stop failures
– Failures that affect all packets that traverse the faulty interface
• E.g., interface or router crashes, fiber cuts, bugs
But not path-specific failures
– Failures that affect only a subset of the paths that cross the faulty interface
• E.g., router misconfigurations
New formulation of the failure detection problem
Select the frequency to probe each path
– Lower-frequency per-path probing can achieve high-frequency probing of each interface
[Figure: paths probed once every 9 minutes share an interface that is effectively probed once every 3 minutes]
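The idea can be checked with a small calculation: an interface is probed whenever any path crossing it is probed, so (with probes spread out in time) its rate is the sum of the rates of those paths. The topology and numbers below are illustrative:

```python
from fractions import Fraction

def interface_probe_rates(paths, path_period_min):
    """Per-interface probing rate from per-path probing periods (a sketch).

    paths: dict path -> list of interfaces it traverses.
    path_period_min: dict path -> minutes between probes of that path.
    An interface's rate is the sum of the rates of the paths crossing it.
    """
    rates = {}
    for path, interfaces in paths.items():
        for iface in interfaces:
            rates[iface] = rates.get(iface, Fraction(0)) + \
                Fraction(1, path_period_min[path])
    return rates

# Three paths, each probed once every 9 minutes, all cross interface 'A.1',
# so 'A.1' is probed once every 3 minutes on average.
paths = {'p1': ['A.1', 'B.1'], 'p2': ['A.1', 'C.1'], 'p3': ['A.1', 'D.1']}
periods = {'p1': 9, 'p2': 9, 'p3': 9}
rates = interface_probe_rates(paths, periods)
print(rates['A.1'])  # 1/3 probe per minute, i.e., one every 3 minutes
```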
Is failure in forward or reverse path?
Paths can be asymmetric
– Load balancing
– Hot-potato routing
[Figure: the probe from m to t and the reply from t to m follow different paths]
Disambiguating one-way losses: Spoofing
Monitor requests the spoofer to send a probe
The probe carries the IP address of the monitor as its source
If the reply reaches the monitor, the reverse path is good
[Figure: the spoofer sends t a packet spoofed with source address of m; t’s reply traverses only the reverse path back to m]
Summary: Fault detection
Techniques to measure path reachability
– Active probing: ping + failure confirmation
– Passive analysis of TCP connections
Reducing overhead of active monitoring
– Select the set of paths to probe
– Trade-off: set of paths vs. probing frequency
No control of targets
– Only have round-trip measurements
– Spoofing differentiates forward/reverse failures
FAULT IDENTIFICATION: CORRELATED PATH REACHABILITY
Uncorrelated measurements lead to errors
Lack of synchronization leads to inconsistencies
– Probes cross links at different times
– Paths may change between probes
[Figure: probes from m to t1 and t2 at different times lead to a mistakenly inferred failure]
Sources of inconsistencies
In measurements from a single monitor
– Probing all targets can take time
In measurements from multiple monitors
– Hard to synchronize monitors so that all probes reach a link at the same time
– Impossible to generalize to all links
Inconsistent measurements with multiple monitors
[Figure: monitors m1…mK probing targets t1…tN; the path-reachability table mixes good and bad entries, giving inconsistent measurements]
Solution: Reprobe paths after failure
Consistency has a cost
– Delays fault identification
– Cannot identify short failures
[Figure: after a failure is detected, monitors m1…mK reprobe all paths to targets t1…tN; the new path-reachability table is consistent]
Summary: Correlated measurements
Correlation is essential to tomography
– Lack of correlation leads to false alarms
Correlation is hard with unicast probes
– Probing multiple targets takes time
– Multiple monitors cannot probe a link simultaneously
Solution: probe paths again after fault detection
– Trade-off: consistency vs. detection speed
FAULT IDENTIFICATION: ACCURATE TOPOLOGY
Measuring router topology
With access to routers (or “from inside”)
– Topology of one network
– Routing monitors (OSPF or IS-IS)
No access to routers (or “from outside”)
– Multi-AS topology or from end-hosts
– Monitors issue active probes: traceroute
Topology from inside
Routing protocols flood the state of each link
– Periodically refresh link state
– Report any changes: link down, link up, cost change
Monitor listens to link-state messages
– Acts as a regular router
• AT&T’s OSPFmon or Sprint’s PyRT for IS-IS
Combining link states gives the topology
– Easy to maintain; messages report any changes
Inferring a path from outside: traceroute
[Figure: probes with TTL = 1 and TTL = 2 elicit “TTL exceeded” replies from interfaces A.1 and B.1; the actual path m–A–B–t is inferred as the interface sequence A.1, B.1]
A traceroute path can be incomplete
Load balancing is widely used
– Traceroute only probes one path
Sometimes traceroute has no answer (stars)
– ICMP rate limiting
– Anonymous routers
Tunnelling (e.g., MPLS) may hide routers
– Routers inside the tunnel may not decrement the TTL
Traceroute under load balancing
[Figure: a load balancer L splits traffic from m to t across routers B and E; the probes with TTL = 2 and TTL = 3 take different branches, so the inferred path has missing nodes and links, plus a false link]
Errors happen even under per-flow load balancing
Traceroute uses the destination port as the probe identifier
Per-flow load balancers use the destination port as part of the flow identifier
– So successive probes belong to different flows and may take different branches
[Figure: probes with TTL = 2 (port 2) and TTL = 3 (port 3) are hashed to different branches at load balancer L]
Paris traceroute
Solves the problem with per-flow load balancing
– Probes to a destination belong to the same flow
Changes the location of the probe identifier
– Uses the UDP checksum
[Figure: probes with TTL = 2 and TTL = 3 keep the same destination port (port 1) and differ only in checksum, so both follow the same branch at load balancer L]
Topology from traceroutes
Inferred nodes = interfaces, not routers
Coverage depends on monitors and targets
– Misses links and routers
– Some links and routers appear multiple times
[Figure: actual topology with routers A–D vs. the inferred topology made of interfaces A.1, B.3, C.1, C.2, D.1]
Alias resolution: Map interfaces to routers
Direct probing
– Probe one interface; the response may come from another
– Responses from the same router have close IP identifiers and the same TTL
Record-route IP option
– Records up to nine IP addresses of routers along the path
[Figure: interfaces C.1 and C.2 in the inferred topology belong to the same router]
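The IP-identifier test for direct probing can be sketched as follows: many routers draw the IP-ID of the packets they originate from a single shared counter, so if two interfaces are on the same router, probe responses collected alternately from the two addresses should interleave into one increasing sequence with small gaps. This is a sketch; the threshold and ID values are illustrative:

```python
def likely_aliases(ids_a, ids_b, max_gap=10):
    """IP-ID test for alias resolution (a sketch; threshold is illustrative).

    ids_a, ids_b: IP identifiers from probe responses collected alternately
    from two interface addresses. If both interfaces share one router's
    IP-ID counter, the interleaved IDs form a single increasing sequence
    with small gaps.
    """
    interleaved = [x for pair in zip(ids_a, ids_b) for x in pair]
    gaps = [b - a for a, b in zip(interleaved, interleaved[1:])]
    return all(0 < g <= max_gap for g in gaps)

# Same router: IDs drawn from one shared counter interleave cleanly.
print(likely_aliases([1001, 1005, 1010], [1003, 1007, 1013]))  # True
# Different routers: independent counters do not interleave.
print(likely_aliases([1001, 1005, 1010], [52000, 52004, 52008]))  # False
```

A real tool would also compare the TTLs of the responses and repeat the test to rule out coincidences.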
Large-scale topology measurements
Probing a large topology takes time
– E.g., probing 1200 targets from PlanetLab nodes takes 5 minutes on average (using 30 threads)
– Probing more targets covers more links
– But getting a topology snapshot takes longer
Snapshot may be inaccurate
– Paths may change during the snapshot
Hard to keep the topology up to date
– To know that a path changed, need to re-probe
Faster topology snapshots
Probing redundancy
– Intra-monitor
– Inter-monitor
Doubletree
– Combines backward and forward probing to eliminate redundancy
[Figure: monitors m1, m2 and targets t1, t2; paths overlap near the monitors and near the targets]
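Doubletree's stop rule can be sketched on a single path: probe forward from a mid-path TTL until hitting an interface another monitor already reported toward the same destination (the global stop set), and backward until hitting an interface this monitor has already seen (the local stop set). This is a simplified sketch over an abstract hop list, not real packets:

```python
def doubletree_probe(path, start_ttl, local_stop, global_stop):
    """Simplified Doubletree probing of one path (a sketch).

    path: the true interface at each TTL (index 0 = first hop).
    Probes forward from start_ttl until an interface already in the
    global stop set, and backward from start_ttl - 1 until one in the
    local stop set. Returns the newly discovered interfaces.
    """
    discovered = []
    for ttl in range(start_ttl, len(path)):       # forward probing
        hop = path[ttl]
        if hop in global_stop:
            break
        discovered.append(hop)
        global_stop.add(hop)
    for ttl in range(start_ttl - 1, -1, -1):      # backward probing
        hop = path[ttl]
        if hop in local_stop:
            break
        discovered.append(hop)
        local_stop.add(hop)
    return discovered

path = ['A', 'B', 'C', 'D', 'E']
# 'D' was already reported by another monitor; 'A' is next to this monitor.
print(doubletree_probe(path, 2, local_stop={'A'}, global_stop={'D'}))
# ['C', 'B']  -- redundant probes toward 'D', 'E', and 'A' are skipped
```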
Summary of techniques to measure topology
Routing messages
– Complete and accurate
– But need access to routers
Combining traceroutes
– Anyone can use it; no privileged access to routers
– But false or missing links and nodes
Topologies for tomography: some uncertainties
– Multiple topologies close to the time of an event
– Multiple paths between a monitor and a target
Outline
Examples of network tomography problems
Case study: fault diagnosis
– Fault detection: continuous path monitoring
– Fault identification: binary tomography
• Correlated path reachability
• Topology measurements
Open issues
Open issues
Fault detection
– How to detect faults or performance degradations that impact end-users?
– What are the overhead and speed of large-scale deployments?
– Will spoofing work in large-scale deployments?
Fault identification
– How to keep the topology up to date for fast identification?
– Do we need new tomography techniques to cope with partial failures?
– Could inference be easier with cooperation from routers?
REFERENCES
Network tomography theory
Survey on network tomography
– R. Castro, M. Coates, G. Liang, R. Nowak, and B. Yu, “Network Tomography: Recent Developments”, Statistical Science, Vol. 19, No. 3, pp. 499–517, 2004.
Traffic matrix estimation
– Y. Vardi, “Network Tomography: Estimating Source-Destination Traffic Intensities from Link Data”, Journal of the American Statistical Association, Vol. 91, 1996.
Inference of link performance/connectivity
– MINC project: http://gaia.cs.umass.edu/minc/
– A. Adams et al., “The Use of End-to-end Multicast Measurements for Characterizing Internal Network Behavior”, IEEE Communications Magazine, May 2000.
Binary tomography
Single-source tree algorithm
– N. Duffield, “Network Tomography of Binary Network Performance Characteristics”, IEEE Transactions on Information Theory, 2006.
Applying tomography in one network
– R. R. Kompella, J. Yates, A. Greenberg, and A. C. Snoeren, “Detection and Localization of Network Blackholes”, IEEE INFOCOM, 2007.
Applying tomography in a multiple-network topology
– A. Dhamdhere, R. Teixeira, C. Dovrolis, and C. Diot, “NetDiagnoser: Troubleshooting network unreachabilities using end-to-end probes and routing data”, CoNEXT, 2007.
Topology from inside
IS-IS monitoring
– R. Mortier, “Python Routeing Toolkit (`PyRT')”, https://research.sprintlabs.com/pyrt/
OSPF monitoring
– A. Shaikh and A. Greenberg, “OSPF Monitoring: Architecture, Design and Deployment Experience”, NSDI, 2004.
Commercial products
– Packet Design: http://www.packetdesign.com/
Topology with traceroute
Tracing accurate paths under load balancing
– B. Augustin et al., “Avoiding traceroute anomalies with Paris traceroute”, IMC, 2006.
Reducing the overhead to trace the topology of a network, and alias resolution with direct probing
– N. Spring, R. Mahajan, and D. Wetherall, “Measuring ISP Topologies with Rocketfuel”, SIGCOMM, 2002.
Use of record route to obtain more accurate topologies
– R. Sherwood, A. Bender, and N. Spring, “DisCarte: A Disjunctive Internet Cartographer”, SIGCOMM, 2008.
Reducing the overhead to trace a multi-network topology
– B. Donnet, P. Raoult, T. Friedman, and M. Crovella, “Efficient Algorithms for Large-Scale Topology Discovery”, SIGMETRICS, 2005.
Reducing overhead of active fault detection
Selection of paths to probe
– H. Nguyen and P. Thiran, “Active measurement for multiple link failures diagnosis in IP networks”, PAM, 2004.
– Y. Bejerano and R. Rastogi, “Robust monitoring of link delays and faults in IP networks”, INFOCOM, 2003.
Selection of the frequency to probe paths
– H. X. Nguyen, R. Teixeira, P. Thiran, and C. Diot, “Minimizing Probing Cost for Detecting Interface Failures: Algorithms and Scalability Analysis”, INFOCOM, 2009.
Internet-wide fault detection systems
Detection with BGP monitoring plus continuous pings, spoofing to disambiguate one-way failures, and traceroute to locate faults
– E. Katz-Bassett, H. V. Madhyastha, J. P. John, A. Krishnamurthy, D. Wetherall, and T. Anderson, “Studying Black Holes in the Internet with Hubble”, NSDI, 2008.
Detection with passive monitoring of traffic of peer-to-peer systems or content distribution networks, plus traceroutes to locate faults
– M. Zhang, C. Zhang, V. Pai, L. Peterson, and R. Wang, “PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services”, OSDI, 2004.