Effective Diagnosis of Routing Disruptions from End Systems

Ying Zhang, Z. Morley Mao, Ming Zhang

Post on 22-Jan-2016


Page 1: Effective Diagnosis of Routing Disruptions from End Systems

Effective Diagnosis of Routing Disruptions from End Systems

Ying Zhang, Z. Morley Mao, Ming Zhang

Page 2: Effective Diagnosis of Routing Disruptions from End Systems

Routing disruptions impact application performance

- More applications today have high QoS requirements
- Routing events can cause high loss and long delays

[Figure labels: Src, AS B, AS C, Internet, AS D, AS E, Dst]

Page 3: Effective Diagnosis of Routing Disruptions from End Systems

Existing approaches to diagnose routing disruptions are ISP-centric

- Require routing data from many routers in ISPs [Feldmann04, Teixeira04, Wu05]
- Passive and accurate

[Figure labels: AS B, AS C, AS D, Internet, BGP collectors]

Page 4: Effective Diagnosis of Routing Disruptions from End Systems

Limitations of ISP-centric approaches

- Difficult to gain access to data from many ISPs
- BGP data reflects “expected” data-plane paths

[Figure labels: ISP, AS B, AS C, AS D, Internet, End-systems]

Page 5: Effective Diagnosis of Routing Disruptions from End Systems

Can we diagnose entirely from end systems?

- Goal: infer data-plane paths of many routers

[Figure labels: Probing host, ISP A, AS B, AS C, AS D, Dst]

Page 6: Effective Diagnosis of Routing Disruptions from End Systems

Our approach: end-system-based monitoring

- Only require probing from end hosts
- Cover all the PoPs of a target ISP

[Figure labels: Probing host, Target ISP, AS B, AS C, AS D, Dst]

Page 7: Effective Diagnosis of Routing Disruptions from End Systems

Our approach: end-system-based monitoring

- Cover most of the destinations on the Internet

[Figure labels: Probing host, ISP A, AS B, AS C, AS D, several Dst nodes]

Page 8: Effective Diagnosis of Routing Disruptions from End Systems

Our approach: end-system-based monitoring

- Identify routing changes by comparing paths measured consecutively

[Figure labels: Probing host, ISP A, AS B, AS C, AS D, Dst]
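The comparison of consecutively measured paths can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each measurement round has already been reduced to a mapping from (probing host, destination) to a list of hop labels, and the names `Snapshot` and `find_routing_events` are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed layout: (probing host, destination) -> measured path as hop labels.
Snapshot = Dict[Tuple[str, str], List[str]]


def find_routing_events(prev: Snapshot, curr: Snapshot):
    """Return (key, old_path, new_path) for every pair whose path changed."""
    events = []
    for key, new_path in curr.items():
        old_path = prev.get(key)
        # Pairs that were not probed in the previous round are skipped.
        if old_path is not None and old_path != new_path:
            events.append((key, old_path, new_path))
    return events


# Two consecutive rounds; only the path to dst1 differs.
prev = {("host1", "dst1"): ["ISP A", "AS B", "AS D"],
        ("host1", "dst2"): ["ISP A", "AS C"]}
curr = {("host1", "dst1"): ["ISP A", "AS C", "AS D"],
        ("host1", "dst2"): ["ISP A", "AS C"]}
events = find_routing_events(prev, curr)
print(events)
```

A real deployment would also have to cope with incomplete or noisy traces, which this sketch ignores.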

Page 9: Effective Diagnosis of Routing Disruptions from End Systems

Advantages and challenges

Advantages:
- No need to access ISP-proprietary data
- Identify actual data-plane paths
- Monitor data-plane performance

Challenges:
- Limited resources to probe
  - Coverage of probed paths
  - Timing granularity
- Measurement noise

Page 10: Effective Diagnosis of Routing Disruptions from End Systems

System architecture

[Figure labels: Collaborative probing; Event identification and classification; Event correlation and inference; Event impact analysis; Reports; Target ISP]

Page 11: Effective Diagnosis of Routing Disruptions from End Systems

Outline

- Collaborative probing
- Event identification and classification
- Event correlation and inference
- Result and validation

Page 12: Effective Diagnosis of Routing Disruptions from End Systems

Collaborative probing

- Using a set of hosts
  - To learn the routing state
  - To improve coverage
  - To reduce overhead

[Figure labels: Probing host, ISP A, AS B, AS C, AS D]

Page 13: Effective Diagnosis of Routing Disruptions from End Systems

Outline

- Collaborative probing
- Event identification and classification
- Event correlation and inference
- Result and validation

Page 14: Effective Diagnosis of Routing Disruptions from End Systems

Event classification

- Classify events according to ingress/egress changes
  - Type 1: Ingress PoP changes
  - Type 2: Ingress PoP same, egress PoP different
  - Type 3: Ingress PoP same, egress PoP same

[Figure labels: Probing host, Target ISP, Destination Prefix P]
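The three event types can be expressed as a small decision function. The sketch below assumes each measured path has been mapped to the ordered list of PoPs it crosses inside the target ISP (ingress first, egress last). The function name and the PoP names are hypothetical.

```python
from typing import List


def classify_event(old_pops: List[str], new_pops: List[str]) -> str:
    """Classify a routing event from the PoP-level paths inside the target ISP.

    Each list holds the PoPs traversed, ingress first and egress last
    (an assumed representation, not one given in the slides).
    """
    old_ingress, old_egress = old_pops[0], old_pops[-1]
    new_ingress, new_egress = new_pops[0], new_pops[-1]
    if old_ingress != new_ingress:
        return "Type 1"  # ingress PoP changes
    if old_egress != new_egress:
        return "Type 2"  # same ingress, different egress
    return "Type 3"      # same ingress and same egress


print(classify_event(["NYC", "CHI"], ["WDC", "CHI"]))                # Type 1
print(classify_event(["NYC", "CHI"], ["NYC", "DAL"]))                # Type 2
print(classify_event(["NYC", "CHI", "SEA"], ["NYC", "DEN", "SEA"]))  # Type 3
```

Checking the ingress first means an event that changes both ingress and egress is reported as Type 1, which matches the slide's wording that Type 1 only requires the ingress PoP to change.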

Page 15: Effective Diagnosis of Routing Disruptions from End Systems

Outline

- Collaborative probing
- Event identification and classification
- Event correlation and inference
- Result and validation

Page 16: Effective Diagnosis of Routing Disruptions from End Systems

Likely causes: link failures

[Figure labels: Probing host, Target ISP, Old path, New path, Old egress PoP, New egress PoP, Neighbor AS, Destination Prefix P]

Page 17: Effective Diagnosis of Routing Disruptions from End Systems

Likely causes: internal distance changes

- Hot potato changes
  - Cost of old internal path increases
  - Cost of new internal path decreases

[Figure labels: Probing host, Old egress PoP, New egress PoP, Neighbor AS; distance labels shown: 120, 80, 100, 120]
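Hot-potato routing picks the egress with the smallest internal distance, so either of the two cost changes listed on the slide can move the egress. The sketch below illustrates this with made-up distances. The values loosely echo the figure labels, but how they map to the old and new paths is an assumption.

```python
def pick_egress(distances):
    """Hot-potato choice: the egress PoP with the smallest internal distance."""
    return min(distances, key=distances.get)


# Illustrative internal distances from one ingress to two candidate egress PoPs.
before = {"old_egress": 100, "new_egress": 120}
old_cost_up = {"old_egress": 140, "new_egress": 120}   # old internal path gets costlier
new_cost_down = {"old_egress": 100, "new_egress": 80}  # new internal path gets cheaper

print(pick_egress(before))         # old_egress
print(pick_egress(old_cost_up))    # new_egress
print(pick_egress(new_cost_down))  # new_egress
```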

Page 18: Effective Diagnosis of Routing Disruptions from End Systems

Event correlation

- Spatial correlation: a single network failure often affects multiple routers
- Temporal correlation: routing events occurring close together are likely due to only a few causes
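Temporal correlation can be approximated by grouping events whose timestamps fall close together. The sketch below is one simple way to do this and is not taken from the slides: `cluster_by_time` and the `max_gap` threshold are hypothetical, and timestamps are assumed to be in seconds.

```python
from typing import List


def cluster_by_time(timestamps: List[float], max_gap: float) -> List[List[float]]:
    """Group event timestamps so that consecutive members of a cluster are
    at most max_gap seconds apart (max_gap is a tunable assumption)."""
    clusters: List[List[float]] = []
    for t in sorted(timestamps):
        if clusters and t - clusters[-1][-1] <= max_gap:
            clusters[-1].append(t)   # close to the previous event: same cluster
        else:
            clusters.append([t])     # gap too large: start a new cluster
    return clusters


clusters = cluster_by_time([0, 30, 45, 900, 910, 5000], max_gap=60)
print(clusters)  # [[0, 30, 45], [900, 910], [5000]]
```

Each resulting cluster would then be passed to the inference step as a set of events that may share a cause.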

Page 19: Effective Diagnosis of Routing Disruptions from End Systems

Inference methodology

- Evidence: an event that supports the cause

[Figure labels: Probing host (two), Target ISP, Link L, New path, New egress, Destination prefix P; Cause: Link L is down]

Page 20: Effective Diagnosis of Routing Disruptions from End Systems

Inference methodology

- Conflict: a measurement trace that conflicts with the cause

[Figure labels: Probing host (two), Target ISP, Link L, New path, New egress, Destination prefix P; Cause: Link L is down]

Page 21: Effective Diagnosis of Routing Disruptions from End Systems

Inference methodology

- Evidence node: [1,2,3] -> [1,2,4]
- Cause: link 2-3 down
- Cause: node 3 withdraws the route

[Figure labels: AS 1, AS 2, AS 3, AS 4, Withdrawal]

Page 22: Effective Diagnosis of Routing Disruptions from End Systems

Inference methodology

Evidence graph:
- Evidence node: [1,2,3] -> [1,2,4]
- Evidence node: [0,2,3] -> [0,2,4]
- Cause: link 2-3 down
- Cause: node 3 withdraws the route

[Figure labels: AS 0, AS 1, AS 2, AS 3, AS 4, Withdrawal]
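The evidence graph links each observed path change to the causes that could explain it. The sketch below rebuilds the slide's example. It assumes only the two cause types shown on the slide (a link failure and a route withdrawal), and the helper names `candidate_causes` and `build_evidence_graph` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Path = Tuple[int, ...]


def candidate_causes(old: Path, new: Path) -> List[str]:
    """Name two candidate causes for a path change: failure of the first
    link dropped from the old path, or a route withdrawal by the node at
    the far end of that link."""
    i = 0
    while i < min(len(old), len(new)) and old[i] == new[i]:
        i += 1
    if i == 0 or i >= len(old):
        return []  # no shared prefix, or the old path is a prefix of the new one
    near, far = old[i - 1], old[i]
    return [f"link {near}-{far} down", f"node {far} withdraws the route"]


def build_evidence_graph(events: List[Tuple[Path, Path]]) -> Dict[str, Set[Tuple[Path, Path]]]:
    """Map each candidate cause to the set of events that support it."""
    graph = defaultdict(set)
    for old, new in events:
        for cause in candidate_causes(old, new):
            graph[cause].add((old, new))
    return dict(graph)


events = [((1, 2, 3), (1, 2, 4)), ((0, 2, 3), (0, 2, 4))]
graph = build_evidence_graph(events)
for cause, evidence in graph.items():
    print(cause, len(evidence))
```

Both events diverge right after AS 2, so each of the two causes ends up supported by both evidence nodes.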

Page 23: Effective Diagnosis of Routing Disruptions from End Systems

Inference methodology

Conflict graph:
- Conflict node: [1,2,3,6]
- Conflict node: [0,2,3,6]
- Conflict node: [0,2,3]
- Cause: link 2-3 down
- Cause: node 3 withdraws the route

[Figure labels: AS 0, AS 1, AS 2, AS 3, AS 6]

Page 24: Effective Diagnosis of Routing Disruptions from End Systems

Inference methodology

- Greedy algorithm: find the minimum set of causes that can explain all the evidence while minimizing conflicts

Evidence graph:
- Evidence node: [1,2,3] -> [1,2,4]
- Evidence node: [0,2,3] -> [0,2,4]

Conflict graph:
- Conflict node: [1,2,3,6]
- Conflict node: [0,2,3,6]
- Conflict node: [0,2,3]

Scores shown for the two candidate causes: (Evidence: 2, Conflicts: 3) and (Evidence: 2, Conflicts: 0)
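One plausible reading of the greedy step is sketched below. The slides do not spell out the scoring or tie-breaking rule, so the rule used here (most unexplained evidence first, fewest conflicts as tie-breaker) is an assumption. Attaching all three conflict nodes to "link 2-3 down" is also an interpretation: each of those traces still crosses link 2-3, which matches the counts of 3 and 0 on the slide.

```python
from typing import Dict, List, Set


def greedy_infer(evidence: Dict[str, Set[str]],
                 conflicts: Dict[str, Set[str]]) -> List[str]:
    """Pick causes until every evidence node is explained.

    At each step choose the cause that explains the most still-unexplained
    evidence, breaking ties by the fewest conflicts (assumed rule).
    """
    unexplained = set().union(*evidence.values())
    chosen: List[str] = []
    while unexplained:
        best = max(
            evidence,
            key=lambda c: (len(evidence[c] & unexplained),
                           -len(conflicts.get(c, set()))),
        )
        if not evidence[best] & unexplained:
            break  # no remaining cause explains anything new
        chosen.append(best)
        unexplained -= evidence[best]
    return chosen


both_events = {"[1,2,3]->[1,2,4]", "[0,2,3]->[0,2,4]"}
evidence = {"link 2-3 down": set(both_events),
            "node 3 withdraws the route": set(both_events)}
# Assumed mapping: traces that still use link 2-3 contradict "link 2-3 down".
conflicts = {"link 2-3 down": {"[1,2,3,6]", "[0,2,3,6]", "[0,2,3]"},
             "node 3 withdraws the route": set()}

chosen = greedy_infer(evidence, conflicts)
print(chosen)  # ['node 3 withdraws the route']
```

Both causes explain the same two events, so the conflict count decides, and the withdrawal is selected as the single cause.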

Page 25: Effective Diagnosis of Routing Disruptions from End Systems

Outline

- Collaborative probing
- Event identification and classification
- Event correlation and inference
- Result and validation

Page 26: Effective Diagnosis of Routing Disruptions from End Systems

ISPs studied

| AS Name | ASN (Tier) | Periods | # of Src | # of PoPs | # of Probes | Probe Gap |
|---|---|---|---|---|---|---|
| AT&T | | 3/23-4/9 | 230 | 111 | 61453 | 18.3 min |
| Verio | | 4/10-4/22, 9/13-9/22 | 218 | 46 | 81024 | 19.3 min |
| Deutsche Telekom | | 4/23-5/22 | 149 | 64 | 27958 | 17.5 min |
| Savvis | | 5/23-6/24 | 178 | 39 | 40989 | 17.4 min |
| Abilene | | 9/23-9/30, 2/3-2/17 | 113 | 11 | 51037 | 18.4 min |

Page 27: Effective Diagnosis of Routing Disruptions from End Systems

Results of event classification

- Many events are internal changes
- Abilene has many ingress changes

| Target AS | Total events (% all traces) | Diff egress | Same ingress, egress: internal PoP path | Same ingress, egress: external AS path | Diff ingress |
|---|---|---|---|---|---|
| AT&T | 0.35% | 12.1% | 51% | 35% | 11% |
| Verio | 0.31% | 27.3% | 48% | 19% | 9.8% |
| Deutsche Telekom | 0.66% | 4.9% | 8.5% | 80.7% | 7.2% |
| Savvis | 0.35% | 11% | 45% | 31% | 14% |
| Abilene | 0.24% | 13.6% | 37% | 40% | 17% |

Page 28: Effective Diagnosis of Routing Disruptions from End Systems

Validation with BGP-based approach [Wu05]

- Hot potato changes: egress point changes due to internal distance changes

| Hot potato changes | BGP-based | Our method | Both (false negatives, false positives) |
|---|---|---|---|
| Tier-1 AS | 147 | 185 | 101 (31%, 45%) |
| Abilene network | 79 | 88 | 60 (24%, 31%) |

Column notes: "BGP-based" is the number of incidents identified by the BGP method, "Our method" is the number identified by our method, and "Both" is the number identified by both.
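The percentages in parentheses appear to follow from the three counts if the BGP-based results are treated as ground truth. The sketch below states that assumption explicitly. The formulas are an interpretation and are not given in the slides.

```python
def error_rates(bgp_count: int, our_count: int, both: int):
    """Assumed definitions, with the BGP-based results as ground truth."""
    false_negative = 1 - both / bgp_count  # BGP incidents that our method missed
    false_positive = 1 - both / our_count  # our incidents that BGP did not confirm
    return false_negative, false_positive


# Tier-1 AS row of the hot potato table.
fn, fp = error_rates(bgp_count=147, our_count=185, both=101)
print(f"{fn:.1%}, {fp:.1%}")  # 31.3%, 45.4%
```

Under this reading the Tier-1 row reproduces the slide's (31%, 45%). The same formulas give roughly 24% and 32% for the Abilene row, within a point of the slide's (24%, 31%).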

Page 29: Effective Diagnosis of Routing Disruptions from End Systems

Validation with BGP-based approach

- Session resets: peering link up/down
- Inaccuracy reasons:
  - Limited coverage
  - Coarse-grained probing
  - Measurement noise

| Session reset | BGP-based | Our method | Both |
|---|---|---|---|
| Tier-1 AS | 9 | 15 | 6 (33%, 50%) |
| Abilene network | 7 | 11 | 7 (0%, 36%) |

Page 30: Effective Diagnosis of Routing Disruptions from End Systems

System performance

- Can keep up with generated routing state
- Applicable for real-time diagnosis and mitigation
  - Reactive: construct alternate paths to bypass the problem
  - Proactive: avoid paths with many historical routing disruptions

Page 31: Effective Diagnosis of Routing Disruptions from End Systems

Conclusion

- Developed the first system to diagnose routing disruptions purely from end systems
- Used a simple greedy algorithm on two bipartite graphs to infer causes
- Comprehensively validated the accuracy

Page 32: Effective Diagnosis of Routing Disruptions from End Systems

Thank you!

Questions?

Page 33: Effective Diagnosis of Routing Disruptions from End Systems

Performance impact analysis

- End-to-end latency changes caused by different types of routing events

Page 34: Effective Diagnosis of Routing Disruptions from End Systems

Validation with BGP data

- BGP feeds from RouteViews, RIPE, Abilene, and 29 BGP feeds from a Tier-1 ISP
- The destination prefix coverage and the routing event detection rate

| Target AS | Dst. prefix coverage | Dst. prefix traversing PoPs with BGP feeds | Detected events (AS change, next hop change) | Missed events (short-duration, filtering, other) |
|---|---|---|---|---|
| AT&T | 15% | 1.5% | 11% (10.3%, 3.2%) | 89% (75%, 13%, 1%) |
| Verio | 18.6% | 18.1% | 23% (19.1%, 8.6%) | 77% (73%, 4%, 0%) |
| Savvis | 7.8% | 1.1% | 6% (5.8%, 0.5%) | 94% (80%, 9%, 5%) |
| Abilene | 6% | 6% | 21% (17.3%, 5.8%) | 79% (61%, 15%, 3%) |

Page 35: Effective Diagnosis of Routing Disruptions from End Systems

Event classification: same ingress PoP, different egress PoP

- Policy changes
  - Local preference in the old route decreases
  - Local preference in the new route increases

[Figure labels: Probing host, Target ISP, Old path, New path, Old egress PoP, New egress PoP, Neighbor AS; Local Pref: 100->50; Local Pref: 60->110]

Page 36: Effective Diagnosis of Routing Disruptions from End Systems

Event classification: same ingress PoP, different egress PoP

- External routing changes
  - Old route worsens due to external factors (withdrawal, longer AS path)
  - New route improves due to external factors

[Figure labels: Probing host, Target ISP, Old path, New path, Old egress PoP, New egress PoP, AS A, AS B; AS path changes shown: ABCD->ABEFD, BCEFD->BEFD]

Page 37: Effective Diagnosis of Routing Disruptions from End Systems

Event classification: same ingress PoP, same egress PoP

- Internal PoP path changes
  - Cost of old internal path increases
  - Cost of new internal path decreases
- External AS path changes

[Figure labels: Probing host, Target ISP, Old path, New path, Destination Prefix P]

Page 38: Effective Diagnosis of Routing Disruptions from End Systems

Results of cause inference

- Effectiveness of inference algorithm
  - Clusters: a group of events with the same root cause

Page 39: Effective Diagnosis of Routing Disruptions from End Systems

Event identification

- A routing event: path changes
- Event identification: comparing consecutive routing snapshots

[Figure labels: Probing host, ISP A, AS B, AS C, AS D, Dst]