Effective Diagnosis of Routing Disruptions from End Systems

Post on 22-Jan-2016


DESCRIPTION

Effective Diagnosis of Routing Disruptions from End Systems. Ying Zhang, Z. Morley Mao, Ming Zhang. Routing disruptions impact application performance: more applications today have high QoS requirements, and routing events can cause high loss and long delays.

TRANSCRIPT

Effective Diagnosis of Routing Disruptions from End Systems

Ying Zhang, Z. Morley Mao, Ming Zhang

Routing disruptions impact application performance
- More applications today have high QoS requirements
- Routing events can cause high loss and long delays

[Figure: Src and Dst connected across the Internet through AS B, AS C, AS D, and AS E]

Existing approaches to diagnosing routing disruptions are ISP-centric
- Require routing data from many routers in ISPs [Feldmann04, Teixeira04, Wu05]
- Passive and accurate

[Figure: BGP collectors attached to AS B, AS C, and AS D across the Internet]

Limitations of ISP-centric approaches
- Difficult to gain access to data from many ISPs
- BGP data reflects “expected” data-plane paths

[Figure: end systems facing an ISP, AS B, AS C, and AS D across the Internet, with question marks on the unknown paths]

Can we diagnose entirely from end systems?
- Goal: infer data-plane paths of many routers

[Figure: a probing host reaching Dst through ISP A, AS B, AS C, and AS D]

Our approach: end-system-based monitoring
- Only requires probing from end hosts
- Covers all the PoPs of a target ISP

[Figure: a probing host reaching Dst through the target ISP, AS B, AS C, and AS D]

Our approach: end-system-based monitoring
- Covers most of the destinations on the Internet

[Figure: a probing host reaching several destinations (Dst) through ISP A, AS B, AS C, and AS D]

Our approach: end-system-based monitoring
- Identifies routing changes by comparing consecutively measured paths

[Figure: a probing host reaching Dst through ISP A, AS B, AS C, and AS D]
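The comparison of consecutively measured paths can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `detect_path_changes`, the snapshot layout (a dictionary from destination to a tuple of hops), and the sample hops are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Path = Tuple[str, ...]


def detect_path_changes(
    previous: Dict[str, Path], current: Dict[str, Path]
) -> List[Tuple[str, Path, Path]]:
    """Return (destination, old_path, new_path) for each changed path."""
    events = []
    for dst, new_path in current.items():
        old_path = previous.get(dst)
        # Only destinations measured in both rounds can be compared.
        if old_path is not None and old_path != new_path:
            events.append((dst, old_path, new_path))
    return events


# Hypothetical snapshots from two consecutive probing rounds.
previous_round = {"dst1": ("AS A", "AS B", "AS D"), "dst2": ("AS A", "AS C")}
current_round = {"dst1": ("AS A", "AS C", "AS D"), "dst2": ("AS A", "AS C")}
changes = detect_path_changes(previous_round, current_round)
```

Here only `dst1` is reported, because its path differs between the two rounds while the path to `dst2` is unchanged.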

Advantages and challenges

Advantages:
- No need for access to ISP-proprietary data
- Identifies actual data-plane paths
- Monitors data-plane performance

Challenges:
- Limited resources to probe
  - Coverage of probed paths
  - Timing granularity
- Measurement noise

System architecture

[Figure: system architecture with the components collaborative probing (of several target ISPs), event identification and classification, event correlation and inference, event impact analysis, and reports]

Outline
- Collaborative probing
- Event identification and classification
- Event correlation and inference
- Result and validation

Collaborative probing
- Using a set of hosts
  - To learn the routing state
  - To improve coverage
  - To reduce overhead

[Figure: probing hosts measuring through ISP A, AS B, AS C, and AS D]


Event classification
- Classify events according to ingress/egress changes
  - Type 1: ingress PoP changes
  - Type 2: ingress PoP same, egress PoP different
  - Type 3: ingress PoP same, egress PoP same

[Figure: a probing host reaching destination prefix P through the target ISP]
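The three event types can be written as a short decision rule. The Python sketch below is illustrative only: the `IspTraversal` record, the function name `classify_event`, and the PoP names are hypothetical, and it assumes the ingress and egress PoPs of each measured path are already known.

```python
from typing import NamedTuple


class IspTraversal(NamedTuple):
    """Where a measured path enters and leaves the target ISP."""
    ingress_pop: str
    egress_pop: str


def classify_event(old: IspTraversal, new: IspTraversal) -> int:
    """Map a path change to event type 1, 2, or 3 from the slide."""
    if old.ingress_pop != new.ingress_pop:
        return 1  # Type 1: ingress PoP changes
    if old.egress_pop != new.egress_pop:
        return 2  # Type 2: same ingress PoP, different egress PoP
    return 3      # Type 3: same ingress PoP, same egress PoP


# Hypothetical PoP names for illustration.
event_type = classify_event(
    IspTraversal("pop-chicago", "pop-nyc"),
    IspTraversal("pop-chicago", "pop-dc"),
)
```

In this example the ingress PoP is unchanged and the egress PoP differs, so the event is Type 2.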


Likely causes: link failures

[Figure: old path and new path from the probing host through the target ISP toward destination prefix P, leaving via the old egress PoP and the new egress PoP to the neighbor AS]

Likely causes: internal distance changes
- Hot potato changes
  - Cost of old internal path increases
  - Cost of new internal path decreases

[Figure: probing host, old egress PoP, new egress PoP, and neighbor AS, with internal distances labeled 120, 80, 100, and 120]
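Hot-potato routing sends traffic to the egress point with the smallest internal distance, so a change in internal cost can move the egress. The Python sketch below illustrates that rule under assumptions: the function name `pick_egress` is hypothetical, and the way the distances 80, 100, and 120 are assigned to the two egress PoPs is a guess for illustration, not something the slide states.

```python
from typing import Dict


def pick_egress(internal_distance: Dict[str, int]) -> str:
    """Hot-potato rule: choose the egress PoP with the smallest internal distance."""
    return min(internal_distance, key=internal_distance.get)


# Assumed assignment of distances to egress PoPs (illustrative only).
before = {"old egress PoP": 80, "new egress PoP": 100}
# The cost of the old internal path increases from 80 to 120.
after = {"old egress PoP": 120, "new egress PoP": 100}

egress_before = pick_egress(before)
egress_after = pick_egress(after)
```

With these assumed values the chosen egress moves from the old egress PoP to the new one once the old internal path becomes more expensive.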

Event correlation
- Spatial correlation: a single network failure often affects multiple routers
- Temporal correlation: routing events occurring close together are likely due to only a few causes
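Temporal correlation can be approximated by grouping events whose timestamps fall close together. The following Python sketch is one simple way to do that, not the method from the talk: the function name `cluster_by_time`, the 60-second window, and the sample timestamps are all assumptions.

```python
from typing import List


def cluster_by_time(timestamps: List[float], window: float) -> List[List[float]]:
    """Group timestamps so consecutive events in a cluster are at most `window` apart."""
    clusters: List[List[float]] = []
    for t in sorted(timestamps):
        if clusters and t - clusters[-1][-1] <= window:
            # Close enough to the previous event: same cluster.
            clusters[-1].append(t)
        else:
            # Gap is too large: start a new cluster.
            clusters.append([t])
    return clusters


# Hypothetical event times in seconds.
clusters = cluster_by_time([0, 30, 50, 600, 620], window=60)
```

Events in the same cluster would then be candidates for sharing a cause, which the spatial information (which routers and links the paths have in common) could narrow further.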

Inference methodology
- Evidence: an event that supports the cause

[Figure: probing hosts, the target ISP, destination prefix P, Link L, the new path, and the new egress. Cause: Link L is down]

Inference methodology
- A conflict: a measurement trace that conflicts with the cause

[Figure: probing hosts, the target ISP, destination prefix P, Link L, the new path, and the new egress. Cause: Link L is down]

Inference methodology

Evidence graph:
- Evidence node [1,2,3]->[1,2,4]
- Evidence node [0,2,3]->[0,2,4]
- Cause: link 2-3 down
- Cause: node 3 withdraws the route

[Figure: AS 0, AS 1, AS 2, AS 3, and AS 4, with a withdrawal]

Inference methodology

Conflict graph:
- Conflict node [1,2,3,6]
- Conflict node [0,2,3,6]
- Conflict node [0,2,3]
- Cause: link 2-3 down
- Cause: node 3 withdraws the route

[Figure: AS 0, AS 1, AS 2, AS 3, and AS 6]

Inference methodology
- Greedy algorithm: find the minimum set of causes that can explain all the evidence while minimizing conflicts

Evidence graph: evidence nodes [1,2,3]->[1,2,4] and [0,2,3]->[0,2,4]
Conflict graph: conflict nodes [1,2,3,6], [0,2,3,6], and [0,2,3]
Cause scores shown: (evidence: 2, conflicts: 3) and (evidence: 2, conflicts: 0)
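A greedy selection over the two bipartite graphs can be sketched in Python as follows. This is a sketch under stated assumptions rather than the authors' algorithm: the scoring rule (most newly explained evidence, ties broken by fewest conflicts) is one plausible reading of the slide, the function name `greedy_causes` is hypothetical, and attaching the three conflict nodes to "link 2-3 down" is inferred from the fact that all three traces still traverse link 2-3.

```python
from typing import Dict, List, Set


def greedy_causes(
    evidence: Dict[str, Set[str]], conflicts: Dict[str, Set[str]]
) -> List[str]:
    """Pick causes until all evidence is explained, preferring few conflicts."""
    unexplained: Set[str] = set().union(*evidence.values())
    chosen: List[str] = []
    while unexplained:
        def score(cause: str):
            # More newly explained evidence first, then fewer conflicts.
            return (len(evidence[cause] & unexplained),
                    -len(conflicts.get(cause, set())))

        best = max(evidence, key=score)
        if not evidence[best] & unexplained:
            break  # No remaining cause explains anything new.
        chosen.append(best)
        unexplained -= evidence[best]
    return chosen


events = {"[1,2,3]->[1,2,4]", "[0,2,3]->[0,2,4]"}
evidence_graph = {
    "link 2-3 down": set(events),
    "node 3 withdraws the route": set(events),
}
# Assumed edges: traces that still cross link 2-3 conflict with "link 2-3 down".
conflict_graph = {
    "link 2-3 down": {"[1,2,3,6]", "[0,2,3,6]", "[0,2,3]"},
    "node 3 withdraws the route": set(),
}
selected = greedy_causes(evidence_graph, conflict_graph)
```

Both causes explain the same two events, so under this scoring the tie is broken by conflicts and the withdrawal at node 3 is selected, matching the (evidence: 2, conflicts: 0) score on the slide.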


ISPs studied

AS Name | ASN (Tier) | Periods | # of Src | # of PoPs | # of Probes | Probe Gap
AT&T | | 3/23-4/9 | 230 | 111 | 61453 | 18.3 min
Verio | | 4/10-4/22, 9/13-9/22 | 218 | 46 | 81024 | 19.3 min
Deutsche Telekom | | 4/23-5/22 | 149 | 64 | 27958 | 17.5 min
Savvis | | 5/23-6/24 | 178 | 39 | 40989 | 17.4 min
Abilene | | 9/23-9/30, 2/3-2/17 | 113 | 11 | 51037 | 18.4 min

Results of event classification
- Many events are internal changes
- Abilene has many ingress changes

Target AS | Total events (% all traces) | Diff egress | Same ingress, egress: internal PoP path | Same ingress, egress: external AS path | Diff ingress
AT&T | 0.35% | 12.1% | 51% | 35% | 11%
Verio | 0.31% | 27.3% | 48% | 19% | 9.8%
Deutsche Telekom | 0.66% | 4.9% | 8.5% | 80.7% | 7.2%
Savvis | 0.35% | 11% | 45% | 31% | 14%
Abilene | 0.24% | 13.6% | 37% | 40% | 17%

Validation with BGP-based approach [Wu05]
- Hot potato changes: egress point changes due to internal distance changes

Hot potato changes | BGP-based | Our method | Both (false negatives, false positives)
Tier-1 AS | 147 | 185 | 101 (31%, 45%)
Abilene network | 79 | 88 | 60 (24%, 31%)

BGP-based: number of incidents identified by the BGP method
Our method: number of incidents identified by our method
Both: number of incidents identified by both
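The percentages in the hot potato table are consistent with treating the BGP-based result as ground truth: false negatives are BGP-identified incidents that the end-system method missed, and false positives are incidents reported only by the end-system method. The Python sketch below shows that arithmetic; the formulas are inferred from the numbers in the table and the function name `miss_rates` is hypothetical.

```python
from typing import Tuple


def miss_rates(bgp_count: int, ours_count: int, both_count: int) -> Tuple[float, float]:
    """Return (false_negative_rate, false_positive_rate) as fractions."""
    # Incidents found by the BGP method but not by the end-system method.
    false_negative = (bgp_count - both_count) / bgp_count
    # Incidents reported by the end-system method but not by the BGP method.
    false_positive = (ours_count - both_count) / ours_count
    return false_negative, false_positive


# Hot potato changes, Tier-1 AS row: 147 BGP-based, 185 ours, 101 both.
tier1_fn, tier1_fp = miss_rates(147, 185, 101)
# Hot potato changes, Abilene row: 79 BGP-based, 88 ours, 60 both.
abilene_fn, abilene_fp = miss_rates(79, 88, 60)
```

For the Tier-1 row this gives about 31.3% and 45.4%, and for Abilene about 24.1% and 31.8%, which agree with the table's (31%, 45%) and (24%, 31%) if the slide truncates to whole percentages.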

Validation with BGP-based approach
- Session resets: peering link up/down
- Reasons for inaccuracy:
  - Limited coverage
  - Coarse-grained probing
  - Measurement noise

Session reset | BGP-based | Our method | Both (false negatives, false positives)
Tier-1 AS | 9 | 15 | 6 (33%, 50%)
Abilene network | 7 | 11 | 7 (0%, 36%)

System performance
- Can keep up with generated routing state
- Applicable to real-time diagnosis and mitigation
  - Reactive: construct alternate paths to bypass the problem
  - Proactive: avoid paths with many historical routing disruptions

Conclusion
- Developed the first system to diagnose routing disruptions purely from end systems
- Used a simple greedy algorithm on two bipartite graphs to infer causes
- Comprehensively validated the accuracy

Thank you!

Questions?

Performance impact analysis
- End-to-end latency changes caused by different types of routing events

Validation with BGP data
- BGP feeds from RouteView, RIPE, Abilene, and 29 BGP feeds from a Tier-1 ISP
- The destination prefix coverage and the routing event detection rate

Target AS | Dst. prefix coverage | Dst. prefix traversing PoPs with BGP feeds | Detected events (AS change, next hop change) | Missed events (short-duration, filtering, other)
AT&T | 15% | 1.5% | 11% (10.3%, 3.2%) | 89% (75%, 13%, 1%)
Verio | 18.6% | 18.1% | 23% (19.1%, 8.6%) | 77% (73%, 4%, 0%)
Savvis | 7.8% | 1.1% | 6% (5.8%, 0.5%) | 94% (80%, 9%, 5%)
Abilene | 6% | 6% | 21% (17.3%, 5.8%) | 79% (61%, 15%, 3%)

Event classification: same ingress PoP, different egress PoP
- Policy changes
  - Local preference of the old route decreases
  - Local preference of the new route increases

[Figure: old path and new path from the probing host through the target ISP to the old egress PoP and the new egress PoP facing the neighbor AS; Local Pref labels: 100->50 and 60->110]

Event classification: same ingress PoP, different egress PoP
- External routing changes
  - Old route worsens due to external factors (withdrawal, longer AS path)
  - New route improves due to external factors

[Figure: old path and new path from the probing host through the target ISP to the old egress PoP and the new egress PoP; labels: AS A, ABCD->ABEFD, BCEFD->BEFD, AS B]

Event classification: same ingress PoP, same egress PoP
- Internal PoP path changes
  - Cost of old internal path increases
  - Cost of new internal path decreases
- External AS path changes

[Figure: old path and new path from the probing host through the target ISP to destination prefix P]

Results of cause inference
- Effectiveness of inference algorithm
- Cluster: a group of events with the same root cause

Event identification
- A routing event: path changes
- Event identification: comparing continuous routing snapshots

[Figure: a probing host reaching Dst through ISP A, AS B, AS C, and AS D]
