Detecting Peering Infrastructure Outages in the Wild

Vasileios Giotsas †∗, Christoph Dietzel † §, Georgios Smaragdakis ‡ †, Anja Feldmann †, Arthur Berger ¶ ‡, Emile Aben #

†TU Berlin ∗CAIDA §DE-CIX ‡MIT ¶Akamai #RIPE NCC

Peering infrastructures are a critical part of the interconnection ecosystem

Internet Exchange Points (IXPs) provide a shared switching fabric for layer-2 bilateral and multilateral peering.

○ Largest IXPs support > 100 K peerings and > 5 Tbps peak traffic

○ Typical SLA: 99.99% (~52 min. downtime/year) [1]

Carrier-neutral co-location facilities (CFs) provide infrastructure for physical co-location and cross-connect interconnections.

○ Largest facilities support > 170 K interconnections

○ Typical SLA: 99.999% (~5 min. downtime/year) [2]

[1] https://ams-ix.net/services-pricing/service-level-agreement
[2] http://www.telehouse.net/london-colocation/
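The downtime figures quoted for the two SLA levels follow from simple arithmetic. A minimal sketch, assuming a 365-day year; the function name is illustrative and not from the talk:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored (assumption)


def allowed_downtime_minutes(sla_percent: float) -> float:
    """Yearly downtime budget implied by an availability SLA, in minutes."""
    return (1 - sla_percent / 100) * MINUTES_PER_YEAR


for sla in (99.99, 99.999):
    print(f"{sla}% uptime allows about {allowed_downtime_minutes(sla):.2f} min/year")
```

Each additional "nine" divides the downtime budget by ten, which is why a single outage of a few minutes can already exhaust a 99.999% commitment.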

Outages in peering infrastructures can severely disrupt critical services and applications

Outage detection is crucial for improving situational awareness, risk assessment and transparency.

Current practice: “Is anyone else having issues?”

● ASes try to crowd-source the detection and localization of outages.

● Inadequate transparency/responsiveness from infrastructure operators.

Symbiotic and interdependent infrastructures

Source: https://www.franceix.net/en/technical/infrastructure/

Remote peering extends the reach of IXPs and CFs beyond their local market

Global footprint of AMS-IX: https://ams-ix.net/connect-to-ams-ix/peering-around-the-globe

Our Research Goals

1. Outage detection:

○ Automated, Timely, Building-level

2. Outage localization:

○ Distinguish cascading effects from outage source

3. Outage tracking:

○ Determine duration, shifts in routing paths, geographic spread

Challenges in detecting infrastructure outages

[Figure, built up over several slides: the actual incident compared with the paths observed from vantage points (VPs), before and during the outage. Annotations: "AS path does not change!", "IXP or Facility 2 failed", "IXP is still active", "No hop changes", "The initial hops changed".]

1. Capturing the infrastructure-level hops between ASes

2. Correlating the paths from multiple vantage points

3. Continuous monitoring of the routing system
[Figure: France-IX topology, with Djibouti Telecom and Telkom Indonesia as example networks, overlaid first with BGP measurements and then with traceroute measurements (hops 149.6.154.142 and 37.49.237.126).]

● IP-to-Facility [3,4] and IP-to-IXP [5] mapping is possible but expensive!

Can we combine continuous passive measurements with fine-grained topology discovery?

[3] Giotsas, Vasileios, et al. "Mapping peering interconnections to a facility", CoNEXT 2015
[4] Motamedi, Reza, et al. "On the Geography of X-Connects", Technical Report CIS-TR-2014-02, University of Oregon, 2014
[5] Nomikos, George, et al. "traIXroute: Detecting IXPs in traceroute paths", PAM 2016
Deciphering location metadata in BGP

PREFIX: 1.0.0.0/24
ASPATH: 2 1 0
COMMUNITY: 2:200

BGP Communities:

● Optional attribute

● Encodes arbitrary metadata

● Series of 32-bit numerical values

● Top 16 bits: ASN that sets the community.

● Bottom 16 bits: numerical value that encodes the actual meaning.
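The 16/16-bit split described above can be illustrated with a few bit operations. This is a minimal sketch; the function names are illustrative and do not come from any BGP library:

```python
def encode_community(asn: int, value: int) -> int:
    """Pack an (ASN, value) pair into one 32-bit community."""
    if not (0 <= asn <= 0xFFFF and 0 <= value <= 0xFFFF):
        raise ValueError("both halves must fit in 16 bits")
    return (asn << 16) | value


def decode_community(community: int) -> tuple[int, int]:
    """Split a 32-bit community into (top 16 bits = ASN, bottom 16 bits = value)."""
    if not 0 <= community <= 0xFFFFFFFF:
        raise ValueError("community must fit in 32 bits")
    return community >> 16, community & 0xFFFF


def format_community(community: int) -> str:
    """Render a community in the usual ASN:value notation."""
    asn, value = decode_community(community)
    return f"{asn}:{value}"


print(format_community(encode_community(2, 200)))  # 2:200
```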

The BGP Community 2:200 is used to tag routes received at Facility 2.

PREFIX: 3.3.3.3/24
ASPATH: 4 3
COMMUNITY: 4:8714 4:400

PREFIX: 2.2.2.2/24
ASPATH: 4 2
COMMUNITY: 4:8714 4:400

PREFIX: 1.0.0.0/24
ASPATH: 2 1 0
COMMUNITY: 2:200

Multiple communities can tag different types of ingress points.

When a route changes ingress point, the community values will be updated to reflect the change:

PREFIX: 3.3.3.3/24
ASPATH: 4 3
COMMUNITY: 4:400

PREFIX: 2.2.2.2/24
ASPATH: 4 2
COMMUNITY: 4:8714 4:400

PREFIX: 1.0.0.0/24
ASPATH: 2 1 0
COMMUNITY: 2:100
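The before/after routes on these slides can be compared mechanically. The sketch below assumes each route is reduced to a prefix and a set of community strings; the helper name `community_changes` is hypothetical:

```python
# Communities per prefix, copied from the example routes on the slides.
before = {
    "3.3.3.3/24": {"4:8714", "4:400"},
    "2.2.2.2/24": {"4:8714", "4:400"},
    "1.0.0.0/24": {"2:200"},
}
after = {
    "3.3.3.3/24": {"4:400"},
    "2.2.2.2/24": {"4:8714", "4:400"},
    "1.0.0.0/24": {"2:100"},
}


def community_changes(old, new):
    """Return {prefix: (removed, added)} for prefixes whose communities differ."""
    changes = {}
    for prefix in old.keys() & new.keys():
        removed = old[prefix] - new[prefix]
        added = new[prefix] - old[prefix]
        if removed or added:
            changes[prefix] = (removed, added)
    return changes


for prefix, (removed, added) in sorted(community_changes(before, after).items()):
    print(prefix, "removed:", sorted(removed), "added:", sorted(added))
```

Only 3.3.3.3/24 and 1.0.0.0/24 are reported; the route for 2.2.2.2/24 keeps the same communities and is ignored.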

Interpreting BGP Communities

● Community values are not standardized.

● Documentation is available in public data sources:

○ WHOIS, NOC websites

● 3,049 communities by 468 ASes

Topological coverage

● ~50% of IPv4 and ~30% of IPv6 paths are annotated with at least one Community in our dictionary.

● The dictionary covers 24% of the facilities in PeeringDB, and 98% of the facilities with at least 20 members.
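Path coverage of this kind can be computed by checking each path's communities against the dictionary. A minimal sketch with a toy dictionary: only the `2:200` entry is taken from the slides, while the `2:100` entry and the function names are assumptions for illustration:

```python
# Toy community dictionary. "2:200" -> Facility 2 comes from the slides;
# "2:100" -> Facility 1 is an assumed entry for illustration only.
DICTIONARY = {
    "2:200": "Facility 2",
    "2:100": "Facility 1",
}


def annotate(communities):
    """Return the locations of all communities found in the dictionary."""
    return [DICTIONARY[c] for c in communities if c in DICTIONARY]


def coverage(paths):
    """Fraction of paths carrying at least one community from the dictionary."""
    if not paths:
        return 0.0
    annotated = sum(1 for communities in paths if annotate(communities))
    return annotated / len(paths)


paths = [["2:200"], ["9:9"], ["2:100", "9:1"], []]
print(coverage(paths))  # 0.5
```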

Passive outage detection: Initialization

For each vantage point (VP), collect all the stable BGP routes tagged with the communities of the target facility (Facility 2).

AS_PATH: 1 x
COMM: 1:FAC2

AS_PATH: 2 1 0
COMM: 2:FAC2

AS_PATH: 4 x
COMM: 4:FAC2
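The initialization step could be sketched as follows. The slides do not define "stable", so the one-day threshold, the `Route` record and the function name are all assumptions:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    vp: str                 # vantage point that observed the route
    prefix: str
    as_path: tuple
    communities: frozenset
    stable_seconds: int     # time since the route last changed (assumed field)


def build_baseline(routes, facility_communities, min_stable_seconds=86400):
    """Map each VP to its stable routes tagged with the target facility."""
    baseline = {}
    for route in routes:
        is_stable = route.stable_seconds >= min_stable_seconds  # assumed criterion
        is_tagged = bool(route.communities & facility_communities)
        if is_stable and is_tagged:
            baseline.setdefault(route.vp, {})[route.prefix] = route
    return baseline


FAC2 = frozenset({"1:FAC2", "2:FAC2", "4:FAC2"})
routes = [
    Route("vp1", "1.0.0.0/24", (2, 1, 0), frozenset({"2:FAC2"}), 200000),
    Route("vp1", "2.2.2.2/24", (4, 2), frozenset({"4:FAC2"}), 60),      # too recent
    Route("vp2", "3.3.3.3/24", (4, 3), frozenset({"4:FAC4"}), 200000),  # other facility
]
baseline = build_baseline(routes, FAC2)
print({vp: sorted(prefixes) for vp, prefixes in baseline.items()})
```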

Passive outage detection: Monitoring

Track the BGP updates of the stable paths for changes in the community values that indicate an ingress point change.

AS_PATH: 2 1 0
COMM: 2:FAC1

We don’t care about AS-level path changes if the ingress-tagging communities remain the same.
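The monitoring rule compares only the ingress-tagging communities and ignores everything else in the update. A minimal sketch, where the set of ingress communities and the extra community `2:999` are hypothetical:

```python
def ingress_changed(old_communities, new_communities, ingress_communities):
    """True if the ingress-tagging communities on a route differ between updates."""
    old_ingress = set(old_communities) & ingress_communities
    new_ingress = set(new_communities) & ingress_communities
    return old_ingress != new_ingress


# Hypothetical ingress-tagging communities of AS 2.
INGRESS = {"2:FAC1", "2:FAC2"}

# A non-ingress community disappears but the ingress tag is unchanged: ignored.
print(ingress_changed({"2:FAC2", "2:999"}, {"2:FAC2"}, INGRESS))  # False
# The ingress tag moves from Facility 2 to Facility 1: reported.
print(ingress_changed({"2:FAC2"}, {"2:FAC1"}, INGRESS))  # True
```

The AS path is deliberately not an argument: an update that only changes the AS path leaves the result at False.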

Passive outage detection: Outage signal

AS_PATH: 2 1 0
COMM: 2:FAC1

AS_PATH: 1 x
COMM: 1:FAC1

AS_PATH: 4 x
COMM: 4:FAC4 4:IXP

● Concurrent changes of community values for the same facility.

● Indication of an outage, but not a final inference yet! Possible causes of the signal:

○ Partial outage?

○ De-peering of large ASes?

○ Major routing policy change?

Signal investigation:

● Targeted active measurements.

● How disjoint are the affected paths?

● How many ASes and links have been affected?
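One way to turn "concurrent changes" into a signal is to bin the change events in time and flag bins in which a large share of the baseline paths left the facility. This is only a sketch: the 60-second bin and the 50% threshold are assumed values, not parameters from the talk, and a flagged bin still requires the investigation steps listed above:

```python
from collections import defaultdict


def outage_signals(baseline_paths, change_events, bin_seconds=60, min_fraction=0.5):
    """Return start times of bins where >= min_fraction of baseline paths changed.

    baseline_paths: set of path identifiers tagged with the facility
    change_events: iterable of (timestamp, path_id) ingress-change events
    """
    if not baseline_paths:
        return []
    changed_per_bin = defaultdict(set)
    for timestamp, path_id in change_events:
        if path_id in baseline_paths:
            bin_start = timestamp - timestamp % bin_seconds
            changed_per_bin[bin_start].add(path_id)
    return sorted(
        bin_start
        for bin_start, paths in changed_per_bin.items()
        if len(paths) / len(baseline_paths) >= min_fraction
    )


baseline_paths = {"a", "b", "c", "d"}
events = [(5, "a"), (10, "b"), (20, "a"), (70, "z"), (130, "c")]
print(outage_signals(baseline_paths, events))  # [0]
```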

Passive outage detection: Outage tracking

AS_PATH: 1 x
COMM: 1:FAC2

AS_PATH: 2 1 0
COMM: 2:FAC2

End of outage inferred when the majority of paths return to the original facility.
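The end-of-outage rule can be sketched as a scan over periodic counts of returned paths. The input format and the function name are assumptions; "majority" is read here as strictly more than half of the baseline paths:

```python
def infer_outage_end(baseline_size, observations):
    """Return the first timestamp at which most baseline paths are back.

    observations: chronological (timestamp, paths_back) pairs, where paths_back
    counts baseline paths again tagged with the original facility.
    """
    for timestamp, paths_back in observations:
        if paths_back > baseline_size / 2:
            return timestamp
    return None  # outage still ongoing


observations = [(0, 0), (60, 3), (120, 5), (180, 6), (240, 10)]
print(infer_outage_end(10, observations))  # 180
```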

De-noising of BGP routing activity

[Figure: number of BGP messages (log scale) over time, for all messages and for messages filtered using communities, together with the fraction of infrastructure paths over time.]

● The aggregated activity of BGP messages (updates, withdrawals, states) provides no outage indication.

● The BGP activity filtered using communities provides a strong outage signal.

Outage localization is more complicated!

● The location of community values that trigger outage signals may not be the outage source!

● Communities encode the ingress point closest (near-end) to our VPs:

○ ASes may be interconnected over multiple intermediate infrastructures

○ Failures in intermediate infrastructures may affect the near-end infrastructure paths

● An outage in Facility 2 causes a drop in the paths of Facility 4!

● An outage in Facility 3 also causes a drop in the paths of Facility 4!

Outage source disambiguation and localization

● Create high-resolution co-location maps:

○ AS to Facilities, AS to IXPs, IXPs to Facilities

○ Sources: PeeringDB, DataCenterMap, operator websites

● Decorrelate the behaviour of affected ASes based on their infrastructure colocation.

[Figure panels over time: far-end ASes colocated in Facility 2; far-end ASes colocated in Facility 3.]

Paths are not investigated in an aggregated manner, but at the granularity of separate (AS, Facility) co-locations.

[Figure: paths over time for the London Telecity HE8/9 outage and the London Telehouse North outage.]
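Grouping by (AS, Facility) co-location can be sketched with a small aggregation. This illustrates the grouping idea only and is not the authors' localization algorithm; the input format, the AS numbers and the facility labels are hypothetical:

```python
from collections import defaultdict


def affected_fraction_by_colocation(paths, colocation):
    """Fraction of affected paths for each (far-end AS, facility) pair.

    paths: iterable of (far_end_as, affected) with affected a bool
    colocation: dict mapping an AS to the set of facilities where it is present
    """
    totals = defaultdict(int)
    affected_counts = defaultdict(int)
    for asn, affected in paths:
        for facility in colocation.get(asn, ()):
            totals[(asn, facility)] += 1
            affected_counts[(asn, facility)] += int(affected)
    return {key: affected_counts[key] / totals[key] for key in totals}


# Hypothetical co-location map and path observations.
colocation = {10: {"FAC2"}, 20: {"FAC3"}, 30: {"FAC2", "FAC3"}}
paths = [(10, True), (10, True), (20, False), (30, True)]
for key, fraction in sorted(affected_fraction_by_colocation(paths, colocation).items()):
    print(key, fraction)
```

In this toy input the AS present only in FAC3 is unaffected while every AS present in FAC2 is affected, which is the kind of contrast that aggregated path counts would hide.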

Detecting peering infrastructure outages in the wild

● 159 outages in 5 years of BGP data

○ 76% of the outages not reported in popular mailing lists/websites

● Validation through status reports, direct feedback, social media

○ 90% accuracy, 93% precision (for trackable PoPs)

Effect of outages on Service Level Agreements

~70% of failed facilities below 99.999% uptime

~50% of failed IXPs below 99.99% uptime

5% of failed infrastructures below 99.9% uptime!

Measuring the impact of outages

● > 56% of the affected links are in a different country, > 20% in a different continent!

● Median RTT rises by > 100 ms for rerouted paths during the AMS-IX outage.

[Figure: number of affected links (log scale) and CDF against distance from outage source (km, 0 to 12K); fraction of paths against RTT (ms).]

Conclusions

● Timely and accurate infrastructure-level outage detection through passive BGP monitoring

● Majority of outages not (widely) reported

● Remote peering and infrastructure interdependencies amplify the impact of local incidents

● Hard evidence on outages can improve accountability, transparency and resilience strategies

Thank you!