NANOG 68: Decoding Performance Data from Large-Scale Internet Outages Mohit Lad CEO, ThousandEyes

Upload: thousandeyes

Post on 16-Apr-2017


TRANSCRIPT

Page 1: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

NANOG 68: Decoding Performance Data from Large-Scale Internet Outages Mohit Lad CEO, ThousandEyes

Page 2: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

1

•  Active Probing
–  From 100+ geographic locations
–  Application Layer: HTTP/DNS availability, load times
–  Network layer end-to-end: latency, loss, jitter
–  Network layer hops: L3 hop-by-hop measurements on latency, loss and capturing connectivity and changes

•  Routing
–  BGP data from RouteViews, RIPE collectors

The Dataset

Page 3: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

2

~ 170 affected interfaces / hour

Internet Outages Happen All the Time

•  ~ 1.6k prefixes affected / hour


Page 4: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

3

Let’s Start with 3 Events from 2016

•  DNS Root DDoS: June 26, 2016
•  Submarine Cable Fault: May 17, 2016
•  AWS Route Leak: April 22, 2016

Page 5: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

4

DNS Root DDoS

https://en.wikipedia.org/wiki/Root_name_server#/media/File:Ams-ix.k.root-servers.net.jpg

Page 6: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

5

DNS Root Server DDoS

•  June 25th 2016 14:45-17:50 PDT (21:45-00:50 UTC)

•  All DNS roots affected

•  TCP SYN and ICMP flood

•  10M packets/sec, 17Gbps per root

The Attack

http://root-servers.org/news/events-of-20160625.txt
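As a back-of-the-envelope check on the figures above, dividing the reported bit rate by the reported packet rate gives the implied average packet size. This is only arithmetic on the two numbers on the slide. It assumes both describe the same traffic over the same interval, which the slide does not state, and the function name is made up for illustration.

```python
# Implied average packet size from the per-root figures quoted on the slide.
# Assumption: both numbers describe the same traffic over the same interval.
PACKETS_PER_SEC = 10_000_000      # 10M packets/sec
BITS_PER_SEC = 17_000_000_000     # 17 Gbps


def avg_packet_bytes(bits_per_sec: float, packets_per_sec: float) -> float:
    """Return the mean packet size in bytes implied by a bit rate and a packet rate."""
    return bits_per_sec / packets_per_sec / 8


if __name__ == "__main__":
    size = avg_packet_bytes(BITS_PER_SEC, PACKETS_PER_SEC)
    print(f"{size:.1f} bytes per packet on average")
```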

Page 7: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

6

With Real World Impact

•  50% loss in availability

•  3.7ms → 13ms response time

•  Based on 273 measurements every 5 mins

Impacts

Page 8: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

7

The Varying Impact on Root Servers

Page 9: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

8

Corroborated from RIPE Atlas DNSMON Data

https://atlas.ripe.net/dnsmon/

Page 10: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

9

Impact Not As Widespread as Dec 2015 DDoS

https://atlas.ripe.net/dnsmon/

Page 11: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

10

H Root Server – Two Anycast Sites

Page 12: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

11

B Root Server – One Anycast Site

Page 13: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

12

J Root Server – 113 Anycast Sites

Page 14: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

13

L Root Server – 154 Anycast Sites

Page 15: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

14

The Varying Impact on Root Servers

Anycast made a difference but capacity at each site was also an important factor

Page 16: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

15

With Real World Impact

Page 17: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

16

•  A-Root makes some BGP changes to deal with the attack

Operators Mitigate

Page 18: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

17

Lots of BGP changes with other Roots as well

Page 19: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

18

Packet loss observed at Upstream as well

Page 20: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

19

1.  Availability (DNS root query) below 90%
2.  Resolution time (DNS root query) >1σ
3.  Multiple roots impacted
4.  Multiple anycast POPs impacted
5.  Multiple upstreams impacted

Many of these indicators correlate to metrics in DDoS attacks on other targets and services as well.

DDoS Indicators To Look Out For
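The five indicators could be evaluated mechanically along the following lines. This is a minimal sketch, not the tooling used in the talk: the `RootSample` schema, the function name and the sample values are all hypothetical, and ">1σ" is read here as "more than one standard deviation above the historical mean", which is an assumption.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class RootSample:
    """One aggregated measurement interval for one root server (hypothetical schema)."""
    root: str                  # root letter, e.g. "A"
    availability: float        # fraction of successful DNS queries, 0.0 to 1.0
    resolution_ms: float       # mean resolution time in this interval
    pops_with_loss: int        # anycast POPs showing packet loss
    upstreams_with_loss: int   # upstream networks showing packet loss


def ddos_indicators(samples, baseline_ms):
    """Evaluate the five indicators for one interval.

    samples: list of RootSample for the current interval.
    baseline_ms: dict mapping root letter to historical resolution times (ms).
    """
    low_avail = {s.root for s in samples if s.availability < 0.90}
    # Assumed reading of ">1 sigma": above historical mean plus one std deviation.
    slow = {
        s.root
        for s in samples
        if s.resolution_ms > mean(baseline_ms[s.root]) + pstdev(baseline_ms[s.root])
    }
    return {
        "availability_below_90pct": bool(low_avail),
        "resolution_above_1_sigma": bool(slow),
        "multiple_roots_impacted": len(low_avail | slow) > 1,
        # Simplification: counts are summed across roots, so a POP or upstream
        # shared by two roots would be counted twice.
        "multiple_pops_impacted": sum(s.pops_with_loss for s in samples) > 1,
        "multiple_upstreams_impacted": sum(s.upstreams_with_loss for s in samples) > 1,
    }


if __name__ == "__main__":
    # Illustrative values only, not measurements from the talk.
    baseline = {"A": [3.5, 3.7, 3.9], "B": [4.0, 4.2, 4.4]}
    interval = [
        RootSample("A", 0.50, 13.0, 3, 2),
        RootSample("B", 0.60, 11.0, 1, 1),
    ]
    flags = ddos_indicators(interval, baseline)
    print(flags, "-> possible DDoS" if all(flags.values()) else "-> inconclusive")
```

Requiring all five flags before raising an alert is a conservative choice made for this sketch; the slide does not say how the indicators should be combined.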

Page 21: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

20

SEA-ME-WE-4 Cable Fault

Page 22: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

21

Tata Backbone Under Normal Conditions

Page 23: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

22

•  May 17th 2016 06:10-08:30 PDT (13:10-15:30 UTC)
•  Performance degradation in Tata India to Europe backbone

Trouble in the Tata Backbone

Page 24: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

23

•  06:35-06:40 PDT (13:35-13:40 UTC)
•  TI Sparkle Mediterranean backbone sees complete loss

And Also in Telecom Italia Sparkle

Page 25: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

24

•  Netflix (AS2906) drops TI Sparkle (AS6762) and begins to route via Level3 (AS3356) instead

•  Traffic flows via Frankfurt rather than Paris (and Marseilles)

European Detour In Effect

Page 26: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

25

•  Multiple, geographically correlated backbone outages

•  Both share Mediterranean transit paths on SEA-ME-WE-3 and SEA-ME-WE-4

Commonalities between Tata and TIS

http://www.submarinecablemap.com/#/submarine-cable/seamewe-4

Page 27: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

26

SEA-ME-WE-4 Cable

•  Connects Europe to Middle East, South and SE Asia

•  4.6 Tbps

•  Has suffered more than a dozen major faults

Background

Page 28: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

27

•  06:45-08:10 PDT (13:45-15:10 UTC)
•  POPs affected in Europe and Americas
–  Palermo, Santiago, Milan, Catania, Buenos Aires, Frankfurt, Paris, Dallas, London, Miami, New York
–  Peering drop with Level 3 Paris

Issues Spread in the TI Sparkle Backbone

Page 29: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

28

•  TI Sparkle backbone connects Latin America to Europe via Miami and New Jersey

Why Would This Impact the Americas?

Page 30: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

29

•  06:45-07:15 PDT (13:45-14:15 UTC)
•  Reachability affected for thousands of prefixes due to TI Sparkle network
–  85 prefixes in the Netherlands (BIT, Akamai)
–  1479 prefixes in Argentina (Telecom Argentina)
–  95 prefixes in Greece (FORTHnet)
•  Ripple effect: impact beyond Europe/Asia

And BGP Sessions Begin to Fail

Page 31: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

30

•  Segment 4 (Cairo to Marseilles) faulty repeater acknowledged 2 days later

•  Likely cause between Palermo and Marseilles based on the data we’ve seen

Eventual Announcement as to Root Cause

http://www.gossamer-threads.com/lists/nsp/outages/58551?do=post_view_threaded

Page 32: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

31

1.  Many path traces impacted in adjacent POPs on the same network
2.  Jitter can be an even more convincing measure than loss
3.  Multiple networks impacted suggest a cable fault, IXP failure or peering failure
–  Cable fault: Elevated loss, elevated jitter
–  IXP failure: Elevated loss on many interfaces in the same POP
–  Peering: Terminal loss, path changes
4.  Dropped BGP sessions may occur when problems persist

Cable Fault Indicators To Look Out For
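The distinction drawn in item 3 between cable faults, IXP failures and peering failures can be expressed as a small decision function. The sketch below is only one possible encoding: the `Observation` fields, the order in which the rules are checked and the `pop_threshold` default are assumptions made for illustration, not values from the talk.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Summary of one incident window (hypothetical schema)."""
    networks_affected: int     # distinct networks with degraded paths
    elevated_loss: bool
    elevated_jitter: bool
    interfaces_same_pop: int   # lossy interfaces concentrated in a single POP
    terminal_loss: bool        # path traces end at the lossy hop
    path_changes: bool


def likely_cause(obs: Observation, pop_threshold: int = 5) -> str:
    """Map the symptoms listed on the slide to a candidate cause."""
    if obs.networks_affected < 2:
        # The slide ties all three causes to multiple networks being impacted.
        return "single-network issue"
    if obs.terminal_loss and obs.path_changes:
        return "peering failure"
    if obs.elevated_loss and obs.interfaces_same_pop >= pop_threshold:
        return "IXP failure"
    if obs.elevated_loss and obs.elevated_jitter:
        return "cable fault"
    return "undetermined"


if __name__ == "__main__":
    # Illustrative input: loss and jitter across two networks, no single-POP cluster.
    print(likely_cause(Observation(2, True, True, 0, False, False)))
```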

Page 33: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

32

AWS Route Leak

https://upload.wikimedia.org/wikipedia/commons/6/68/Zurich_in_night1.jpg

Page 34: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

33

•  54.239.16.0/20 prefix, Amazon.com (AWS US East)

•  Peering with the expected providers: NTT, TI Sparkle, Telia, CenturyLink, HE

AWS Routes on a Normal Day

http://bgp.he.net/AS16509#_asinfo

Page 35: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

34

•  From Portland OR to AWS US East, traffic normally transits HE Chicago and peers with AWS

Traffic to AWS

Page 36: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

35

Route Propagation from AWS

[Diagram: AS 16509 Amazon, AS 6939 Hurricane Electric, AS 46841 Fork Networking, AS 200759 Innofield, AS 65021 (private); labels: Border Router, Traffic Path]

•  Amazon advertises prefix 54.239.16.0/20
•  Fork Networking receives route advertisements to Amazon via Hurricane Electric

Page 37: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

36

•  Traffic goes all the way to Zurich in HE
•  And stops there
•  Along with a lot of AWS traffic! Also causing outages in Level 3 and Cogent POPs at the same time.

Why Is Our AWS Traffic in Switzerland?

Page 38: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

37

•  10:10-12:10 PDT (17:10-19:10 UTC) two new prefixes are advertised
–  54.239.16.0/21
–  54.239.24.0/21
•  Advertised by AS200759 (Innofield AG)
•  Origin AS65021 (private)
•  Via AS6939 (HE)

Route Leak !
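The reason two /21 advertisements can pull traffic away from a legitimate /20 is longest-prefix matching: a router prefers the most specific route that contains the destination. The sketch below illustrates this with Python's standard `ipaddress` module and the prefixes from the slide. The `best_route` helper and the sample destination addresses are hypothetical, and real BGP route selection involves more than prefix length.

```python
import ipaddress

LEGIT = ipaddress.ip_network("54.239.16.0/20")      # advertised by Amazon
LEAKED = [
    ipaddress.ip_network("54.239.16.0/21"),         # leaked more-specifics
    ipaddress.ip_network("54.239.24.0/21"),
]


def best_route(dest, routes):
    """Longest-prefix match: return the most specific route containing dest, or None."""
    addr = ipaddress.ip_address(dest)
    matches = [r for r in routes if addr in r]
    return max(matches, key=lambda r: r.prefixlen) if matches else None


if __name__ == "__main__":
    # Together the two /21s cover the whole /20, so no address falls back to it.
    print(list(LEGIT.subnets(new_prefix=21)) == LEAKED)
    table = [LEGIT] + LEAKED
    for dest in ("54.239.20.1", "54.239.30.1"):     # arbitrary addresses in the /20
        print(dest, "->", best_route(dest, table))
```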


Page 39: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

38

A Route Leak in the Wild

[Diagram: AS 16509 Amazon, AS 6939 Hurricane Electric, AS 46841 Fork Networking, AS 200759 Innofield, AS 65021 (private); label: Traffic Path]

•  Innofield leaks routes for more specific /21 prefixes, directing traffic to private AS 65021
•  Hurricane Electric accepts routes and now directs Amazon-destined traffic to Innofield

Page 40: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

39

•  Prefixes leaked in SwissIX and onto HE

•  Route optimizer is likely cause

•  Similar cause to the July 2015 incident with Enzu in Los Angeles

Survey Says… Route Optimizer

http://www.bgpmon.net/large-hijack-affects-reachability-of-high-traffic-destinations/

Page 41: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

40

•  New prefix or new destination ASN
•  Major BGP route changes – significant change in new path
•  Involves ASNs that may be in geos far from destination ASN
•  High packet loss at one of the ASNs in the path or ASNs with a common next hop ASN

Route Leak Indicators To Look Out For
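The first indicator (a new prefix or a new destination ASN) lends itself to an automated check against a table of expected origins. The sketch below covers only that indicator and adds a private-ASN check that is not on the slide but matches what was seen in this event. The function name, the `known_origins` table and the AS path ordering (nearest AS first, origin last) are assumptions for illustration.

```python
import ipaddress

# 16-bit private-use ASN range (64512 to 65534 inclusive, per RFC 6996).
PRIVATE_ASNS = range(64512, 65535)


def route_leak_flags(prefix, as_path, known_origins):
    """Flag a BGP announcement against a table of expected origins.

    prefix: announced prefix, e.g. "54.239.16.0/21".
    as_path: list of ASNs, nearest first and origin last (assumed ordering).
    known_origins: dict mapping expected prefix (str) to its origin ASN.
    """
    net = ipaddress.ip_network(prefix)
    known = {ipaddress.ip_network(p): asn for p, asn in known_origins.items()}
    # Known prefixes that contain the announcement (an exact match counts).
    covering = [k for k in known if net.subnet_of(k)]
    expected = known[max(covering, key=lambda k: k.prefixlen)] if covering else None
    return {
        "new_prefix": net not in known,
        "more_specific_of_known": net not in known and bool(covering),
        "new_origin_asn": expected is not None and as_path[-1] != expected,
        "private_asn_in_path": any(asn in PRIVATE_ASNS for asn in as_path),
    }


if __name__ == "__main__":
    expected_table = {"54.239.16.0/20": 16509}
    # Path as described on the earlier slide: via HE, advertised by Innofield,
    # origin in private AS 65021.
    print(route_leak_flags("54.239.16.0/21", [6939, 200759, 65021], expected_table))
```

The remaining indicators (geographic distance of the ASNs involved, packet loss along the path) need measurement data and are outside this sketch.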

Page 42: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

41

Summarizing the 3 Events from 2016

•  DNS Root DDoS: June 26, 2016
•  Submarine Cable Fault: May 17, 2016
•  AWS Route Leak: April 22, 2016

Page 43: NANOG 68: Decoding Performance Data from Large-Scale Internet Outages

Thank You @mohitlad

https://blog.thousandeyes.com/category/outage-reports/