CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing

Upload: cecily-boone

Post on 17-Jan-2018


Network Failures and Their Impacts; Fast IGP Routing Convergence and IP Fast Re-Routing (IP FRR)

• Network Failures and Intra-Domain Routing (IGP)
  – ISP Link Failure Studies
  – Failure Characteristics, Causes and Impacts
• IGP Fast Routing Convergence
  – Speed up routing convergence after routing changes
• IP Fast Re-Routing (IP FRR)
  – Fast Rerouting Schemes: Failure Insensitive Routing
  – Other Schemes

Readings: Please do the required readings

Why Networks Fail

• Many, many possible reasons and causes
• Human errors
  – misconfigurations
  – other mistakes: e.g., "let’s see what that red button does"
• Software bugs
  – buggy implementation, incompatibility, …
• Hardware failures
  – flaky interfaces, link errors, fiber cuts, router crashes due to CPU overload or running out of memory, …
• Malicious attacks
• Network overload
  – traffic surges causing network congestion, …
• Others: e.g., natural disasters, major accidents
  – e.g., Baltimore tunnel fire, Ohio train accident, …

Understanding Network Link Failure Characteristics

• Failure characteristics within an ISP network
  – How often do links/routers fail?
  – How many? Are they random or correlated?
  – How long do they last?
  – …
• What about inter-domain or Internet-wide?
  – What causes BGP to update/withdraw routes?
  – Destination network down? AS-internal failures? BGP session resets? Policy changes? …
• How do we measure, detect and analyze network failures?
• How do we troubleshoot network failures and perform root-cause analysis?
• How do we design more robust and resilient mechanisms?

This Lecture: Focusing on Impact of Failures within an ISP Network

• IP networks are becoming the dominant, “converged” information delivery substrate, displacing telephone networks, and eventually cable TV?
  – Need better “service availability”
  – Telephone networks: service availability metric is “five 9’s”, i.e., 99.999%
  – What about IP networks?
• Effect of IP network failures:
  – routers lose “reachability”, i.e., no forwarding entries
  – or transient/permanent forwarding loops exist
• What are the impacts of network failures?
  – In particular, on VoIP services
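
To make the “five 9’s” target concrete, the short sketch below converts an availability fraction into allowed downtime per year. It is only an arithmetic illustration; the function name and the 365-day year are assumptions, not part of the lecture.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: ignore leap years

def downtime_minutes_per_year(availability):
    """Allowed downtime per year for a given availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR

for label, a in [("three 9's", 0.999), ("four 9's", 0.9999), ("five 9's", 0.99999)]:
    print(f"{label}: {downtime_minutes_per_year(a):.2f} minutes/year")
```

Under these assumptions, 99.999% availability leaves a little over five minutes of downtime per year, which is why multi-second routing convergence events matter.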

Failures Affect Link Loads

• Many ISP networks are “over-provisioned” so as to handle network failures:
  – Many claim normal load utilization < 50%
• But link utilization is still highly variable:
  – Can find a link with load > 50% every 15 minutes; > 90% every 8 days

Traffic Potholes or Blackholes

Sprint Measurement Study Anecdote:
• Average delay over 5-second intervals
• Traffic was blackholed for more than 10 minutes
• It took about 40 minutes for the network to reach a stable state
• Root cause: route misconfigurations!

Routing Loops under Failures

• Loops due to link failures/new route advertisements
• Measurements from 3 backbone links
  – 25% of packets caught in a loop in one failure instance
  – 1% lost due to expired TTL; those that escape have long delays

Sprint Link Failure Study

• Link “failures” occur fairly frequently, well spread over time
  – Inter-PoP links are more stable than intra-PoP links
  – Many intra-PoP link “failures” are due to planned events, with less impact on traffic due to the “full-mesh” intra-PoP topology
• Most link failures tend to be transient
  – Excluding “planned” failures
  – Most are single-link failures
  – Some are correlated link failures
• Link failure characteristics vary from link to link
  – Depending on the causes of failures, e.g., flaky interfaces, router overloads, fiber cuts, etc.
• Impact of link failures
  – OC48 link down for 6 seconds: 3 million packets may be lost!
  – Significant impact on applications such as VoIP and on-line gaming
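
The “3 million packets” figure can be sanity-checked with a back-of-the-envelope calculation. The sketch below assumes (these values are not from the study) 50% link utilization, a 300-byte mean packet size, and that every packet offered during the outage is lost; the OC-48 line rate of 2.48832 Gbit/s is the standard SONET figure.

```python
OC48_BPS = 2.48832e9  # OC-48 line rate in bits per second

def packets_lost(outage_s, utilization, mean_packet_bytes, line_rate_bps=OC48_BPS):
    """Packets offered to the link during the outage (all assumed lost)."""
    bits_offered = line_rate_bps * utilization * outage_s
    return bits_offered / (8 * mean_packet_bytes)

# Assumed parameters: 50% load, 300-byte average packets, 6-second outage.
lost = packets_lost(outage_s=6, utilization=0.5, mean_packet_bytes=300)
print(f"about {lost / 1e6:.1f} million packets")
```

With different assumed loads or packet sizes the count changes proportionally, but it stays in the millions for any realistic backbone load.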

Methodology: Integrated Monitoring

Sprint Measurement Study [I+02, M+04]
• Tier-1 ISP backbone (600+ nodes)
• Passive route listener software to collect IS-IS & BGP updates
• IPMON passive traffic monitoring & active probes
• SONET alarm logs; router configurations and BGP policies

IGP Failure Events

• IP link: adjacency between two IS-IS routers
• Link failure: loss of this adjacency
• Results shown in the following slides only include
  – US inter-PoP links (OC48)
  – Failures less than 24 hours long

Sprint Study: Link Failure Frequencies

Sprint Study: Duration of Failures

Sprint Study: Failures across Links

Scatter Plot of US Failure Events

• Apr. – Nov. 2002

Maintenance (or Planned Failures)

• Weekly schedule (Mondays 5am – 2pm UTC): 20% of failures

Examples of Planned Failures

• Upgrades
  – Changing link to higher capacity
  – Loading new operating system on a router
  – Swapping out an old interface card
• Maintenance
  – Fixing a flaky optical amplifier
  – Configuration changes that require a reboot
  – Responsible for 50% of intradomain failures
• Cable intrusions
  – Construction activities near a fiber

Failure Classification

Anomalies Found in Shaikh04 Paper

• Intermittent hardware problem
  – Router periodically losing OSPF adjacencies
  – Risk of network partition if a 2nd failure occurred
• External link flaps
  – Congestion on edge link causing lost messages
  – Lost adjacency leading to flapping routes
• Configuration errors
  – Two routers assigned the same IP address
  – Inefficient config leading to duplicate LSAs
• Vendor implementation bug
  – More frequent refreshing of LSAs than specified

Converging After a Failure

• Failure detection
  – Router recognizes that an incident link has failed
• Failure notification
  – Router informs other routers about the change
• Path re-computation
  – Routers compute new paths avoiding the link
• Forwarding-table update
  – Routers update their forwarding tables
  – Data traffic starts to flow over the new path
• AT&T, Sprint studies show
  – convergence time from 100s of milliseconds up to a few seconds

[Figure labels: Routing convergence; Forwarding convergence]

All Together: Looking Inside Router

[Figure: router internals. Labels: Route Processor (CPU) running the OSPF Process with Topology View and SPF Calculation; interface cards holding the FIB and doing Forwarding; Switching Fabric; LSA and LS Ack messages; data packets. Processing stages shown: LSA Processing, LSA Flooding, SPF Calculation, FIB Update, Forwarding.]

Bad Things Happen During Convergence

• Transient inconsistencies
  – “Transient forwarding loops” arise because routers have different views of the network
  – Forwarding decisions may be inconsistent
• Effects on data traffic
  – Black-hole: packet loss
  – Loops: packets going in circles
  – Delay: packets going on very long paths
  – Out-of-order: new packets arrive before old ones
• Want to minimize convergence delay
  – … and especially the effects on the data traffic

Example: Transient Forwarding Loop (or Micro-loop)

• Set of routers disagree
  – One router acting on old information
  – Another router acting on new information

[Figure: path from s to d with a forwarding loop between two routers]
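
The disagreement described above can be reproduced with a toy hop-by-hop forwarder. This is a minimal sketch, not a model of any real router: the node names `r1`, `r2`, `r3` and the FIB contents are hypothetical, and `r2` is assumed to have already rerouted around a failed link towards `d` while `r1` still holds its old entry.

```python
def forward(fibs, src, dst, ttl=8):
    """Follow per-router FIB entries hop by hop until delivery or drop."""
    node, trace = src, [src]
    while node != dst:
        if ttl == 0:
            return trace, "dropped: TTL expired"
        next_hop = fibs.get(node, {}).get(dst)
        if next_hop is None:
            return trace, "dropped: no route"
        node = next_hop
        ttl -= 1
        trace.append(node)
    return trace, "delivered"

# r2 has new information (sends back via r1); r1 still uses old information.
stale = {"s": {"d": "r1"}, "r1": {"d": "r2"}, "r2": {"d": "r1"}}
# After r1 also converges, it uses its alternate neighbor r3.
converged = {"s": {"d": "r1"}, "r1": {"d": "r3"}, "r2": {"d": "r1"}, "r3": {"d": "d"}}

print(forward(stale, "s", "d"))
print(forward(converged, "s", "d"))
```

In the stale case the packet bounces between `r1` and `r2` until its TTL runs out; once both routers hold consistent entries the same packet is delivered.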

Reducing Impact of Link Failures Assuming Traditional Link-State Protocols

• Improving convergence time of control/data plane
  – Reducing timer value for Hello messages
  – Can achieve sub-second convergence time
    • 200 msec is a common target (the threshold for VoIP quality), and it is doable!
  – However,
    • Still reacts to failure events; can’t prevent packet loops or losses during convergence
    • May amplify the effect of short “transient” failures that last less than a second
• Prevent “micro-loops” during transient routing convergence periods
  – One solution: using “ordered FIB updates”
  – Requires coordination among routers, adds complexity, delays convergence
• Dealing with “planned failures”?
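
A rough way to see why Hello timers dominate failure detection: a neighbor is declared down only after several consecutive Hellos are missed. The sketch below assumes a simple "interval times multiplier" model; the 10 s / 4x values are the commonly cited OSPF defaults, while the 50 ms interval is a hypothetical tuned value.

```python
def worst_case_detection_s(hello_interval_s, dead_multiplier):
    """Time until a silent neighbor is declared down (interval x multiplier model)."""
    return hello_interval_s * dead_multiplier

# Commonly cited OSPF defaults: 10 s Hello, dead interval of 4 Hellos.
print(worst_case_detection_s(10, 4), "seconds")
# Hypothetical aggressive tuning: 50 ms Hello, same multiplier.
print(worst_case_detection_s(0.05, 4), "seconds")
```

Detection is only the first of the convergence steps, so flooding, SPF computation and FIB update times still add to this figure, and very short timers make the network react to failures that would have healed on their own.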

Reducing Impact of Link Failures Using MPLS

• Can pre-compute back-up paths
  – Often done using the “link protection” scheme
  – For each link, there is an MPLS protection (back-up) path
• But
  – Need to change the “forwarding plane” of routers
  – Many networks don’t have MPLS deployed

Question: Can we perform fast rerouting using “traditional” link-state routing protocols without resorting to MPLS?

Fast Re-Routing using Link State Protocols [Nelakuditi et al]

• Motivations
  – Most common link failures are transient single-link failures
  – Reacting hastily to such failures by LSA flooding may do more harm than good, causing network instability!
  – Suppress such a failure unless it lasts longer than a threshold
  – But we want to be able to re-route affected packets along a back-up path, not simply drop them!
• FIFR (Failure Inferencing based Fast Rerouting): nearly 100% forwarding continuity
  – prepare for (instead of react to) failures
  – adapt to changes while ensuring stability
  – Other advantages:
    • no change to forwarding plane
    • minimal change to routing plane

What is Interface Specific Forwarding?

• Interface-independent forwarding
  – destination → next-hop
  – Each line card has a copy of the same FIB
• Interface-specific forwarding
  – <incoming interface, destination> → next-hop
  – Different forwarding entries at each line card
• Forwarding operation remains the same
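
The difference between the two schemes is only in which table a line card holds, not in how a lookup is done. The sketch below illustrates this with plain dictionaries; the interface names and table entries are hypothetical values chosen for illustration.

```python
# Interface-independent: every line card holds a copy of the same FIB.
shared_fib = {"C": "C", "D": "D", "E": "B", "F": "B"}

# Interface-specific: each line card holds its own destination -> next-hop table.
line_card_fib = {
    "from_B": {"C": "C", "D": "D", "E": "C", "F": "D"},
    "from_C": {"B": "B", "D": "D", "E": "B", "F": "B"},
}

def forward(fib, destination):
    """The per-packet operation is the same single lookup in both schemes."""
    return fib[destination]

print(forward(shared_fib, "E"))               # same answer on every interface
print(forward(line_card_fib["from_B"], "E"))  # answer depends on arrival interface
```

Because the incoming interface selects the table once per line card, no extra per-packet work is needed compared with conventional forwarding.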

ISF Enables Local Rerouting

• Infer failures based on interface and destination
  – Find the farthest keylink whose failure would cause a packet to arrive at the unusual interface along the reverse shortest path to the destination
• Precompute interface-specific forwarding tables
  – Avoid the keylink in choosing next hop for a destination
• Failure Inferencing based Fast Rerouting
  – IP fast reroute without explicit routing/tunneling

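Illustration: No Failure Scenario

[Figure: example topology with nodes A to F and the routing tables of routers A and B]

Routing table at router A (destination → next hop): B → B, C → C, D → D, E → B, F → B
Routing table at router B (destination → next hop): A → A, C → A, D → A, E → E, F → E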

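Illustration: Local Rerouting without ISF

Link B – E fails!

Routing table at router A (destination → next hop): B → B, C → C, D → D, E → B, F → B
New routing table at router B after detecting the failure (destination → next hop): A → A, C → A, D → A, E → A, F → A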

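Illustration: Local Rerouting with ISF

Routing table at router A (destination → next hop): B → B, C → C, D → D, E → B, F → B
New routing table at router B (destination → next hop): A → A, C → A, D → A, E → A, F → A
Interface-specific table at router A for packets arriving from B (destination → next hop): B → –, C → C, D → D, E → C, F → D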

ISF Table Computation

• Infer failed links from a packet’s arrival at an interface
  – $K^d_{j \to i}$: keylink whose failure causes a packet to d to arrive at i from j
  – A link u -> v is a candidate keylink if
    • with u->v, j is a next hop from i to d
    • without u->v, edge j->i is along the shortest path from u to d
  – $K^d_{j \to i}$ is the farthest one from i among candidate keylinks
• Avoid the keylink in choosing the destination’s next hop
  – $F^d_{j \to i}$: next hops to d from i when a packet arrives at i from j
  – $F^d_{j \to i} = R^d_i(E \setminus K^d_{j \to i})$, where $R^d_i(X)$ denotes the next hops from i to d computed over edge set X and E is the set of all links
• Failure inferencing is not done per packet
  – ISF table entries are computed upon link state updates

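Illustration: ISF Table Computation

Keylinks at router A for packets arriving from B ($K^d_{B \to A}$):
  d = C: {}    d = D: {}    d = E: {B-E}    d = F: {E-F}

Interface-specific forwarding tables at router A (destination → next hop):
  Arriving from B: B → –, C → C, D → D, E → C, F → D
  Arriving from C: B → B, C → –, D → D, E → B, F → B
  Arriving from D: B → B, C → C, D → –, E → B, F → B

When no more than one link failure is suppressed in a network with symmetric weights, FIFR always forwards successfully to a destination if a path to it exists.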
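
The keylink rule can be sketched in a few lines of Python. This is a simplified illustration, not the authors' algorithm: it assumes symmetric weights and unique shortest paths, only tests links on i's own shortest path to d as candidates, and uses made-up link weights on a six-node topology chosen so that the results match the table for packets arriving at A from B.

```python
import heapq

def shortest_path(graph, src, dst, removed=None):
    """Dijkstra from src to dst, ignoring the undirected link `removed`."""
    dist, parent, heap = {src: 0}, {}, [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in graph[u].items():
            if removed is not None and {u, v} == set(removed):
                continue
            if d + w < dist.get(v, float("inf")):
                dist[v], parent[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(parent[path[-1]])
    return path[::-1]

def isf_entry(graph, i, j, d):
    """(next_hop, keylink) at router i for packets to d arriving from j (i != d)."""
    usual = shortest_path(graph, i, d)
    if usual is None:
        return None, None
    if usual[1] != j:            # arrival from j is normal: nothing to infer
        return usual[1], None
    keylink = None
    for u, v in zip(usual, usual[1:]):   # walk away from i along its path to d
        detour = shortest_path(graph, u, d, removed=(u, v))
        if detour and (j, i) in list(zip(detour, detour[1:])):
            keylink = (u, v)             # later matches are farther from i
    if keylink is None:
        return usual[1], None
    backup = shortest_path(graph, i, d, removed=keylink)
    return (backup[1] if backup else None), keylink

# Hypothetical weights; only the node names come from the illustration.
G = {
    "A": {"B": 2, "C": 2, "D": 2}, "B": {"A": 2, "E": 2},
    "C": {"A": 2, "E": 3},         "D": {"A": 2, "F": 6},
    "E": {"B": 2, "C": 3, "F": 2}, "F": {"E": 2, "D": 6},
}
for dest in "CDEF":
    print(dest, isf_entry(G, "A", "B", dest))
```

Choosing the farthest candidate is what lets A send F-bound packets to D: whichever of B-E or E-F actually failed, the path via D avoids both.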

Operations under FIFR

Event | Adjacent nodes | Other nodes
Packet arrival | Interface-specific forwarding | Interface-specific forwarding
Link down | Initiate local rerouting | –
Link up before suppression interval | Resume forwarding on the recovered link | –
Link down beyond suppression interval | Link state update | Recompute interface-specific forwarding tables
Link up after suppression interval | Link state update | Recompute interface-specific forwarding tables
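
The table can be read as a small decision rule for a node adjacent to the failed link. The sketch below is one possible reading under stated assumptions: the function name, the time units and the action strings are illustrative, and `up_at=None` stands for a link that never recovers.

```python
def adjacent_node_actions(down_at, up_at, suppression_interval):
    """Actions taken by a node adjacent to a link for one outage."""
    actions = ["initiate local rerouting"]
    if up_at is not None and up_at - down_at < suppression_interval:
        # Short transient failure: never advertised to the rest of the network.
        actions.append("resume forwarding on the recovered link")
        return actions
    # Outage outlasted the suppression interval: now it is advertised.
    actions.append("send link state update (link down)")
    if up_at is not None:
        actions.append("send link state update (link up)")
    return actions

print(adjacent_node_actions(down_at=0.0, up_at=0.4, suppression_interval=1.0))
print(adjacent_node_actions(down_at=0.0, up_at=30.0, suppression_interval=1.0))
```

Other nodes only ever see the link state updates, and recompute their interface-specific forwarding tables when one arrives.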

Handling both Link and Node Failures

• Infer keynodes instead of keylinks
  – A node u is a candidate keynode if
    • with u, j is a next hop from i to d
    • without u, edge j->i is along the shortest path from the upstream node of u (w.r.t. the path from i to u) to d
  – Keynode is the farthest one from i among candidates
• When there is no route to the destination without a node
  – Node adjacent to the failure assumes link failure
  – Non-adjacent nodes treat it as adjacent node failure
  – May cause loops when the destination is indeed not reachable
• Protects against non-partitioning single failures

Networks with Asymmetric Link Weights

• FIFR can handle asymmetric link weights
  – By forcing packets to take the reverse shortest path
  – Provided links are bidirectional
• Keynode computation based on rSPF
  – A node u is a candidate keynode if
    • with u, j is a next hop from i to d
    • without u, edge i->j is along the shortest path from d to the upstream node of u (w.r.t. the path from i to u)
  – Keynode is the farthest one from i among candidates
  – Works with both symmetric and asymmetric weights

Networks with Broadcast Links

• FIFR is applicable to networks with broadcast links
  – A broadcast link is modeled with point-to-point links from/to the designated router
• Adjacent failures
  – Broadcast link failure treated as that of the designated router
• Non-adjacent failures
  – Not necessary to know the previous hop of a packet to compute the interface-specific keynode per destination
  – Failure inferencing can be done as before

Summary of FIFR

• Fast reroute under any single failure
  – Without changing/encapsulating the IP datagram
• May cause loops under multiple failures
  – With ISF, guaranteed protection against single failures or loop-freedom under multiple failures, but not both
  – Blacklist-based Interface Specific Forwarding
• Needs interface-specific forwarding
  – Two forwarding entries per destination
  – O(|E|log2|V|) to compute forwarding entries

Other Approaches

See the optional reading [GRY07] for more details.

• Loop-free Alternate (LFA): fast re-routing only when the direct link to the (default) next-hop fails
  – simpler computation: we know exactly which link to remove when computing the new next-hop; but protection is limited
  – using IP tunnels, etc.
• U-turn: allow protection over multiple hops
• Using “Not-Via” addresses
• Multi-topology routing
  – routers and links (with possibly different link weights) belong to multiple topologies
    • E.g., a default topology, plus “back-up” topologies with various (assumed) links removed (or new link weights)
  – packets are “marked” with a “topology id” for look-up
• IETF Fast Rerouting and MT-Routing Working Groups
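
For the loop-free alternate idea, the usual test is that a neighbor N of source S is a safe alternate towards D if dist(N, D) < dist(N, S) + dist(S, D), i.e. N's own shortest path to D does not come back through S. The sketch below applies that inequality to two small hypothetical topologies; the node names and weights are assumptions for illustration.

```python
import heapq

def dijkstra(graph, src):
    """Shortest-path distances from src to every reachable node."""
    dist, heap = {src: 0}, [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in graph[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

def loop_free_alternates(graph, s, d):
    """Neighbors of s (other than the primary next hop) that cannot loop back via s."""
    dist = {n: dijkstra(graph, n) for n in graph}
    primary = min(graph[s], key=lambda n: graph[s][n] + dist[n][d])
    return [n for n in graph[s]
            if n != primary and dist[n][d] < dist[n][s] + dist[s][d]]

# Hypothetical topologies: S reaches D via A; B is the other neighbor.
protected = {"S": {"A": 1, "B": 1}, "A": {"S": 1, "D": 1},
             "B": {"S": 1, "D": 2}, "D": {"A": 1, "B": 2}}
unprotected = {"S": {"A": 1, "B": 1}, "A": {"S": 1, "D": 1},
               "B": {"S": 1, "D": 5}, "D": {"A": 1, "B": 5}}

print(loop_free_alternates(protected, "S", "D"))
print(loop_free_alternates(unprotected, "S", "D"))
```

In the second topology B's best path to D runs back through S, so S has no loop-free alternate, which is one way the "protection is limited" remark shows up in practice.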