Page 1: Detailed Analysis of ISIS Routing Protocol on the Qwest Backbonenewnog.com/meetings/nanog24/presentations/cengiz.pdf · 2008-09-02 · ISIS Basics Detection Link up/down or peer reachability

Packet Design 1

Detailed Analysis of ISIS Routing Protocol on the Qwest Backbone:
A recipe for subsecond ISIS convergence

Cengiz Alaettinoglu

Stephen Casner

Why subsecond convergence?

Increased network reliability

Support for multi-service traffic: Voice over IP, ATM over IP, TDM over IP, ...

Lower cost/complexity compared to layer 2 protection schemes like SONET

Where are we today?

Current IP re-route times are typically tens of seconds

We need to do better. There are two choices:

Figure out what’s wrong with IP routing and fix it

Replace IP routing with something else

Qwest Backbone ISIS Analysis

Collected multiple week-long ISIS packet traces

Identified problem areas:

causes of ISIS churn and instability

sequence of events and delays during routing convergence

Conclude with a recipe for achieving subsecond ISIS convergence

Monitoring Qwest ISIS Routing

Multi-vendor backbone: Ciscos, Junipers, ...

All point-to-point backbone: OC48, OC192, ...

[Figure: two collection hosts (Burbank, Houston), each attached to a backbone router (R) across the Qwest backbone. Labels: traceroute every 5s; ISIS HELLOs + capture; UDP stream (tg), one packet per 6 msec on average; sk gig-eth.]

Collection host is a passive peer; it sends no LSPs

Our Typical Path

6 hops

[Figure: path diagram with router labels R1 to R4 at Burbank, Los Angeles, Sunnyvale, and Houston.]

ISIS Basics

Detection:

Link up/down or peer reachability

Hardware detection is fast & preferred

Software detection using a HELLO protocol is slower but is a backup

Propagation:

Flood a Link State Packet (LSP)

Link propagation delays + per hop processing delay

Rate limiting may slow propagation

New Route Computation:

Run Dijkstra’s Shortest Path First (SPF) algorithm

CPU resource intensive

Rate limiting may delay SPF computation & consistency
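
The route computation step can be sketched in a few lines. This is a minimal illustration of Dijkstra's SPF, not a router implementation: the dict-of-dicts link-state database, the four-router topology, and its metrics are all hypothetical. It computes the cost and first hop to every destination, then recomputes after one link is removed, as a router would after receiving the LSPs that report a failure.

```python
import heapq


def spf(graph, source):
    """Dijkstra's shortest path first over a dict-of-dicts database.

    graph[u][v] is the metric of link u -> v (hypothetical layout).
    Returns (cost, first_hop) dicts keyed by destination.
    """
    cost = {source: 0}
    first_hop = {}
    heap = [(0, source, None)]  # (path cost, node, first hop on the path)
    done = set()
    while heap:
        d, u, hop = heapq.heappop(heap)
        if u in done:
            continue  # stale entry, a shorter path was already settled
        done.add(u)
        if hop is not None:
            first_hop[u] = hop
        for v, metric in graph.get(u, {}).items():
            nd = d + metric
            if v not in done and nd < cost.get(v, float("inf")):
                cost[v] = nd
                # A direct neighbor of the source is its own first hop.
                heapq.heappush(heap, (nd, v, v if u == source else hop))
    return cost, first_hop


# Hypothetical four-router topology with symmetric metrics.
topology = {
    "A": {"B": 10, "C": 10},
    "B": {"A": 10, "D": 10},
    "C": {"A": 10, "D": 30},
    "D": {"B": 10, "C": 30},
}
cost, first_hop = spf(topology, "A")

# Remove the B-D link in both directions and run a full SPF again.
failed = {u: {v: m for v, m in nbrs.items() if {u, v} != {"B", "D"}}
          for u, nbrs in topology.items()}
cost_after, first_hop_after = spf(failed, "A")
```

In this toy topology the route from A to D moves from first hop B to first hop C once the link is removed. Every such change triggers a full recomputation here, which is the CPU cost the rate limits are meant to contain.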

ISIS Churn and Instability

Churn: number of LSPs received over a time period

requires SPF calculations

consumes CPU resources

Busy CPU may cause HELLO packet misses

can falsely bring adjacencies down

increases churn

Instability:

churn => busy CPU => HELLO misses => more churn => ...

rate limits are for avoiding this instability

HELLO Packets

Excellent HELLO behavior

[Figure: interval between hello packets (msecs, 0 to 10000) versus time, Oct 06 to Oct 12.]

Churn

Very stable network:

Average rate for total churn: 1 LSP every 2 seconds

97.6% of the LSPs are state refreshes

[Figure, two panels: Total Churn and Physical Churn, each plotting number of LSPs per minute (0 to 100) from Oct 06 to Oct 12. A high churn region is marked.]

Churn by Routers & Links

About 800 LSPs per week per router are for refreshes

This can be configured to be less

[Figure, two panels: number of LSPs generated (0 to 4000) versus percentage of routers, with the problem routers marked; number of flaps (0 to 900) versus percentage of links, with the problem links marked.]
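
As a rough check on the refresh figure on this slide, the arithmetic below converts 800 refreshes per week into the refresh interval it implies. The second calculation uses a hypothetical longer interval of 65000 seconds, chosen here only as an illustration because it sits just under the 16-bit LSP lifetime limit of 65535 seconds; the slides do not say what value was or should be configured.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600  # 604800

# From the slide: about 800 refresh LSPs per router per week.
refreshes_per_week = 800
implied_refresh_interval_s = SECONDS_PER_WEEK / refreshes_per_week


def refreshes_per_week_at(interval_s):
    """Refresh LSPs one router would generate per week at this interval."""
    return SECONDS_PER_WEEK / interval_s


# Hypothetical longer refresh interval (an assumption, not from the slides).
longer_interval_s = 65000
refreshes_at_longer_interval = refreshes_per_week_at(longer_interval_s)
```

Under these assumptions the observed rate corresponds to a refresh roughly every 756 seconds (about 12.6 minutes), and the longer interval would cut refreshes to fewer than ten per router per week.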

One Atypical Router

This router is responsible for the initial high churn

[Figure: LSP sequence number (62000 to 66000) versus time, Oct 06 to Oct 12. Annotation: 110 times its usual churn.]

An Unstable Link

[Figure: link state (Up/Down) from 06:00 to 07:00, September 24, 2001.]

No protection against instability

Goes on for a day

Opposite of fast convergence requirement

30 seconds to go down, 8 seconds to go up

Dealing with Instability

SPF & LSP propagation rate limits don’t reduce churn

they do keep the CPU from melting by ignoring change

To reduce churn without impacting convergence:

Asymmetric up/down filters for fast convergence

detect bad news fast

slow down on good news

Adaptive filters

linear or exponential adaptation to level of instability

Less CPU intensive incremental SPF algorithms

An Example Adaptive Filter

An example exponential filter with 20 minute max penalty

It reduces the churn without hurting convergence

[Figure, two panels: link state (Up/Down) from 06:00:00 to 07:00:00, with two 20-minute spans marked.]
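
One way such a filter could work is sketched below. This is an assumption-laden illustration, not the filter used to produce the slide: the function name, the 1-second base hold, the doubling rule, and the flap pattern are all hypothetical, and the only value taken from the slide is the 20-minute maximum penalty. Down transitions are announced at once (bad news fast). Up transitions are held for a penalty that doubles with every flap (good news slow), and a held "up" is cancelled if the link fails again before the hold expires. A practical filter would also decay the penalty after a stable period, which is omitted here.

```python
def filter_events(events, base_hold=1.0, max_hold=1200.0):
    """Apply an asymmetric, exponentially adaptive up/down filter.

    events: time-sorted list of (time_s, "up" | "down") raw transitions.
    Returns the transitions that would actually be announced to ISIS.
    """
    announced = []
    hold = base_hold          # current delay applied to "up" news
    pending_up = None         # time at which a held "up" would be announced
    announced_state = "up"
    for t, state in events:
        # Release a held "up" whose hold time expired before this event.
        if pending_up is not None and pending_up <= t:
            announced.append((pending_up, "up"))
            announced_state = "up"
            pending_up = None
        if state == "down":
            pending_up = None  # the link flapped again, drop the held "up"
            if announced_state == "up":
                announced.append((t, "down"))  # bad news goes out at once
                announced_state = "down"
            hold = min(hold * 2, max_hold)     # exponential penalty, capped
        else:
            pending_up = t + hold
    if pending_up is not None:
        announced.append((pending_up, "up"))
    return announced


# Hypothetical link that fails every 60 s and recovers 10 s later, for an hour.
raw = []
for i in range(60):
    raw.append((60 * i, "down"))
    raw.append((60 * i + 10, "up"))
announced = filter_events(raw)
```

With this flap pattern the 120 raw transitions shrink to 12 announced ones: the first few flaps pass through, and once the penalty exceeds the 50-second gap between recovery and the next failure, the link simply stays announced as down until it has been up for the full hold time.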

Qwest ISIS Stability Summary

The backbone is extremely stable

3 out of the 4 week-long data collection periods have no route change on our path

the churn is caused by a few problem links

Convergence times:

Hard to find a link failure to diagnose

Convergence as fast as today’s technology allows

Can be improved to subsecond

ISIS Convergence Delay

Time from the physical change to new routing tables

Failure/repair detection

LSP propagation

Delay due to spf-interval

SPF computation

What Happens During Convergence?

Routers perform SPF while their views of the network are not consistent, causing:

routing loops

black hole routes

suboptimal routes

If convergence is fast, this is not an issue. But:

Convergence times are not fast

On high speed links, lots of packets are affected

New services are less tolerant

A Link Failure

[Figure: delay and packet loss from 07:05:40 to 07:06:25, October 10, 2001 (delay axis 0 to 0.02), alongside the path diagram (Burbank, Los Angeles, Sunnyvale, Houston). Annotations: 8.5 seconds, 1310 packets.]

Why so long?

Slow Link Failure Detection

[Figure: delay, loss, and LSP reception from 07:05:40 to 07:06:25, October 10, 2001, showing packet loss, delay, the Burbank LSP, and the Los Angeles LSP.]

20 seconds, HELLO protocol

3 seconds, link layer detection
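
The gap between the two detection times can be illustrated with a simple hold-timer model. The slides do not state the configured hello parameters, so the 10-second hello interval and multiplier of 3 below are assumptions, as is the function name. In this model a neighbor declares the adjacency down once no hello has arrived for interval × multiplier seconds, so the detection delay depends on how long before the failure the last hello was received.

```python
def hello_detection_delay(failure_time, hello_interval=10.0, multiplier=3):
    """Seconds between a silent failure and the adjacency timing out.

    Assumes hellos arrived at 0, hello_interval, 2 * hello_interval, ...
    and that the neighbor keeps the adjacency for
    hello_interval * multiplier seconds after the last hello it received.
    The default values are assumptions, not taken from the slides.
    """
    hold_time = hello_interval * multiplier
    last_hello = (failure_time // hello_interval) * hello_interval
    return last_hello + hold_time - failure_time


# Failure just before the next hello was due: shortest wait in this model.
best_case = hello_detection_delay(9.5)
# Failure just after a hello was received: longest wait in this model.
worst_case = hello_detection_delay(0.5)
```

Under these assumed settings, software detection takes between roughly 20 and 30 seconds, which is in line with the 20 seconds shown for the HELLO protocol here and far slower than the 3 seconds of link layer detection.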

LSP Propagation Delay

[Figure: LSP propagation times (msecs, 0 to 1000) from Oct 06 to Oct 12, measured as the difference of 2 clocks, alongside the path diagram.]

LSP propagations are rate limited

Their scheduling may be improved

SPF Rate Limiting

spf-interval parameter delays SPF computation

default: SPF computation after 5 seconds from the change

visible in our convergence delay example

Two goals:

to contain the CPU load

no more than one SPF computation per 5 seconds

to capture 2-4 LSPs reporting the failure in one SPF run

fails to do this in our case

Why Loss and Delay at Link Repair?

[Figure: delay and packet loss from 07:06:30 to 07:06:45, October 10, 2001 (delay axis ticks 0, 0.02, 0.19), alongside the path diagram. Annotations: 227 packets lost; 1.5s.]

Routing loop?

TTLs Confirm the Routing Loop

[Figure: delay and TTL from 07:06:30 to 07:06:45, October 10, 2001 (delay axis ticks 0, 0.02, 0.19). Phases marked: longer path, routing loop, convergence. Annotations: 1.6s, 1.5s; ttl=8, ttl=11, ttl=8; ttl=60,...,10.]

spf-interval Spreads SPFs

[Figure: delay, TTL, and LSP receptions from 07:06:30 to 07:06:45, October 10, 2001. Three 5-second timers are shown, ending in the Router1, Router2, and Router3 SPF runs. LSPs marked: Burbank's LSP, LA's LSP, and some other LSP.]

3 routers start timers after 3 different LSPs
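
The spreading effect can be reproduced with a simplified timer model. This is a sketch under stated assumptions, not vendor behavior: the function name is hypothetical, every arrival time is invented for illustration, and the model only says that the first LSP after an idle period starts a 5-second timer and that the SPF run at expiry covers whatever LSPs arrived in the meantime.

```python
def schedule_spf(lsp_arrivals_ms, spf_interval_ms=5000):
    """Return [(spf_time_ms, [LSP names in that run])] for one router.

    The first LSP received while no timer is running starts the timer.
    The SPF run at expiry processes every LSP received up to that point.
    """
    runs = []
    timer_expiry = None
    batch = []
    for t, name in sorted(lsp_arrivals_ms):
        if timer_expiry is not None and t > timer_expiry:
            runs.append((timer_expiry, batch))  # timer fired before this LSP
            timer_expiry, batch = None, []
        if timer_expiry is None:
            timer_expiry = t + spf_interval_ms
        batch.append(name)
    if batch:
        runs.append((timer_expiry, batch))
    return runs


# Hypothetical arrival times in milliseconds at three routers.
routers = {
    "Router1": [(0, "Burbank"), (1600, "Los Angeles")],
    "Router2": [(1600, "Los Angeles"), (1700, "Burbank")],
    "Router3": [(3100, "other"), (3200, "Burbank"), (8200, "Los Angeles")],
}
first_spf = {r: schedule_spf(a)[0][0] for r, a in routers.items()}
spread_ms = max(first_spf.values()) - min(first_spf.values())
router3_runs = schedule_spf(routers["Router3"])
```

In this made-up scenario the three routers update their routes 3.1 seconds apart, and Router3's timer, started by an unrelated LSP, expires before the second failure LSP arrives, so that LSP waits for a second SPF run. Both effects leave the routers with inconsistent views for seconds at a time.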

SPF Computation Times

[Figure: SPF computation times (microseconds, log scale 1 to 10000) versus percentage of SPF runs, for Dijkstra SPF and incremental SPF. Annotations: avg=13usec, avg=1069usec.]

Qwest topology and events using equal-cost multiple paths

On average 84 times faster

SPF Scaling

[Figure: SPF computation times (milliseconds, log scale 0.01 to 1000) versus number of routers (1000 to 5000), for Dijkstra SPF and incremental SPF, using random topology and events.]

Incremental SPF lets the network grow very large

Source: Haobo Yu.

Where does the time go?

Detection times were several seconds, must be improved

LSP Propagation times were subsecond, but still much larger than link propagation delays

SPF rate limiting:

spf-interval causes most of the convergence delay

spf-interval spreads SPFs, groups the wrong set of LSPs into the same SPF
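
The breakdown can be made concrete by adding up the four components. The sketch below is illustrative only. The first budget reuses numbers that appear in the slides (3 seconds of link layer detection, the 5-second default spf-interval, a 1069 usec average SPF run, taken here to be the Dijkstra figure) plus an assumed 0.5 seconds standing in for "subsecond" LSP propagation. The second budget uses hypothetical targets for a network that follows the recipe; only the 13 usec SPF time comes from the slides, and it is taken here to be the incremental SPF figure.

```python
def convergence_delay(detection_s, propagation_s, spf_wait_s, spf_compute_s):
    """Sum the four convergence delay components, all in seconds."""
    return detection_s + propagation_s + spf_wait_s + spf_compute_s


# Detection 3 s, propagation 0.5 s (assumed), spf-interval 5 s, SPF 1069 usec.
measured_like = convergence_delay(3.0, 0.5, 5.0, 1069e-6)

# Hypothetical targets: 50 ms detection, 100 ms propagation,
# no SPF rate limit, 13 usec SPF run.
recipe_like = convergence_delay(0.050, 0.100, 0.0, 13e-6)

# Share of the first budget spent waiting on spf-interval.
spf_wait_share = 5.0 / measured_like
```

With these inputs the first budget comes to about 8.5 seconds, close to the outage measured in the link failure example, with well over half of it spent waiting on spf-interval. The SPF computation itself is negligible in both budgets.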

A Recipe for Subsecond ISIS Convergence

Step 1. Vendors: fast link failure detection

Hardware detection is preferred

Vendors have a fast failure detection solution for MPLS fast reroute

It will benefit convergence immediately

Step 2. Vendors: adaptive and asymmetric Up/Down filters

It will reduce the ISIS churn w/o hurting convergence

Step 3. Operators: eliminate current LSP & SPF rate limits

Adaptive asymmetric filters make it safe

Step 4. Vendors: incremental SPF algorithm

A must for avoiding CPU meltdowns even as the network gets bigger

Acknowledgements

We would like to thank Chris Sieber, Paul Mabey and Shankar Rao of Qwest for providing access to their network for our study and for their review of our results.

We would also like to thank Haobo Yu, Van Jacobson, Kathleen Nichols, and Kedar Poduri of Packet Design for their contributions to this work.

What can you do?

If you want subsecond IGP convergence, ask your vendor to implement this recipe.