The Impact of Internet Policy and Topology on Delayed Routing Convergence

Good Old Days

The Internet always was, and still is, BAD at reliability, availability & QoS.

For historical reasons QoS just was not there initially: the “Best Effort” principle.

E-mail & web surfing did not demand high standards.

Money Time

It CAN’T be tolerated any more.

The Internet is becoming a major factor in the economy.

E-commerce, VoIP, real-time video, etc.

QoS Battle

Efforts to bring QoS to the Internet are enormous, BUT:

A stable underlying infrastructure MUST exist for any application-level solution!

Bad News

The existing Internet backbone DOES NOT provide rapid restoration and rerouting.

There is NO effective inter-domain path fail-over!

Fail-over for a single failure takes milliseconds in the PSTN, minutes in the Internet.

It hurts!

The impact on performance is huge. While a path is being restored:

30 times more packet loss

4 times the end-to-end latency

Some fail-overs take 15 minutes; the average is 3 minutes.

What’s the Problem?

Slow Convergence during Fail-Over

Routing tables oscillate for a long period after a failure, seeking a consistent network view.

WHO is to blame? BGP, the currently used inter-domain routing protocol.

Nasty things about BGP

AS-path-based BGP solves the count-to-infinity problem of RIP, but exacerbates the number of routing table oscillations.

For unbounded-delay BGP, ALL possible paths may be explored after a single failure: O(N!).

And More…

Even assuming bounded delay, BGP convergence for a full-mesh topology without filters is O(N * T_DELAY).

N is the number of ASes, and there are 70000 of them in the Internet.

T_DELAY is about 30 sec (the recommendation is 30 sec +/- a short random jitter).
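To put the two bounds side by side, here is a back-of-the-envelope sketch that uses only the figures quoted on the slides (N = 70000, delay of 30 sec). The function names are illustrative, and the constant factors hidden by the O-notation are ignored, so the numbers show growth, not measured convergence times.

```python
import math

T_DELAY_SEC = 30   # per-round delay quoted on the slide
N_AS = 70000       # number of ASes quoted on the slide


def bounded_delay_bound_sec(n_as: int, t_delay: int = T_DELAY_SEC) -> int:
    """N * T_DELAY: full mesh, bounded delay, no filters (constants ignored)."""
    return n_as * t_delay


def unbounded_delay_paths(n_as: int) -> int:
    """N!: growth of the number of paths explorable with unbounded delay."""
    return math.factorial(n_as)


if __name__ == "__main__":
    seconds = bounded_delay_bound_sec(N_AS)
    print(f"N * T_DELAY = {seconds} sec (~{seconds / 86400:.1f} days)")
    # The factorial bound is already enormous for a tiny N.
    print(f"10! = {unbounded_delay_paths(10)}")
```

Even the linear bound works out to 2100000 sec for these figures, which is why the worst-case full-mesh analysis alone says little about observed fail-over times.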

And Even More…

It is possible for autonomous systems to define “unsafe” policies causing persistent route oscillation.

So What?!

All this stuff is interesting in theory but has little to do with reality.

Any-Way?

BGP4, as used in Internet routers, has bounded delay, provided by the MinRouteAdver timer, which delays the distribution of overly rapid updates.

So, O(N!) performance is irrelevant in the real Internet.

Who cares?

BGP divergence has never been observed in practice and remains a theoretical problem.

There are modifications to BGP policies that guarantee convergence.

What a Mesh?!

The Internet topology is a long way from being a complete mesh.

BGP update filtering is done by almost every BGP node.

Now What?

Experimental results indicate fail-over problems caused by poor BGP performance.

To study and resolve those problems, much more realistic models of Internet BGP processes should be developed.

Drug “Providers?”

The Internet retains a hierarchy with several tiers of ISPs.

This hierarchy is specified by commercial relationships.

Smaller ISPs are customers of bigger ones.

Talk to me…

Transit: the upstream provider provides transit service to the customer.

  Default-free routing tables are passed downstream.

  Customer & backbone routes are passed upstream.

Peer: a symmetric connection providing access to each other's customers.

  Never used for transit to other ISPs.

  Only customer & backbone routes are exchanged.

Backup transit: normally acts like a peer; provides transit after fault detection.

It is strictly business

The filtering mechanisms of AS border routers are used to enforce those commercial relationships: if you don't want the other side to use some route, you should not announce it. So:

Send customer & backbone routes to all peers.

Provide other routing information (learned from peers & upstream) only to customers.
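The two export rules above can be written down as a small decision function. This is a minimal sketch, not router configuration: the `Rel` enum and `should_export` are names invented here, and "own" stands for the ISP's backbone routes.

```python
from enum import Enum


class Rel(Enum):
    """Where a route came from, or who a neighbor is (illustrative names)."""
    OWN = "own"            # the ISP's own backbone routes
    CUSTOMER = "customer"
    PEER = "peer"
    UPSTREAM = "upstream"


def should_export(learned_from: Rel, send_to: Rel) -> bool:
    """Return True if a route learned from `learned_from` is announced to `send_to`."""
    if send_to is Rel.OWN:
        raise ValueError("send_to must be a neighbor: customer, peer or upstream")
    # Rule 1: customer & backbone routes go to everybody.
    if learned_from in (Rel.OWN, Rel.CUSTOMER):
        return True
    # Rule 2: routes learned from peers & upstream go only to customers.
    return send_to is Rel.CUSTOMER


if __name__ == "__main__":
    for src in Rel:
        for dst in (Rel.CUSTOMER, Rel.PEER, Rel.UPSTREAM):
            print(f"{src.value:>8} -> {dst.value:<8} {should_export(src, dst)}")
```

The effect is that a peer or upstream never sees a route that would let it use this ISP as transit toward another peer or upstream.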

[Figure: example AS hierarchy with nodes A to J arranged in Tier 1, Tier 2 and Tier 3. Legend: peer, transit and back-up links.]

No Free Lunches

Transit relations: inbound filters

Prefix filters limit customer announcements to the “legitimate” address space of the customer.

Used by 100% of ISPs.

The upstream provider is willing to transit routes for its customers only.

Friend to friend

Peer relations: outbound filters

Community filters are based on tagging routes to distinguish customer routes. Only updates for routes tagged as customer routes pass the filter.

Used by 73% of ISPs.
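As a rough illustration of a community filter, the sketch below tags routes with a community string and lets only tagged routes through to a peer. The tag value `64496:100`, the dictionary layout of a route and the function names are all assumptions made for this example.

```python
# Hypothetical community value meaning "learned from a customer".
CUSTOMER_COMMUNITY = "64496:100"


def passes_community_filter(route: dict) -> bool:
    """A route passes only if it carries the customer tag."""
    return CUSTOMER_COMMUNITY in route.get("communities", ())


def routes_for_peer(routes: list) -> list:
    """Outbound filter toward a peer: keep customer-tagged routes only."""
    return [r for r in routes if passes_community_filter(r)]


if __name__ == "__main__":
    table = [
        {"prefix": "192.0.2.0/24", "communities": ["64496:100"]},   # customer route
        {"prefix": "198.51.100.0/24", "communities": []},           # learned from a peer
    ]
    for route in routes_for_peer(table):
        print(route["prefix"])
```

The filter is only as good as the tagging: a route that is tagged incorrectly on input is exported without further checks.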

Don’t talk too much

Peer relations: outbound filters (cont.)

Prefix filters may also be used to distinguish customer routes.

Applying a prefix filter only (as 13% of ISPs do) may create an unintentional back-up transit path.

Check it

Peer relations: outbound filters (cont.)

ASPath regular expressions are used to explicitly permit route advertisements.

The combination of ASPath & prefix filters prevents creation of unintentional back-up transit paths.

Both ASPath & prefix filters are used by 13% of ISPs.
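The difference between a prefix-only filter and a combined ASPath + prefix filter can be sketched as follows. The customer AS number 64500, the peer AS number 64501, the prefix and all function names are hypothetical; the AS path is the one received from the neighbor, before the local AS is prepended.

```python
import ipaddress
import re

# Hypothetical customer address space and customer AS number.
CUSTOMER_PREFIXES = [ipaddress.ip_network("192.0.2.0/24")]
# Permit only paths that consist of the customer AS (possibly prepended).
CUSTOMER_ASPATH = re.compile(r"64500( 64500)*")


def prefix_ok(prefix: str) -> bool:
    """Prefix filter: the route must lie inside the customer's address space."""
    net = ipaddress.ip_network(prefix)
    return any(net.subnet_of(p) for p in CUSTOMER_PREFIXES)


def aspath_ok(as_path: list) -> bool:
    """ASPath filter: the route must have been learned directly from the customer."""
    return CUSTOMER_ASPATH.fullmatch(" ".join(str(a) for a in as_path)) is not None


def export_to_peer(prefix: str, as_path: list, check_aspath: bool = True) -> bool:
    if not prefix_ok(prefix):
        return False
    return aspath_ok(as_path) if check_aspath else True


if __name__ == "__main__":
    # The customer's prefix, but learned via another peer (AS 64501) after a failure.
    detour = ("192.0.2.0/24", [64501, 64500])
    print("prefix filter only:", export_to_peer(*detour, check_aspath=False))
    print("prefix + ASPath   :", export_to_peer(*detour, check_aspath=True))
```

With the prefix filter alone the detour route is re-announced, which is exactly how an unintentional back-up transit path appears; adding the ASPath check blocks it.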

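[Figure: the same AS hierarchy (nodes A to J, Tier 1 to Tier 3) with an unintentional back-up link added. Legend: peer, transit, back-up and unintentional back-up links. Announcements of the path “D-C” are marked on two links.]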

Example: in the absence of an ASPath check, the path “D-C”, learned by A after the A-D link failure, will be announced by A to B, providing an unintentional back-up path from C to B through A.

Trust Me…

Peer relations: inbound filters

Generally, ISPs just trust their peers to send only valid information.

Only “bogon” filters, which identify generally illegal (private, unallocated, etc.) addresses, are applied.

80% of ISPs use “bogon” filters.

20% of ISPs use none.
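A “bogon” check is simple to sketch with Python's standard `ipaddress` module. The list below is a deliberately short, illustrative sample (private and loopback IPv4 space only); real bogon lists are longer and change over time, and `is_bogon` is a name chosen for this example.

```python
import ipaddress

# Illustrative sample only; not a complete bogon list.
BOGON_PREFIXES = [
    ipaddress.ip_network(p)
    for p in (
        "10.0.0.0/8",       # private (RFC 1918)
        "172.16.0.0/12",    # private (RFC 1918)
        "192.168.0.0/16",   # private (RFC 1918)
        "127.0.0.0/8",      # loopback
    )
]


def is_bogon(prefix: str) -> bool:
    """True if the announced IPv4 prefix overlaps any listed bogon block."""
    net = ipaddress.ip_network(prefix)
    return any(net.overlaps(bogon) for bogon in BOGON_PREFIXES)


if __name__ == "__main__":
    for announced in ("10.1.0.0/16", "192.168.7.0/24", "8.8.8.0/24"):
        verdict = "reject" if is_bogon(announced) else "accept"
        print(f"{announced:<18} {verdict}")
```

Note what such a filter does not do: it says nothing about whether the peer is entitled to announce an otherwise valid prefix.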

Let Us Introduce…

The model of BGP convergence is a directed graph.

A node represents an AS.

The model is given for a fixed destination X.

The shortest path is chosen.

Arc e(u,v) exists iff u informs v about its best route to X (not vice versa).

The graph is not symmetric.

The topology of the graph differs for different destinations X.

Up And Down

Given X, a client connected to the network by a single arc to node A (the AS of X):

Link goes down: T_DOWN is the time elapsed until every node knows there is no path to X (the new stable state).

Connection re-established: T_UP is the time elapsed until all nodes have added a route to X to their tables.

What We Want to Hear

After the connection is established, a node learns its best path to X in a time that depends on its shortest path to X.

Proof by simple induction.

T_UP convergence is governed by d, the maximal shortest distance from X to any node: O(d * T_DELAY), where T_DELAY is T_WAIT + T_SEND.

T_DELAY may be of the same order as MinRouteAdver, especially if it is implemented on a per-peer (not peer + destination) basis.
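The quantity d can be computed with a breadth-first search over the announcement graph. The sketch below assumes the graph is given as a dictionary of arcs (u -> nodes that u informs) and uses a made-up five-node example; `t_up_bound` simply multiplies d by an assumed T_DELAY of 30 sec and ignores constant factors.

```python
from collections import deque


def max_shortest_distance(arcs: dict, source) -> int:
    """d: the largest shortest-path distance (in arcs) from `source` to any reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in arcs.get(u, ()):
            if v not in dist:          # first visit in BFS is the shortest distance
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


def t_up_bound(arcs: dict, source, t_delay: float = 30.0) -> float:
    """d * T_DELAY, the order of T_UP (constants ignored)."""
    return max_shortest_distance(arcs, source) * t_delay


if __name__ == "__main__":
    # Hypothetical announcement graph for destination X.
    arcs = {"X": ["A"], "A": ["B", "C"], "B": ["D"], "C": ["D"]}
    print("d =", max_shortest_distance(arcs, "X"))
    print("T_UP bound =", t_up_bound(arcs, "X"), "sec")
```

For this example d is 3 (X, A, B, D), so the bound is 90 sec with a 30 sec delay.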

And What We don’t…

After the A-X link goes down, multiple update messages are sent along the arcs.

Nodes will announce back-up paths for which a withdrawal has not yet been received.

Generally, updates propagate more slowly via long paths, because each router adds a delay:

  0 to 30 sec per update.

  Always ~30 sec after the initial update is received.

A simple path from X to A is covered by time T if every node in the path has received an update from the preceding node and re-sent the update to the next node before time T.

Long Down

Node U has no route to X at time T iff all simple paths from X to U are covered.

A simple path of length L is covered in O(L * T_DELAY) time.

T_DOWN convergence is governed by D, the length of the longest simple path from X to any node: O(D * T_DELAY).
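For small graphs D can be found by exhaustive depth-first search over simple paths. This is a brute-force sketch (exponential in the worst case, which is consistent with the longest-path problem being hard); the graph and the function name are made up for illustration.

```python
def longest_simple_path(arcs: dict, source) -> int:
    """D: the number of arcs on the longest simple path starting at `source`."""

    def dfs(u, visited: frozenset) -> int:
        best = 0
        for v in arcs.get(u, ()):
            if v not in visited:       # simple path: never revisit a node
                best = max(best, 1 + dfs(v, visited | {v}))
        return best

    return dfs(source, frozenset([source]))


if __name__ == "__main__":
    # Hypothetical graph: B and C also inform each other, creating detours.
    arcs = {"X": ["A"], "A": ["B", "C"], "B": ["C", "D"], "C": ["B", "D"]}
    d_long = longest_simple_path(arcs, "X")
    print("D =", d_long)
    print("T_DOWN bound =", d_long * 30, "sec")
```

Here the shortest route from X to D has 3 arcs, but the longest simple path (X, A, B, C, D) has 4, so withdrawal takes longer to settle than announcement.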

What Do You Want?

Minimize the network diameter to improve T_UP: increase connectivity!

Minimize the longest possible paths to improve T_DOWN: decrease connectivity!

NP-complete problem.

For a full mesh: diameter is 1, longest path is N.
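The full-mesh claim can be checked on a toy scale. The sketch below builds a full mesh and a chain of N nodes (both hypothetical topologies, measured from node 0, the AS of X) and computes the maximal shortest distance and the longest simple path for each. The longest path is counted in arcs, so a path through all N nodes has N - 1 arcs; adding the X-A arc gives N.

```python
from collections import deque


def full_mesh(n: int) -> dict:
    """Every node informs every other node."""
    return {u: [v for v in range(n) if v != u] for u in range(n)}


def chain(n: int) -> dict:
    """Node i informs only node i + 1."""
    return {u: [u + 1] for u in range(n - 1)}


def max_shortest_distance(arcs: dict, source) -> int:
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in arcs.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


def longest_simple_path(arcs: dict, source) -> int:
    def dfs(u, visited: frozenset) -> int:
        best = 0
        for v in arcs.get(u, ()):
            if v not in visited:
                best = max(best, 1 + dfs(v, visited | {v}))
        return best

    return dfs(source, frozenset([source]))


if __name__ == "__main__":
    n = 6
    for name, graph in (("full mesh", full_mesh(n)), ("chain", chain(n))):
        d = max_shortest_distance(graph, 0)
        big_d = longest_simple_path(graph, 0)
        print(f"{name:<9} d = {d}, D = {big_d}")
```

Going from the chain to the full mesh shrinks d from N - 1 to 1 but leaves D at N - 1: the extra connectivity helps T_UP and does nothing for T_DOWN.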

Welcome to the Reality

6 months of experimental studies.

Geographically and topologically diverse BGP sessions with > 20 ISPs.

Artificial BGP transitions (announcements & withdrawals) injected at > 10 providers.

A broad spectrum of other ISPs surveyed.

Real World Example

A Japanese ISP (ISP4) has BGP peer sessions with providers ISP1, ISP2 and ISP3 at Mae-West.

Withdraw route Ri from ISPi.

Observe the paths announced by ISP4 in every case.

[Figure: R1 fault. Steady-state topology of ISP1, ISP4 and ISP5.]

The only back-up path explored is ISP1 -> ISP5 -> ISP4.

The path was explored in 96% of events (92 sec on average).

No path was explored in 4% of events (32 sec on average).
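The two cases above combine into an overall average convergence time for the R1 fault. The sketch below is plain arithmetic on the slide's figures; the function name and the (percent, seconds) layout are choices made for this example.

```python
def weighted_average_sec(cases: list) -> float:
    """cases: list of (percent_of_events, average_seconds); percents must sum to 100."""
    total_percent = sum(p for p, _ in cases)
    if total_percent != 100:
        raise ValueError(f"percentages sum to {total_percent}, expected 100")
    return sum(p * s for p, s in cases) / 100


if __name__ == "__main__":
    r1_fault = [
        (96, 92),   # back-up path ISP1 -> ISP5 -> ISP4 explored
        (4, 32),    # no back-up path explored
    ]
    print(f"R1 fault, overall average: {weighted_average_sec(r1_fault):.1f} sec")
```

For the R1 fault this gives (96 * 92 + 4 * 32) / 100 = 89.6 sec, dominated by the events in which the back-up path is explored.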

[Figure: R2 fault. Steady-state topology of ISP2, ISP4, ISP5, ISP6, ISP10, ISP11, ISP12 and ISP13, with a “vagabond” path marked.]

No path was explored in 7% of events (54 sec on average).

Only ISP2-ISP5-ISP4 was explored in 63% of events (79 sec on average).

ISP2-ISP5-ISP4 & ISP2-ISP5-ISP6-ISP4 were explored in 7% of events (88 sec on average).

11 more unique paths appeared in 45 distinct sequences of announcements.

Most of them are “vagabond” back-up paths resulting from router configuration errors.

It Was an Easy One…

Withdrawal of R3 from ISP3 causes exploration of a fairly complex topology.

More than 20 distinct paths were announced.

Almost 150 different combinations of announcements.

Much longer convergence times (~140 sec).

Only 35% of those paths are “legitimate”; the rest are “vagabond” unintentional back-up paths.

Do not Interfere!

The selection & order of back-up paths depend on the interaction of MinRouteAdver timers on routers.

MinRouteAdver is usually implemented on a per-peer (not peer + address) basis, so earlier instability interferes.

For example: in the ISP1 case, in 4% of events the initial delay at ISP4 was longer than the delay needed to propagate the back-up path.
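Why a per-peer timer lets unrelated instability interfere can be seen in a toy model. The class below is a strong simplification invented for this example (no jitter, no withdrawals, a fixed 30 sec interval); it only tracks when the next advertisement to a peer may leave, keyed either by peer or by (peer, prefix).

```python
class AdvertTimer:
    """Toy MinRouteAdver-style rate limiter (illustrative, not a BGP implementation)."""

    def __init__(self, interval: float = 30.0, per_destination: bool = False):
        self.interval = interval
        self.per_destination = per_destination
        self._last_sent = {}   # key -> time of the last advertisement

    def send_time(self, now: float, peer: str, prefix: str) -> float:
        """Return the time at which an update ready at `now` is actually sent."""
        key = (peer, prefix) if self.per_destination else peer
        last = self._last_sent.get(key)
        earliest = now if last is None else max(now, last + self.interval)
        self._last_sent[key] = earliest
        return earliest


if __name__ == "__main__":
    for per_destination in (False, True):
        timer = AdvertTimer(per_destination=per_destination)
        first = timer.send_time(0.0, "peerA", "P1")    # unrelated earlier update
        second = timer.send_time(5.0, "peerA", "P2")   # the update we care about
        label = "peer + prefix" if per_destination else "per peer"
        print(f"{label:<13} P1 sent at {first}, P2 sent at {second}")
```

With the per-peer key, the update for P2 waits until 30 sec because of the earlier update for P1; with the (peer, prefix) key it leaves at 5 sec.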

LA to SF via Haifa

Vagabond paths were found in the majority of 200 monitored ISP pairs.

They usually persist for a short period (several days).

Those erroneous paths do not conform to any intended or published policy.

A single error may have global impact, mainly because of the lack of inbound filters on peer connections.

Vagabond paths may impact performance and need to be detected automatically!

You call It line?

The average convergence delay clearly corresponds to the length of the longest back-up path.

Back-up paths are determined by policy and topology.

The data contain significant variability, but they suggest a linear relationship.

But Some are more equal

Topology depends on the ISP tier.

Smaller ISPs typically purchase transit from multiple upstream providers.

Smaller ISPs implement back-up transit policies that are unnecessary in large ISPs.

Longest legitimate path: 9 ASes for Tier 1, 12 ASes for Tier 2.

This way

Supported by the provided example:

ISP1 is a large tier-1 backbone provider.

ISP2 is a moderate-sized US-based tier-2 provider.

ISP3 is a regional tier-3 network.

Tier-1 & tier-2 topology is much simpler, and their customers are much less impacted by fail-over problems.

Now You See

The Internet lacks the level of reliability required by its future role.

Route fail-over complexity scales linearly with the longest back-up path for the route.

The back-up path length depends on the number of contractual relationships & on policy implementation.

Advice is for free

For Customers: if you do mission-critical stuff, connect to large providers.

For Small ISPs: limit the number of transit & backup transit connections.

For All ASes: avoid vagabond paths.

Better route validation & authentication mechanisms are needed.

Any Proposals??

Adaptive MinRouteAdver timers?

Inclusion of additional information in BGP withdrawal messages?

Other?
