
Post on 28-Dec-2015


  • Week 7 Routing Protocols

  • Intra-AS Routing
    Also known as Interior Gateway Protocols (IGP)
    Most common Intra-AS routing protocols:

    RIP: Routing Information Protocol

    OSPF: Open Shortest Path First

    IGRP: Interior Gateway Routing Protocol (Cisco proprietary)

  • RIP (Routing Information Protocol)
    Distance vector algorithm
    Included in BSD-UNIX distribution in 1982
    Distance metric: number of hops (max = 15 hops)

  • RIP advertisements
    Distance vectors: exchanged among neighbors every 30 sec via Response Message (also called advertisement)
    Each advertisement: list of up to 25 destination nets within AS

  • RIP: Example
    [Figure: routers A, B, C, D and networks w, x, y, z]
    Routing table in D:
    Destination Network   Next Router   Num. of hops to dest.
    w                     A             2
    y                     B             2
    z                     B             7
    x                     --            1
    ...                   ...           ...

  • RIP: Example
    Advertisement from A to D:
    Dest   Next   Hops
    w      -      -
    x      -      -
    z      C      4
    ...    ...    ...
    Routing table in D after the advertisement: the entry for z changes from (next router B, 7 hops) to (next router A, 5 hops); the entries for w, y, and x are unchanged.

  • RIP: Link Failure and Recovery
    If no advertisement heard after 180 sec --> neighbor/link declared dead
    routes via neighbor invalidated
    new advertisements sent to neighbors
    neighbors in turn send out new advertisements (if tables changed)
    link failure info quickly propagates to entire net
    poison reverse used to prevent ping-pong loops (infinite distance = 16 hops)
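    The update a router performs on receiving an advertisement can be sketched roughly as follows. This is a minimal illustration, not RIP itself: the function name `merge_advertisement` and the table layout are hypothetical, and the hop counts of 1 for w and x in A's advertisement are an assumption (the slide shows them as "-").

```python
INFINITY = 16  # RIP treats a distance of 16 hops as unreachable

def merge_advertisement(table, neighbor, advert):
    """Update `table` in place from one neighbor's distance vector.

    table:  {destination: (next_router, hops)}
    advert: {destination: hops as seen by the neighbor}
    """
    for dest, hops in advert.items():
        cost = min(hops + 1, INFINITY)  # one extra hop to reach the neighbor
        current = table.get(dest)
        # Accept if the destination is new, the route is cheaper, or the
        # current route already goes through this neighbor (its view changed).
        if current is None or cost < current[1] or current[0] == neighbor:
            table[dest] = (neighbor, cost)
    return table

# Router D's table and A's advertisement from the example slides.
# Assumption: A reports 1 hop for its directly attached nets w and x.
table_d = {"w": ("A", 2), "y": ("B", 2), "z": ("B", 7), "x": (None, 1)}
advert_from_a = {"w": 1, "x": 1, "z": 4}
merge_advertisement(table_d, "A", advert_from_a)
print(table_d["z"])  # the route to z now goes via A
```

    Only the entry for z changes, because 4 + 1 hops via A beats the 7 hops via B.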

  • OSPF (Open Shortest Path First)
    "open": publicly available
    Uses Link State algorithm
    LS packet dissemination
    Topology map at each node
    Route computation using Dijkstra's algorithm
    OSPF advertisement carries one entry per neighbor router
    Advertisements disseminated to entire AS (via flooding)
    Carried in OSPF messages directly over IP (rather than TCP or UDP)
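    As a rough illustration of the route computation step, the sketch below runs Dijkstra's algorithm over a topology map. The four-router topology, its link costs, and the function name `shortest_paths` are made up for this example and do not come from the slides.

```python
import heapq

def shortest_paths(graph, source):
    """Return (distance, first_hop) dictionaries for every reachable node.

    graph: {node: {neighbor: link_cost}}
    """
    dist = {source: 0}
    first_hop = {}
    heap = [(0, source, None)]
    while heap:
        d, node, hop = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for neighbor, cost in graph[node].items():
            nd = d + cost
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                # The first hop is the neighbor itself when leaving the source.
                first_hop[neighbor] = neighbor if hop is None else hop
                heapq.heappush(heap, (nd, neighbor, first_hop[neighbor]))
    return dist, first_hop

# Hypothetical topology map as assembled from flooded link-state packets.
topology = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 5},
    "C": {"A": 4, "B": 2, "D": 1},
    "D": {"B": 5, "C": 1},
}
dist, first_hop = shortest_paths(topology, "A")
print(dist, first_hop)
```

    Each router would run the same computation with itself as the source, and the first-hop map is what ends up in its forwarding table.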

  • OSPF advanced features (not in RIP)
    Security: all OSPF messages authenticated (to prevent malicious intrusion)
    Multiple same-cost paths allowed (only one path in RIP)
    For each link, multiple cost metrics for different TOS (e.g., satellite link cost set low for best effort; high for real time)
    Integrated uni- and multicast support: Multicast OSPF (MOSPF) uses same topology database as OSPF
    Hierarchical OSPF in large domains

  • Hierarchical OSPF

  • Hierarchical OSPF
    Two-level hierarchy: local area, backbone.
    Link-state advertisements only within an area
    Each node has detailed area topology; it only knows the direction (shortest path) to nets in other areas.
    Area border routers: summarize distances to nets in own area, advertise to other area border routers.
    Backbone routers: run OSPF routing limited to backbone.
    Boundary routers: connect to other ASs.

  • Internet inter-AS routing: BGP
    BGP (Border Gateway Protocol): the de facto standard
    BGP provides each AS a means to:
    Obtain subnet reachability information from neighboring ASs.
    Propagate the reachability information to all routers internal to the AS.
    Determine "good" routes to subnets based on reachability information and policy.
    Allows a subnet to advertise its existence to the rest of the Internet: "I am here"

  • BGP basics
    Pairs of routers (BGP peers) exchange routing info over semi-permanent TCP connections: BGP sessions
    Note that BGP sessions do not correspond to physical links.
    When AS2 advertises a prefix to AS1, AS2 is promising it will forward any datagrams destined to that prefix towards the prefix.
    AS2 can aggregate prefixes in its advertisement

  • Distributing reachability info
    With eBGP session between 3a and 1c, AS3 sends prefix reachability info to AS1.
    1c can then use iBGP to distribute this new prefix reachability info to all routers in AS1
    1b can then re-advertise the new reachability info to AS2 over the 1b-to-2a eBGP session
    When a router learns about a new prefix, it creates an entry for the prefix in its forwarding table.

  • Outline
    1. Highlights
    2. Addressing and CIDR
    3. BGP Messages and Prefix Attributes
    4. BGP Decision and Filtering Processes
    5. I-BGP
    6. Route Reflectors
    7. Multihoming
    8. Aggregation
    9. Routing Instability
    10. BGP Table Growth

  • Internet Topology
    AS (Autonomous System) - a collection of routers under the same technical and administrative domain.
    EGP (External Gateway Protocol) - used between two ASs to allow them to exchange routing information so that traffic can be forwarded across AS borders. Example: BGP

  • Purpose: to share connectivity information
    [Figure: AS1 and AS2, with border routers and internal routers, exchanging routes via BGP]

  • BGP Sessions
    One router can participate in many BGP sessions.
    Initially: node advertises ALL routes it wants neighbor to know (could be >50K routes)
    Ongoing: only inform neighbor of changes
    [Figure: BGP sessions among AS1, AS2, and AS3]

  • Routing Protocols
    [Figure: labels E-BGP, A, AS2]

  • Configuration and Policy
    A BGP node has a notion of which routes to share with its neighbor. It may advertise only a portion of its routing table to a neighbor.
    A BGP node does not have to accept every route that it learns from its neighbor. It can selectively accept and reject messages.
    What to share with neighbors and what to accept from neighbors is determined by the routing policy, which is specified in a router's configuration file.

  • History
    Popularity: until the mid-1990s BGP was fairly unknown and used by a small number of large ISPs.
    In 1995 (beginning of web popularity) the number of organizations using BGP grew tremendously.
    Two reasons for growth of usage and interest:
    significant growth in number of ISPs
    organizations were born that had mission-critical dependence upon their connectivity
    CIDR (Classless Inter-Domain Routing) introduced in 1995

  • Assigning IP address and AS numbers (Ideally)
    A host gets its IP address from the IP address block of its organization
    An organization gets an IP address block from its ISP's address block
    An ISP gets its address block from its own provider OR from one of the 3 routing registries:
    ARIN: American Registry for Internet Numbers
    RIPE: Reseaux IP Europeens
    APNIC: Asia Pacific Network Information Center
    Each AS is assigned a 16-bit number (65536 total)
    Currently 10,000 ASs in use

  • Addressing Schemes
    Original addressing schemes (class-based): 32 bits divided into 2 parts
    Class A, Class B, Class C (Class C: ~2 million nets, 256 hosts each)
    CIDR introduced to solve 2 problems:
    exhaustion of IP address space
    size and growth rate of routing table

  • Problem #1: Lifetime of Address Space
    Example: an organization needs 500 addresses. A single class C address is not enough (256 hosts). Instead a class B address is allocated (~64K hosts). That's overkill - a huge waste.
    CIDR allows networks to be assigned on arbitrary bit boundaries.
    permits arbitrary sized masks: 178.24.14.0/23 is valid
    requires explicit masks to be passed in routing protocols
    CIDR solution for example above: organization is allocated a single /23 address (equivalent of 2 class Cs).
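    The sizing in this example can be checked with a few lines of Python. The helper name `smallest_prefix_length` is hypothetical; it simply finds the longest prefix whose block still holds the requested number of addresses.

```python
def smallest_prefix_length(num_addresses):
    """Longest IPv4 prefix length whose block holds `num_addresses` addresses."""
    host_bits = (num_addresses - 1).bit_length()  # ceil(log2(n)) for n >= 1
    return 32 - host_bits

print(smallest_prefix_length(500))  # 500 addresses fit in a /23 (512 addresses)
```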

  • Problem #2: Routing Table Size
    Without CIDR: 232.71.0.0, 232.71.1.0, 232.71.2.0, ..., 232.71.255.0
    With CIDR: 232.71.0.0/16

  • CIDR: Classless Inter-Domain Routing
    Address format: <IP address/prefix P>. The prefix denotes the upper P bits of the IP address.
    Idea - use aggregation - provide routing for a large number of customers by advertising one common prefix.
    This is possible because the nature of addressing is hierarchical
    Summarizing routing information reduces the size of routing tables while still maintaining connectivity.
    Aggregation is critical to the scalability and survivability of the Internet

  • Address Arithmetic: Address Blocks
    The address/prefix pair defines an address block.
    Examples:
    128.15.0.0/16 => [ 128.15.0.0 - 128.15.255.255 ]
    188.24.0.0/13 => [ 188.24.0.0 - 188.31.255.255 ]
    consider the 2nd octet in binary: 24 = 00011000 and 31 = 00011111 share the first 5 bits, which are the last 5 bits of the /13 prefix
    Address block sizes
    a /13 address block has $2^{32-13}$ addresses (a /16 has $2^{32-16}$)
    a /13 address block is 8 times as big as a /16 address block because $2^{32-13} = 2^{32-16} \cdot 2^{3}$
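    The block ranges and sizes above can be reproduced with Python's standard `ipaddress` module. The helper name `block_range` is hypothetical and exists only for this sketch.

```python
import ipaddress

def block_range(prefix):
    """Return (first address, last address, number of addresses) of a block."""
    net = ipaddress.ip_network(prefix)
    return str(net.network_address), str(net.broadcast_address), net.num_addresses

print(block_range("128.15.0.0/16"))
print(block_range("188.24.0.0/13"))

# Size ratio between a /13 and a /16 block: 2**(32-13) / 2**(32-16) = 2**3
ratio = block_range("188.24.0.0/13")[2] // block_range("128.15.0.0/16")[2]
print(ratio)
```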

  • CIDR: longest prefix match
    Because prefixes of arbitrary length are allowed, overlapping prefixes can exist.
    Example: router hears 124.39.0.0/16 from one neighbor and 124.39.11.0/24 from another neighbor
    Router forwards packet according to most specific forwarding information, called longest prefix match
    Packet with destination 124.39.11.32 will be forwarded using /24 entry.
    Packet with destination 124.39.22.45 will be forwarded using /16 entry
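    A minimal sketch of longest prefix match over the two prefixes in this example, using a plain linear scan with the standard `ipaddress` module. The function name and the next-hop labels "neighbor 1" and "neighbor 2" are placeholders, and real routers use much faster data structures than a scan.

```python
import ipaddress

def longest_prefix_match(table, destination):
    """Return the table value for the most specific prefix covering `destination`."""
    addr = ipaddress.ip_address(destination)
    matches = [ipaddress.ip_network(p) for p in table
               if addr in ipaddress.ip_network(p)]
    if not matches:
        return None
    best = max(matches, key=lambda net: net.prefixlen)
    return table[str(best)]

forwarding = {
    "124.39.0.0/16": "neighbor 1",   # placeholder next hops
    "124.39.11.0/24": "neighbor 2",
}
print(longest_prefix_match(forwarding, "124.39.11.32"))  # uses the /24 entry
print(longest_prefix_match(forwarding, "124.39.22.45"))  # uses the /16 entry
```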

  • Will CIDR work?
    For CIDR to be successful:
    address registries must assign addresses using CIDR strategy
    providers and subscribers should configure their networks, and allocate addresses, to allow for a maximum amount of aggregation
    BGP must be configured to do aggregation as much as possible
    Factors that complicate achieving aggregation:
    multihoming, proxy aggregation, changing providers

  • Four Basic Messages
    Open: Establishes BGP session (uses TCP port #179)
    Notification: Report unusual conditions
    Update: Inform neighbor of new routes that become active; inform neighbor of old routes that become inactive
    Keepalive: Inform neighbor that connection is still viable

  • OPEN Message
    During session establishment, two BGP speakers exchange their:
    AS numbers
    BGP identifiers (usually one of the router's IP addresses)
    hold timer value: maximum time to wait to hear something from the other end before assuming the session is down
    authentication information (optional)
    A BGP speaker has the option to refuse a session

  • NOTIFICATION and KEEPALIVE Messages
    NOTIFICATION
    indicates an error
    terminates the TCP session
    gives receiver an indication of why BGP session terminated
    Examples: header errors, hold timer expiry, bad peer AS, bad BGP identifier, malformed attribute list, missing required attribute, AS routing loop, etc.
    KEEPALIVE
    protocol requires some data to be sent periodically. If there is no UPDATE to send within the specified time period, then send a KEEPALIVE message to assure the partner that the connection is still alive

  • UPDATE Message
    used to advertise and/or withdraw prefixes
    path attributes: list of attributes that pertain to ALL the prefixes in the Reachability Info field
    FORMAT:
    Withdrawn routes length (2 octets)
    Withdrawn routes (variable length)
    Total path attributes length (2 octets)
    Path Attributes (variable length)
    Reachability Information (variable length)

  • Advertising a prefix
    When a router advertises a prefix to one of its BGP neighbors:
    information is valid until the advertising router explicitly announces that the information is no longer valid
    BGP does not require routing information to be refreshed
    if node A advertises a path for a prefix to node B, then node B can be sure node A is using that path itself to reach the destination.

  • ATTRIBUTES
    ORIGIN:
    Who originated the announcement? Where was a prefix injected into BGP?
    IGP, EGP or Incomplete (often used for static routes)
    AS-PATH:
    a list of ASs through which the announcement for a prefix has passed
    each AS prepends its AS # to the AS-PATH attribute when forwarding an announcement
    useful to detect and prevent loops

    Prefix            Next hop       AS Path
    128.73.4.21/21    232.14.63.4    1239 701 3985 631

  • AS-Path Attribute
    Sequence of ASes a route has traversed
    Loop detection
    Apply policy
    [Figure: AS 100, AS 200, AS 300, AS 400, AS 500; networks 170.10.0.0/16, 180.10.0.0/16, 150.10.0.0/16]
    Network          Path
    180.10.0.0/16    300 200 100
    170.10.0.0/16    300 200
    150.10.0.0/16    300 400

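  • Loop Handling
    [Figure: AS1 originates 172.16.10.0/24; the announcement passes through AS2, AS3, and AS4]
    AS-PATH carried by the announcement as it propagates:
    172.16.10.0/24    1
    172.16.10.0/24    2 1
    172.16.10.0/24    3 2 1
    172.16.10.0/24    4 3 2 1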
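    The loop check can be sketched in a few lines: an AS prepends its own number when it forwards an announcement and rejects any announcement whose AS-PATH already contains that number. The function names below are hypothetical.

```python
def export_path(local_as, as_path):
    """AS-PATH as advertised to an external neighbor: own AS number prepended."""
    return [local_as] + as_path

def accept_route(local_as, as_path):
    """Reject an announcement that has already passed through this AS."""
    return local_as not in as_path

# The announcement for 172.16.10.0/24 travels AS1 -> AS2 -> AS3 -> AS4.
path = []
for asn in (1, 2, 3, 4):
    path = export_path(asn, path)
print(path)                   # path as seen by AS4's neighbors
print(accept_route(1, path))  # AS1 finds itself in the path and drops it
print(accept_route(5, path))  # an AS not on the path accepts it
```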

  • AS_path Manipulation
    [Figure: AS50 originates 192.213.1.0/24; neighbors AS100, AS200, AS300]
    AS-PATHs shown in the figure:
    192.213.1.0/24    50
    192.213.1.0/24    50 50 50
    192.213.1.0/24    200 50
    192.213.1.0/24    100 50 50 50
    192.213.1.0/24    300 200 50

  • Attribute: NEXT HOP
    For EBGP session, NEXT HOP = IP address of neighbor that announced the route.
    For IBGP sessions, if route originated inside AS, NEXT HOP = IP address of neighbor that announced the route
    For routes originated outside AS, NEXT HOP of EBGP node that learned of route is carried unaltered into IBGP.

    BGP Table at Router C:
    Destination      Next Hop
    140.20.1.0/24    2.2.2.2
    209.15.1.0/24    1.1.1.1

    IP Routing Table at Router C:
    Destination      Next Hop
    140.20.1.0/24    2.2.2.2
    2.2.2.0/24       3.3.3.3
    3.3.3.0/24       Connected
    209.15.1.0/24    1.1.1.1
    1.1.1.0/24       3.3.3.3

  • Attribute: Multi-Exit Discriminator (MED)
    used when ASs are interconnected via 2 or more links
    Hint to external neighbors about the preferred path into an AS that has multiple entry points
    AS announcing prefix sets MED (enables AS2 to indicate its preference)
    AS receiving prefix uses MED to select link
    a way to specify how close a prefix is to the link it is announced on
    a lower value is preferred
    called an external metric
    [Figure: AS1, AS2, AS3; Link A and Link B, announced with MED=10 and MED=50]

  • Attribute: Local Preference
    Used to indicate preference among multiple paths for the same prefix anywhere in the Internet.
    The higher the value the more preferred
    Exchanged between IBGP peers only. Local to the AS.
    Often used to select a specific exit point for a particular destination

    BGP table at AS4:
    Destination      AS Path    Local Pref
    140.20.1.0/24    AS3 AS1    300
    140.20.1.0/24    AS2 AS1    100

  • Example
    [Figure labels: SJ, LA, ANET, ZNET, XNET, YNET, WINET, T1, T3; prefix 128.213.0.0/16 received over more than one path]
    Set local preference to 200 on one path and to 300 on the other
    Affects the traffic going out of the AS

  • Routing Process Overview
    [Figure, from page 145 of Halabi. Recoverable labels: policy engine (accept, deny, set preferences); choose best route; BGP table; IP routing table; output policy engine (forward, not forward, set MEDs)]

  • Input Policy Engine
    Inbound filtering controls outbound traffic
    filters route updates received from other peers
    filtering based on IP prefixes, AS_PATH, community
    denying a prefix means BGP does not want to reach that prefix via the peer that sent the announcement
    accepting a prefix means traffic towards that prefix may be forwarded to the peer that sent the announcement
    Attribute Manipulation
    sets attributes on accepted routes
    example: specify LOCAL_PREF to set priorities among multiple peers that can reach a given destination

  • BGP Decision Process
    1. Choose route with highest LOCAL-PREF
    2. If have more than 1 route, select route with shortest AS-PATH
    3. If have more than 1 route, select according to lowest ORIGIN type where IGP < EGP < INCOMPLETE
    4. If have more than 1 route, select route with lowest MED value
    5. Select min cost path to NEXT HOP using IGP metrics
    6. If have multiple internal paths, use BGP Router ID to break tie.
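    The six steps can be expressed as a single sort key, as in the sketch below. The `Route` class and its field names are hypothetical; the choice of the lowest router ID in step 6 is an assumption (the slide only says the ID breaks the tie), and real routers apply further conditions that this sketch ignores.

```python
from dataclasses import dataclass

ORIGIN_RANK = {"IGP": 0, "EGP": 1, "INCOMPLETE": 2}

@dataclass
class Route:
    local_pref: int
    as_path: list
    origin: str      # "IGP", "EGP" or "INCOMPLETE"
    med: int
    igp_cost: int    # IGP metric to reach the NEXT HOP
    router_id: str   # dotted-quad BGP identifier of the announcing router

def preference_key(route):
    """Smaller key = more preferred; the tuple order mirrors steps 1 to 6."""
    return (
        -route.local_pref,                 # 1. highest LOCAL-PREF
        len(route.as_path),                # 2. shortest AS-PATH
        ORIGIN_RANK[route.origin],         # 3. lowest ORIGIN type
        route.med,                         # 4. lowest MED
        route.igp_cost,                    # 5. min cost path to NEXT HOP
        tuple(int(x) for x in route.router_id.split(".")),  # 6. assumed: lowest ID
    )

def best_route(routes):
    return min(routes, key=preference_key)

# Two routes to 140.20.1.0/24 with local preferences 300 and 100.
via_as3 = Route(300, [3, 1], "IGP", 0, 10, "3.3.3.3")
via_as2 = Route(100, [2, 1], "IGP", 0, 5, "2.2.2.2")
print(best_route([via_as2, via_as3]).as_path)
```

    Because LOCAL-PREF comes first in the key, the route via AS3 wins even though its IGP cost is higher.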

  • Output Policy Engine
    Outbound filtering controls inbound traffic
    forwarding a route means others may choose to reach the prefix through you
    not forwarding a route means others must use another router to reach the prefix
    may depend upon whether E-BGP or I-BGP peer
    example: if ORIGIN=EGP and you are a non-transit AS and BGP peer is non-customer, then don't forward
    Attribute Manipulation
    sets attributes such as AS_PATH and MEDs

  • Transit vs. Nontransit AS
    Transit traffic = traffic whose source and destination are outside the AS
    Nontransit AS: does not carry transit traffic
    case 1: advertises its own routes only; does not propagate routes learned from other ASs
    Transit AS: does carry transit traffic
    case 2: advertises its own routes PLUS routes learned from other ASs

  • Clients, Providers and Peers
    AS has customers, providers and peers
    Relationships between AS pairs:
    customer-provider
    peer-to-peer
    Type of relationship influences policies
    Exporting to provider: AS exports its routes and its customers' routes, but not routes learned from other providers or peers
    Exporting to peer: (same as above)
    Exporting to customer: AS exports its routes plus routes learned from its providers and peers
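    The export rules above fit in one small predicate. The function name and the relationship labels are hypothetical, and exporting routes learned from one customer to other customers is an assumption that the slide does not state explicitly.

```python
def should_export(learned_from, export_to):
    """Decide whether a route is exported, based on AS relationships.

    learned_from: "self", "customer", "peer" or "provider"
    export_to:    "customer", "peer" or "provider"
    """
    if export_to == "customer":
        # Customers receive everything (assumption: including other
        # customers' routes, which the slide does not mention).
        return True
    # Providers and peers only receive own routes and customers' routes.
    return learned_from in ("self", "customer")

print(should_export("customer", "provider"))  # customers' routes go upstream
print(should_export("provider", "peer"))      # provider routes are not given to peers
```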

  • Internal BGP (I-BGP)
    Used to distribute routes learned via EBGP to all the routers within an AS
    I-BGP and E-BGP are the same protocol in that:
    same message types used
    same attributes used
    same state machine
    BUT they use different rules for readvertising prefixes
    Rule #1: prefixes learned from an E-BGP neighbor can be readvertised to an I-BGP neighbor, and vice versa
    Rule #2: prefixes learned from an I-BGP neighbor cannot be readvertised to another I-BGP neighbor

  • I-BGP: Preventing Loops and Setting Attributes
    Why rule #2? To prevent announcements from looping.
    In E-BGP, loops can be detected via AS-PATH.
    AS-PATH is not changed in I-BGP
    Implication of rule: a full mesh of I-BGP sessions between each pair of routers in an AS is required
    Setting Attributes: the router that injects the route into the I-BGP mesh is responsible for:
    setting the LOCAL-PREF attribute
    prepending AS # to AS-PATH

  • Route Reflectors
    Problem: requiring a full mesh of I-BGP sessions between all pairs of routers is hard to manage for large ASs.
    Solution: group routers into clusters. Assign a leader to each cluster, called a route reflector (RR).
    Members of a cluster are called clients of the RR
    I-BGP Peering
    clients peer only with their RR
    RRs must be fully meshed
    [Figure: clusters of clients, each attached to an RR; the RRs are meshed]

  • Route Reflectors: Rule on Announcements
    If received from RR, reflect to clients
    If received from a client, reflect to RRs and clients
    If received from E-BGP, reflect to all - RRs and clients
    RRs reflect only the best route to a given prefix, not all announcements they receive.
    helps reduce the size of the routing table
    sometimes clients don't need to carry the full table

  • Default Routes
    If you don't have a routing entry in the table for a destination, send the packet along the default route
    Can be statically configured
    Can be learned via BGP
    Can have multiple defaults
    use LOCAL_PREF to distinguish primary and backup default routes
    [Figure: AS1 and AS2; two default routes, 0.0.0.0/0 next_hop=1.1.1.1 and 0.0.0.0/0 next_hop=2.2.2.2, with Local pref=100 and Local pref=300]

  • Multihoming

  • Single-homed vs. Multi-homed subscribers
    A single-homed network has one connection to the Internet (i.e., to networks outside its domain)
    A multi-homed network has multiple connections to the Internet. Two scenarios:
    can be multi-homed to a single provider
    can be multi-homed to multiple providers
    Why multi-home?
    Reliability
    Performance: a site's bandwidth to the Internet is the sum of the bandwidth on all links

  • Single-homed AS
    Subscriber called a "stub AS"
    Provider-Subscriber communication for route advertisement:
    static configuration (most common): provider's router is configured with subscriber's prefix
    good if customer's routes can be represented by a small set of aggregate routes
    bad if customer has many noncontiguous subnets
    can use BGP between provider and stub AS
    [Figure: subscriber router R2 connected to provider router R1]

  • Multihoming to a Single Provider: 4 scenarios

  • Multihoming to Multiple Providers

  • Multihoming Issues
    Load sharing
    how to distribute the traffic over the multiple links?
    Reliability
    if load sharing leads to preferring certain links for certain subnets, is reliability reduced?
    Address/Aggregation
    which subnet addresses should the multihomed customer use?
    how will this affect its provider's ability to aggregate routes?

  • Load sharing from Customer to ISP using policy
    Goal: send traffic to ISP's customers on one link; send traffic to the rest of the Internet on 2nd link
    Implement using policy to control announcements
    [Figure: customer router R3 connected to ISP routers R1 and R2; one ISP router advertises customer routes only, the other advertises the default route 0/0. Legend: blue = announcements, red = traffic]

  • Load sharing from ISP to customer using attributes
    Goal: provider splits traffic across 2 links according to prefix
    Implement this strategy using attributes
    customer sets MEDs
    provider sets LOCAL_PREF

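  • Example
    [Figure labels: sites SF and NY connected by IBGP; customers C2, C3, C4, C5; prefix groups (X,Y) and (Z,W)]
    Attribute settings shown in the figure:
    C5: 300, C4: 300, Rest: 250
    C3: 300, C2: 300, Rest: 200
    X: 200, Y: 200, Rest: 300
    W: 200, Z: 200, Rest: 250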

  • Example
    [Figure labels: R2, Customer, R3]

  • Address/Aggregation Issue
    Where should the customer get its address block from?
    1. From ISP1
    2. From ISP2
    3. From both ISP1 and ISP2
    4. Independently from address registry
    (cases 1 and 2 are equivalent)

  • Case 1 & 2: Get address block from one ISP
    example: customer gets address from ISP 1
    ISP 1's aggregation is not broken
    customer's prefix not aggregatable at ISP 2
    longer prefix becomes a traffic magnet
    How good is load sharing? If all ISPs generate the same amount of traffic for the customer, then the ISP2-customer link is twice as loaded as the ISP1-customer link

  • Case 3: Get address block from both ISPs
    announcement policy: announce prefix only to its parent
    advantage: both ISPs can aggregate the prefix they receive
    disadvantage: lose reliability
    load balancing good? depends upon how much traffic is sent to each prefix
    [Figure: ISP blocks 140.20/16 and 200.50/16; customer prefixes 140.20.1/24 and 200.50.1/24]

  • Case 4: obtain address block from registry
    no aggregation possible
    no traffic magnets created
    good reliability
    how to achieve load sharing?
    customer breaks address block into 2 /25 blocks, and announces one per link (but may lose reliability)
    OR use method of AS-PATH manipulation
    [Figure: customer prefix 150.55.10/24 announced on both links]

  • AS-PATH manipulation
    Idea: prepend your AS number in AS-PATH multiple times to discourage use of a link
    makes AS-PATH seem longer than it is
    recall that the BGP decision process uses shortest AS-PATH length as a criterion for selecting the best path
    Example: ISP 3 will choose the path through AS 2 because its AS-PATH appears shorter
    [Figure residue: 150.55.10/24 - 33]

  • Aggregation

  • Address Arithmetic: When is aggregation possible? Case 1
    Possible when one prefix is contained in another.
    Example: two ASs having a customer-provider relationship. Provider does the aggregation.
    Provider has address block 18.0.0.0/8
    Its customers have address blocks 18.6.0.0/15 and 18.9.0.0/15
    Provider announces its address block only
    Rule: prefix p1 contains prefix p2 iff length(p2) > length(p1) AND address(p2) / $2^{32-\text{length}(p1)}$ = address(p1) / $2^{32-\text{length}(p1)}$ (integer division)
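    The containment rule can be written out directly with integer shifts, as in this sketch. The function name `contains` is hypothetical. The second test prefix, 19.6.0.0/15, is made up to show a negative case, and the slide's 18.9.0.0/15 is not used because `ipaddress.ip_network` rejects prefixes with host bits set by default.

```python
import ipaddress

def contains(p1, p2):
    """True if prefix p1 contains prefix p2, following the rule on the slide."""
    n1, n2 = ipaddress.ip_network(p1), ipaddress.ip_network(p2)
    shift = 32 - n1.prefixlen  # dividing by 2**shift keeps the upper length(p1) bits
    same_upper_bits = (int(n2.network_address) >> shift
                       == int(n1.network_address) >> shift)
    return n2.prefixlen > n1.prefixlen and same_upper_bits

print(contains("18.0.0.0/8", "18.6.0.0/15"))  # customer block inside provider block
print(contains("18.0.0.0/8", "19.6.0.0/15"))  # different first octet, not contained
```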

  • Address Arithmetic: When is aggregation possible? Case 2
    Some pairs of consecutive prefixes
    Example: routes within the same AS. AS has 2 address blocks:
    1.2.2.0/24 = 00000001.00000010.00000010.00000000/24
    1.2.3.0/24 = 00000001.00000010.00000011.00000000/24
    can announce instead 1.2.2.0/23
    Rule: consecutive prefixes p1 and p2 are aggregatable iff length(p1) = length(p2) AND address(p1) / $2^{32-\text{length}(p1)}$ + 1 = address(p2) / $2^{32-\text{length}(p2)}$ AND address(p1) % $2^{33-\text{length}(p1)}$ = 0
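    A sketch of the pairwise rule, again with the standard `ipaddress` module. The function name `aggregate_pair` is hypothetical, and returning `None` for non-aggregatable pairs is a choice made for this example.

```python
import ipaddress

def aggregate_pair(p1, p2):
    """Return the aggregate of two consecutive prefixes, or None if the rule fails."""
    n1, n2 = ipaddress.ip_network(p1), ipaddress.ip_network(p2)
    if n1.prefixlen != n2.prefixlen or n1.prefixlen == 0:
        return None
    shift = 32 - n1.prefixlen
    a1, a2 = int(n1.network_address), int(n2.network_address)
    consecutive = (a1 >> shift) + 1 == (a2 >> shift)
    aligned = a1 % (1 << (shift + 1)) == 0  # address(p1) % 2**(33 - length) == 0
    if consecutive and aligned:
        return str(n1.supernet())  # one bit shorter than the inputs
    return None

print(aggregate_pair("1.2.2.0/24", "1.2.3.0/24"))  # aggregatable pair
print(aggregate_pair("1.2.3.0/24", "1.2.4.0/24"))  # consecutive but misaligned
```

    The second pair fails the alignment test: 1.2.3.0 does not start on a /23 boundary, so no single /23 covers exactly those two blocks.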

  • Aggregation and Multihoming
    Most common scenario: customer breaks its address block in 2 for load sharing purposes, YET also advertises the whole block for reliability.
    [Figure: customer (AS 2) announces 1.2.2.0/23 on both links, plus 1.2.2.0/24 on one link and 1.2.3.0/24 on the other]

  • Black holes and cardinal sins
    The cardinal sin of BGP routing is advertising routes that you don't know how to get to; called "black-holing" someone
    If you announce part of the IP space owned by someone else, using a more-specific prefix, all their traffic flows to you. This disconnects that address block from the Internet.
    Example: black holes can happen inadvertently through careless aggregation
    [Figure: ISP 1 (100.24/16) and ISP 2 (222.2/16) announce their own blocks; an announcement of 100.24.0.0/18 is marked "wrong !!"]

  • Limitations to Aggregation
    Lack of hierarchical allocation of address space prior to CIDR (before 1995)
    A single AS can have noncontiguous address blocks
    Customer AS and provider AS can have noncontiguous address blocks
    Reluctance of customers to renumber their address space when they switch providers
    Multi-homing
    multi-homed prefixes require global visibility
    may choose not to aggregate: load sharing

  • Rules of Thumb for Aggregation
    To avoid black holes: an ISP is not allowed to aggregate routes unless it is a supernet of the address block it wants to aggregate
    In other words, an ISP has to specifically announce each IP address range of its downstream customers that is not part of its own address space.

  • Routing Instability

  • Route Stability
    Routing instability: rapid fluctuation of network reachability information
    route flapping: when a route is withdrawn and re-announced repeatedly in a short period of time
    happens via UPDATE messages
    because messages propagate to the global Internet, route flapping behavior can cascade and deteriorate routing performance in many places
    Effects: increased packet loss, increased network latency, CPU overhead, loss of connectivity
    Route Stability Studies by C. Labovitz, R. Malan & F. Jahanian

  • Types of Routing Updates
    Forwarding instability
    reflects legitimate topology changes, e.g., changes in Prefix, NEXT_HOP and/or ASPATH
    affects forwarding paths used
    Policy fluctuation
    reflects changes in policy, e.g., changes in MED, LOCAL_PREF, etc.
    may not necessarily affect forwarding paths used
    Pathological
    redundant messages
    reflect neither topology nor policy changes

  • Anecdotes of Route Flap Storms
    April 25, 1997 - small Virginia ISP injected incorrect map into global Internet. Map said Virginia ISP had optimal connectivity to all destinations. Everyone sent their traffic to this ISP. Result: shutdown of Tier-1 ISPs for 2 hours.
    August 14, 1998 - misconfigured database server forwarded all queries to .net to wrong server. Result: loss of connectivity to all .net servers for a few hours.
    Nov. 8, 1998 - router software bug led to malformed routing control message. Caused interoperability problem between Tier-1 ISPs. Result: persistent pathological oscillations and connectivity loss for several hours.

  • Taxonomy (as per Labovitz et al.)

    Name     Type                                                                               Character
    WADiff   Explicit withdrawal followed by announcement. Replace route with different path.   Legitimate
    AADiff   Announced twice (implicit withdrawal). Replace route with different path.          Legitimate
    WADup    Explicit withdrawal followed by announcement. Replace route with same path.        Legitimate or pathological
    AADup    Announced twice (implicit withdrawal). Replace route with same path.               Policy change or pathological
    WWDup    Repeated duplicate withdrawals.                                                    Pathological

  • General Statistics
    1996: 3-5 million updates per day in Internet core
    1998: 300K-700K updates per day in Internet core
    1996: average number of announcements per day was ~275K
    1998: average number of announcements per day was ~400K
    Correlation of instability and usage
    instability highest during business hours
    instability lowest during nights, on weekends and in summer

  • Per Event Type Statistics
    1996 relative impact (approximately): WWDup (96%), AADup (2%), WADup (1%), AADiff (1/2%), WADiff (1/2%)
    1998 relative impact (approximately): AADup (30-40%), WWDup (25-30%), AADiff (~20%), other (rest)

  • Who's Responsible?
    ASs
    No single AS dominates instability statistics
    No correlation between the size of an AS and its share of updates generated.
    Prefixes
    Instability is evenly distributed across routes.
    Example of measurements:
    75% of AADiff events come from prefixes that change fewer than 10 times a day.
    80-90% of instability comes from prefixes that are announced fewer than 50 times a day.

  • Sources of Instabilities in General
    router configuration errors
    transient physical and data link problems
    software bugs
    problems with leased lines (electrical timing issues that cause false alarms of disconnect)
    router failures
    network upgrades and maintenance

  • Instability Problem and Cause. Example 1.
    Problem: 3-5 million duplicate withdrawals
    Cause: stateless BGP implementation
    time-space tradeoff: no state maintained on what was advertised to peers
    on receiving any change, transmit a withdrawal to all peers regardless of whether they were previously notified
    sent updates for both explicit and implicit withdrawals
    By 1998, most vendors had BGP implementations with partial state.
    Result: number of WWDups reduced by an order of magnitude

  • Instability Problem and Cause. Example 2
    Problem: duplicate announcements
    Cause: min-advertisement timer & stateless BGP
    min-adv timer: wait 30 seconds. Combine all updates received in the last 30 seconds into a single outbound update message (if possible).
    within 30 seconds a route can be withdrawn and re-announced so that there is no net change to the original announcement
    Solution: have BGP keep some state about messages recently sent to peers. Avoid sending duplicate messages

  • Instability Problem and Cause. Example 3
    Example: interaction IGP/BGP
    policy: set MED using IGP metrics, such as shortest path
    [Figure: AS 1 with border routers and internal routers running an IGP, connected by E-BGP to AS2 and AS3; Net X; link metrics 10, 2, 3, 4, 5, 6]

  • Controlling route instability: Route Dampening
    track number of times a route has flapped over a period of time
    routes that exhibit a high level of instability in a short period of time should be suppressed (not advertised)
    penalize ill-behaved routes in proportion to their expected future instability
    if a suppressed route stops flapping for a long enough period of time, unsuppress it (readvertise)

  • Route Dampening
    [Figure: penalty plotted against time, with horizontal lines for the suppress limit and the reuse limit]

  • Route Dampening Algorithm
    Each time a route flaps, increase the penalty
    If the route has not flapped in the last "half-life" time units, then cut the penalty in half.
    If the penalty > suppress limit, then suppress the route
    If the penalty < reuse limit, then free a suppressed route

  • Side Effect of Route Dampening
    A legitimate update may arrive and be ignored because that route has been suppressed and not yet released.
    The modification needed for the legitimate announcement is delayed

  • Aggregation can help route flapping
    If a more-specific route is flapping, but the provider only announces the aggregated prefix, then other networks don't see the route flap. Hence aggregation can mask route flapping.
    Aggregation helps combat instability because it reduces the number of networks visible in the core Internet.
    [Figure: 140.40.4/24 is flapping; the announced aggregate 140.40/16 is not flapping]

  • Growth of BGP Tables

  • Long Term Growth Trends in Internet Routing
    Will this routing system be able to scale and meet the growth of the Internet and its ever-expanding level of demands?
    Are there any inherent limitations?
    As more devices connect to the Internet and consume addresses, the need to maintain reachability to these addresses implies larger routing tables
    What is the ability of the system to produce a stable view of the overall network topology?

  • BGP Table Growth (1989-2001)

  • AS Number Growth

  • Reasons for Exponential Growth
    Data in last 3 slides from Geoff Huston's INET publications

    Measures                               Growth
    Number of ASs                          Growing exponentially - 50% yearly
    Range of addresses spanned by table    Growing 7% yearly
    Average span of route table entry      Growing 1 bit per 29 months
    Holes in the table                     Currently at 37%
    Number of announcements                Growing 42% yearly

  • Increasingly fine levels of routing detail in BGP table
    AS space growing at 50% & addresses spanned by the table growing at 7% => each AS is advertising smaller address ranges
    /24 is the fastest-growing prefix length in the table (in absolute terms)
    /24 - /31 is the area with fastest relative growth
    1999: average span of prefix 16,000 addresses (mean prefix length 18.03)
    2000: average span of prefix 12,000 addresses (mean prefix length 18.44)

  • Holes
    When both an aggregated prefix and a more-specific prefix exist in the table, the more-specific prefix is called a hole.
    Why are holes created?
    Punch a hole in the policy of the larger aggregated announcement to create a different policy for the finer announcement.
    Load sharing in multihoming scenario
    reliability via multihoming
    37% of BGP table is holes.

  • Multihoming vs. Resiliency
    Multiple peering relationships can be cheaper than using a single upstream provider
    implies: multihoming is seen as a substitute for upstream service resiliency
    Impact: providers can't command a price for reliability, and thus don't spend money to engineer it into the network.
    resiliency is becoming the responsibility of the customer, not the provider
    Can BGP scale adequately to continue to undertake this role?

  • Conclusions
    Things are getting better (stability)
    router software and configuration management are maturing
    increased emphasis on aggregation and route dampening is helping
    Things are getting worse (scalability)
    multihoming is still growing
    Internet topology is growing less hierarchical
    noise in BGP table is growing

  • Longest Prefix Match Algorithms

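  • Forwarding Engine
    [Figure: a packet (header, payload) enters the router; the destination address is looked up in the forwarding table via the routing lookup data structure to find the outgoing port]
    Forwarding Table:
    Dest-network      Port
    65.0.0.0/8        3
    128.9.0.0/16      1
    149.12.0.0/19     7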

  • The Search Operation is not a Direct Lookup
    [Figure: (incoming port, label) used as a memory address; the data read out is (outgoing port, label)]
    IP addresses: 32 bits long => 4G entries

  • The Search Operation is also not an Exact Match Search
    Exact match search: search for a key in a collection of keys of the same length.
    Relatively well studied data structures:
    Hashing
    Balanced binary search trees

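  • Example Forwarding Table

    [Figure: number line of the address space from 0 to 2^32 - 1, with 2^24 marked. An IP prefix is 0-32 bits long, and the width of each range reflects its prefix length. 65.0.0.0/8 spans 65.0.0.0 to 65.255.255.255; 128.9.0.0/16 and 142.12.0.0/19 are also shown.]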

    Destination IP Prefix    Outgoing Port
    65.0.0.0/8               3
    128.9.0.0/16             1
    142.12.0.0/19            7

  • Prefixes can Overlap

    [Figure: number line from 0 to 2^32 - 1 showing 65.0.0.0/8, 128.9.0.0/16 and 142.12.0.0/19, together with the more-specific prefixes 128.9.16.0/21, 128.9.172.0/21 and 128.9.176.0/24; the longest matching prefix is marked.]

    Routing lookup: find the longest matching prefix (aka the most specific route) among all prefixes that match the destination address.
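    The routing lookup rule above can be sketched as a linear scan that keeps the matching prefix with the greatest length. The ports for the three base prefixes come from the example forwarding table. The port 5 assigned to 128.9.16.0/21 is a made-up value for illustration, and `longest_prefix_match` is a hypothetical name. A real forwarding engine would not scan the table linearly.

```python
import ipaddress

# Ports 3, 1 and 7 are from the example table; port 5 is a hypothetical value.
FORWARDING_TABLE = {
    "65.0.0.0/8": 3,
    "128.9.0.0/16": 1,
    "142.12.0.0/19": 7,
    "128.9.16.0/21": 5,
}


def longest_prefix_match(table, address):
    """Return (prefix, port) for the most specific matching prefix, or None."""
    addr = ipaddress.ip_address(address)
    best = None
    for prefix, port in table.items():
        net = ipaddress.ip_network(prefix)
        # Keep this entry if it matches and is longer than the best so far.
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, port)
    return None if best is None else (str(best[0]), best[1])
```

    For example, 128.9.16.14 matches both 128.9.0.0/16 and 128.9.16.0/21, and the /21 entry wins because it is longer.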

  • Difficulty of Longest Prefix Match

    [Figure: the prefixes 128.9.0.0/16, 142.12.0.0/19, 65.0.0.0/8, 128.9.172.0/21, 128.9.176.0/24 and 128.9.16.0/21.]

  • Lookup Rate Required

    Year       Line      Line-rate (Gbps)    40B packets (Mpps)
    1998-99    OC12c     0.622               1.94
    1999-00    OC48c     2.5                 7.81
    2000-01    OC192c    10.0                31.25
    2002-03    OC768c    40.0                125
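    The Mpps column follows from dividing the line rate by the size of a minimum-length 40-byte packet (320 bits), assuming back-to-back packets with no framing overhead. A short sketch of that arithmetic (the function name is hypothetical):

```python
PACKET_BYTES = 40  # minimum-size packet assumed in the table


def lookups_mpps(line_rate_gbps, packet_bytes=PACKET_BYTES):
    """Packets (and therefore lookups) per second, in millions, at the given line rate."""
    bits_per_packet = packet_bytes * 8
    return line_rate_gbps * 1e9 / bits_per_packet / 1e6


# Line rates in Gbps, taken from the table.
rates = {
    line: lookups_mpps(gbps)
    for line, gbps in [("OC12c", 0.622), ("OC48c", 2.5), ("OC192c", 10.0), ("OC768c", 40.0)]
}
```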

  • Size of the Forwarding Table

    [Figure: number of prefixes (y-axis) by year, 95 to 00 (x-axis). Source: http://www.telstra.net/ops/bgptable.html]

    [Chart data omitted: 300 samples of the number of prefixes, rising from 21,443 at the first sample to 91,078 at the last.]

  • Longest Prefix Match is Harder than Exact Match

    The destination address of an arriving packet does not carry with it the information needed to determine the length of the longest matching prefix.

    Hence, one needs to search among the space of all prefix lengths, as well as the space of all prefixes of a given length.

  • LPM in IPv4

    Use 32 exact match algorithms for LPM!

    [Figure: the address is exact-matched against prefixes of length 1, length 2, ..., length 32; the results are priority encoded to pick the longest match, which gives the port.]
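    A minimal software sketch of this idea keeps one hash table per prefix length and probes them from longest to shortest, which plays the role of the priority encoder. In hardware the exact matches would run in parallel. The class name, the sequential probing, and port 5 are illustrative assumptions. IPv4 is assumed throughout.

```python
import ipaddress


class PerLengthLPM:
    """Longest prefix match built from one exact-match table per prefix length."""

    def __init__(self):
        # tables[L] maps the top L bits of an address (as an int) to a port.
        self.tables = [dict() for _ in range(33)]

    def insert(self, prefix, port):
        net = ipaddress.ip_network(prefix)
        key = int(net.network_address) >> (32 - net.prefixlen)
        self.tables[net.prefixlen][key] = port

    def lookup(self, address):
        addr = int(ipaddress.ip_address(address))
        # Probe longest lengths first so the first hit is the longest match.
        for length in range(32, 0, -1):
            port = self.tables[length].get(addr >> (32 - length))
            if port is not None:
                return port
        return self.tables[0].get(0)  # default route, if one was inserted


lpm = PerLengthLPM()
for prefix, port in [("65.0.0.0/8", 3), ("128.9.0.0/16", 1),
                     ("142.12.0.0/19", 7), ("128.9.16.0/21", 5)]:  # port 5 is hypothetical
    lpm.insert(prefix, port)
```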

  • Metrics for Lookup Algorithms

    Speed (= number of memory accesses)

    Storage requirements (= amount of memory)

    Low update time (support >10K updates/s)

    Scalability with length of prefix: IPv4 unicast (32b), Ethernet (48b), IPv4 multicast (64b), IPv6 unicast (128b)

    Scalability with size of routing table: (sweet spot for today's designs = 1 million)

    Flexibility in implementation

    Low preprocessing time

  • Radix Trie

    [Figure: binary trie with nodes A-H, branches labeled 0 and 1, and the nodes holding prefixes P1-P4 marked. Trie node: next-hop-ptr (if prefix), left-ptr, right-ptr.]

    Prefix    Bits     Next hop
    P1        111*     H1
    P2        10*      H2
    P3        1010*    H3
    P4        10101    H4

  • Radix Trie

    W-bit prefixes: O(W) lookup, O(NW) storage and O(W) update complexity

    Advantages: simplicity; extensible to wider fields

    Disadvantages: worst-case lookup slow; wastage of storage space in chains
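    A minimal sketch of a one-bit-at-a-time trie, using the P1-P4 prefixes from the slides. Prefixes and addresses are given as bit strings for readability. The class and method names are illustrative, not taken from the slides. The lookup remembers the last next hop seen on the way down, which is what makes it a longest prefix match.

```python
class TrieNode:
    __slots__ = ("children", "next_hop")

    def __init__(self):
        self.children = [None, None]  # left (bit 0) and right (bit 1) pointers
        self.next_hop = None          # set only if a prefix ends at this node


class BinaryTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, prefix_bits, next_hop):
        node = self.root
        for bit in prefix_bits:
            i = int(bit)
            if node.children[i] is None:
                node.children[i] = TrieNode()
            node = node.children[i]
        node.next_hop = next_hop

    def lookup(self, address_bits):
        node = self.root
        best = node.next_hop
        for bit in address_bits:  # at most W steps for a W-bit address
            node = node.children[int(bit)]
            if node is None:
                break
            if node.next_hop is not None:
                best = node.next_hop  # remember the longest match so far
        return best


trie = BinaryTrie()
for bits, hop in [("111", "H1"), ("10", "H2"), ("1010", "H3"), ("10101", "H4")]:
    trie.insert(bits, hop)
```

    The path 1-0-1-0-1 needed for P4 contains a node ("101") that holds no prefix, which illustrates the storage wasted in chains.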

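  • Leaf-pushed Binary Trie

    [Figure: binary trie with nodes A-G, branches labeled 0 and 1, and the prefixes P1-P4 pushed down to the leaves (P2 appears at more than one leaf). Trie node: left-ptr or next-hop, right-ptr or next-hop.]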

    Prefix    Bits     Next hop
    P1        111*     H1
    P2        10*      H2
    P3        1010*    H3
    P4        10101    H4

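  • PATRICIA

    PATRICIA: Practical Algorithm To Retrieve Information Coded In Alphanumeric

    [Figure: Patricia tree with nodes A-G; internal nodes are labeled with bit positions (2, 3, 5) and leaves hold P1-P4. Example: lookup of 10111, bit positions 1-5. Patricia tree internal node: bit-position, left-ptr, right-ptr.]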

    Prefix    Bits     Next hop
    P1        111*     H1
    P2        10*      H2
    P3        1010*    H3
    P4        10101    H4

  • PATRICIA

    W-bit prefixes: O(W^2) lookup, O(N) storage and O(W) update complexity

    Advantages: decreased storage; extensible to wider fields

    Disadvantages: worst-case lookup slow; backtracking makes implementation complex

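  • Path-compressed Tree

    [Figure: path-compressed tree with nodes A-E; each node carries a bit string, an optional prefix and a bit position (for example 10, P2, 4 and 1010, P3, 5), with leaves P1 and P4. Path-compressed tree node structure: variable-length bitstring, next-hop (if prefix present), bit-position, left-ptr, right-ptr.]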

    Prefix    Bits     Next hop
    P1        111*     H1
    P2        10*      H2
    P3        1010*    H3
    P4        10101    H4

  • Path-compressed Trie

    W-bit prefixes: O(W) lookup, O(N) storage and O(W) update complexity

    Advantages: decreased storage

    Disadvantages: worst-case lookup slow

  • Binary Search by Prefix Intervals

    Prefixes: {}, 1, 10, 100, 101, 1110

    {}:   00000-11111
    1:    10000-11111
    10:   10000-10111
    100:  10000-10011
    101:  10100-10111
    1110: 11100-11101

    Lookup example: 1100

  • Rules

    If the destination address we search for fits to the right of a left parenthesis, the prefix represented by that left parenthesis is the longest match.

    Otherwise, increment for each right parenthesis and decrement for each left parenthesis until the count reaches -1.

  • Example revisited

    [Figure: lookup of 1100, with counter values 1, 2, 1, 2, 1, 0, -1.]

    1100 is in the interval 10000-11111, so the longest prefix match is the prefix 1.

  • Binary Search on Intervals

    Advantages: storage is linear; can be balanced; lookup time independent of W

    Disadvantages: lookup time is dependent on N; incremental updates more complex than tries; each node is big in size, which requires higher memory bandwidth

    W-bit N prefixes: O(logN) lookup, O(N) storage
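    The sketch below illustrates the O(logN) lookup with a simplified variant, not the parenthesis-counting rule from the slides. It precomputes the longest matching prefix for every elementary interval between range boundaries, and a lookup is then a single binary search. The 5-bit width and prefix set come from the slides' example. The function names, and the padding of the example address to the 5-bit value 11000, are assumptions.

```python
from bisect import bisect_right

W = 5  # address width used in the slides' example
PREFIXES = ["", "1", "10", "100", "101", "1110"]  # "" is the empty prefix {}


def prefix_range(prefix, width=W):
    """Lowest and highest address covered by a prefix given as a bit string."""
    pad = width - len(prefix)
    return int(prefix + "0" * pad, 2), int(prefix + "1" * pad, 2)


def build_intervals(prefixes, width=W):
    ranges = {p: prefix_range(p, width) for p in prefixes}
    # The set of matching prefixes can only change where a range starts
    # or just after a range ends.
    starts = {0}
    for lo, hi in ranges.values():
        starts.add(lo)
        if hi + 1 < 2 ** width:
            starts.add(hi + 1)
    starts = sorted(starts)
    best = []
    for s in starts:
        matching = [p for p, (lo, hi) in ranges.items() if lo <= s <= hi]
        best.append(max(matching, key=len) if matching else None)
    return starts, best


def lookup(starts, best, address_bits):
    """Binary search for the elementary interval containing the address."""
    idx = bisect_right(starts, int(address_bits, 2)) - 1
    return best[idx]


starts, best = build_intervals(PREFIXES)
```

    With this table, `lookup(starts, best, "11000")` returns `"1"`, consistent with the example's conclusion that the longest match is the prefix 1.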

    BGP is a protocol that runs between two nodes. It uses TCP as its transport protocol: two routers establish a BGP session over TCP and then exchange BGP messages.

    During session establishment, routers identify themselves. Each neighbor can accept or reject the connection.

    Instead of announcing all routes individually ...

    The 2nd example shows how it works on arbitrary bit boundaries.

    Need to take advantage of the addressing scheme.

    Last bullet: as is typically done in DV protocols.

    Because it is communicated to all IBGP routers within the AS, all routers have a common view of how to exit the AS.

    This differs from MED in 2 ways: (1) the destination prefix can be anywhere in the Internet, not just in the next AS (as is the case for MED); (2) the AS that sets Local_Pref is also the one that uses it. This allows one node to tell everyone locally what the best way out is.

    MED can't be used in this example because there is exactly one connection between any pair of ASs.

    Routes used by the router are the best routes, and these are the ones that become candidates to readvertise.

    Routing policy is used to enforce business agreements.

    As soon as you become multihomed, you have to decide what your policy is about transit traffic.

    I-BGP sessions need not correspond to physical links; the logical TCP session can traverse a few hops. E-BGP sessions usually do correspond to a physical link.

    Clusters should be set up to reflect the underlying topology.

    If an RR goes down then the whole cluster is disconnected. For reliability purposes, most clusters are configured with 2 RRs where one functions as backup.

    This choice has a huge impact on load sharing.

    Traffic magnets: all parts of the Internet see the same length prefix.

    Should give some examples.

    For case 2, 1.2.2.0/24 and 1.2.3.0/24 are aggregatable, BUT 1.2.7.0/24 and 1.2.8.0/24 are not.

    WADup is oscillating reachability announcements.

    In 1998 things are more normal because there are now more announcements than withdrawals.

    The stateless BGP implementation was compliant with the standard.

    1. Net X is announced thru R7 only. R4 and R3 set MED to be the shortest path according to the ISIS metric. AS1 chooses the path thru R4. All is fine.

    2. Assume Net X is announced via both R7 and R8. If link R5-R7 oscillates, then the MEDs announced over the EBGP session between R4-R1 will change.

    The growth of holes is a significant driver of overall table growth.

    In the real world, there are essentially 3 dimensions: cost vs features vs time-to-design.
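    The aggregation note for case 2 above can be checked mechanically. Two /24s merge into one /23 only when they are the two halves of the same /23, meaning they differ only in the last bit of the third octet. A small sketch using the standard library's `ipaddress.collapse_addresses` (the wrapper name `aggregate` is hypothetical):

```python
import ipaddress


def aggregate(prefixes):
    """Merge adjacent prefixes into the shortest equivalent list."""
    nets = [ipaddress.ip_network(p) for p in prefixes]
    return [str(n) for n in ipaddress.collapse_addresses(nets)]


mergeable = aggregate(["1.2.2.0/24", "1.2.3.0/24"])      # halves of 1.2.2.0/23
not_mergeable = aggregate(["1.2.7.0/24", "1.2.8.0/24"])  # adjacent but not aligned
```

    1.2.7.0/24 and 1.2.8.0/24 are numerically adjacent, but 7 (0111) and 8 (1000) differ in all four low bits of the third octet, so no single prefix covers exactly those two blocks.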

    In future presentations, the slides containing the advantages and disadvantages need to be merged into the main slides.