INFRASTRUCTURE
Edge Fabric: Steering Oceans of Content to the World
Robel Kitaba, Network Engineer, Facebook
Locations are for visualization purposes only; they do not reflect the current configuration.
Global Load Balancer: manages ingress traffic
Latency-based telemetry (SONAR)
[Figure labels: PX, PNI, Transit, Network, Backbone]
PoP: Point of Presence (colo facilities)
PNI Links: Direct peering with user networks
PX Links: Peering with networks over shared infrastructure
Transit Links: Peering with intermediate networks that provide global reachability
[Figure: 1-day plot of total egress capacity at PoP and total traffic at PoP, then capacity and demand for one interface (iface@PoP); annotations: ">250%", "Drops"]
Why demand exceeds capacity
Peering with other networks using BGP
Local Preference
MED
AS Path length
Communities
Traffic demand changes
Limited capacity
Performance variations
Transient failures
BGP (STATIC) vs. REALITY (DYNAMIC)
[Figure: egress links at the PoP, labeled "best BGP path", "Unused", and "Overloaded"]
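The attributes listed above feed BGP's decision process. The following is a minimal Python sketch of its first few tie-breakers (highest Local Preference, then shortest AS path, then lowest MED). `Route` and `best_path` are hypothetical names and the attribute values are illustrative. Real routers apply more steps than this, and MED is normally compared only between routes from the same neighboring AS. The point of the sketch is that nothing in the comparison looks at link load or demand.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Route:
    prefix: str
    local_pref: int
    as_path: Tuple[int, ...]
    med: int
    egress: str  # hypothetical label for the egress link type


def best_path(routes: List[Route]) -> Route:
    # Simplified ordering: highest LocalPref, then shortest AS path, then lowest MED.
    # Link capacity and current demand play no part in the comparison.
    return min(routes, key=lambda r: (-r.local_pref, len(r.as_path), r.med))


routes = [
    Route("1.2.3.0/24", 500, (100,), 0, "PNI"),
    Route("1.2.3.0/24", 200, (7018, 100), 0, "Transit"),
]
chosen = best_path(routes)
print(chosen.egress)
```

Because the result depends only on route attributes, the same path keeps winning even when its link is overloaded and other links sit unused.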
Local Edge Controller: Edge Fabric
"Engineering Egress with Edge Fabric: Steering Oceans of Content to the World", Brandon Schlinker et al, SIGCOMM 2017
LOCAL CONTROLLER’S JOURNEY
V0: Manual interventions to change BGP policy when there were failures in PNIs. Not scalable, too slow, error prone.
V1: Set up MPLS paths from end hosts to PRs in order to choose egress links. Restrictions on hardware.
V2: Use DSCP marking at the end hosts to indicate the egress link. Not scalable, coordination of config, rigid assumptions.
V3: Use GRE tunnels from end hosts to PRs. Coordination of config, vendor bug.
V4: Use BGP injections at PRs. Flexible, dynamic, decouples decisions from PoP architecture.
[Figure: racks in the edge cluster, the peering router (PR), and the local controller, with PNI, PX, and Transit links (Transit 1, Transit 2) toward Network 1]
BGP INJECTION MODE
The PEERING ROUTER has two routes to 1.2.3.0/24:
PNI: Dest 1.2.3.0/24, LocalPref 500, ASPath 100, Nexthop 42.1.3.1, Community 100:1
TRANSIT: Dest 1.2.3.0/24, LocalPref 200, ASPath 7018,100, Nexthop 201.2.4.12, Community 7018:1
The PNI route has the higher LocalPref, so it is the best BGP path.
The EF CONTROLLER holds the same two routes and has a BGP session with the PEERING ROUTER.
To shift 1.2.3.0/24 to transit, the EF CONTROLLER injects the transit route over that session with a higher LocalPref:
Dest 1.2.3.0/24, LocalPref 50000, ASPath 7018,100, Nexthop 201.2.4.12, Community 7018:1
The PEERING ROUTER then prefers the injected route.
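The override step above can be sketched in Python. This is a hedged illustration, not Edge Fabric's code: `Route`, `make_override`, `router_best`, and `OVERRIDE_LOCAL_PREF` are hypothetical names, and the router's decision is reduced to LocalPref and AS path length. The value 50000 and the route attributes are the ones shown on the slides.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

OVERRIDE_LOCAL_PREF = 50000  # value shown on the slide for the injected route


@dataclass(frozen=True)
class Route:
    dest: str
    local_pref: int
    as_path: Tuple[int, ...]
    nexthop: str
    community: str


def make_override(alternate: Route) -> Route:
    # Copy the alternate route unchanged except for a LocalPref high enough to win.
    return replace(alternate, local_pref=OVERRIDE_LOCAL_PREF)


def router_best(rib: List[Route]) -> Route:
    # Simplified router decision: highest LocalPref, then shortest AS path.
    return max(rib, key=lambda r: (r.local_pref, -len(r.as_path)))


pni = Route("1.2.3.0/24", 500, (100,), "42.1.3.1", "100:1")
transit = Route("1.2.3.0/24", 200, (7018, 100), "201.2.4.12", "7018:1")

rib = [pni, transit]
before = router_best(rib)
rib.append(make_override(transit))  # stands in for the route injected over the BGP session
after = router_best(rib)
print(before.nexthop, "->", after.nexthop)
```

The router keeps running its ordinary decision process. The injected copy wins only because its LocalPref is the highest, so withdrawing it returns the prefix to the original best path.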
Split prefix traffic
The EF CONTROLLER and the peering router hold two routes to 1:2400::/24:
PEERING: Dest 1:2400::/24, LocalPref 500, ASPath 100, Nexthop 42.1.3.1, Community 100:1
TRANSIT: Dest 1:2400::/24, LocalPref 200, ASPath 7018,100, Nexthop 201.2.4.12, Community 7018:1
The controller injects a more-specific prefix that points at the transit next hop:
Dest 1:2400::/34, LocalPref 50000, ASPath 7018,100, Nexthop 201.2.4.12, Community 7018:1
Traffic to 1:2400::/34 then follows the injected route, while the rest of 1:2400::/24 stays on its original best path.
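The split above relies on longest-prefix match: a more-specific prefix takes precedence over the covering prefix for the addresses it contains. A small sketch with Python's standard `ipaddress` module, using the prefixes from the slide; `egress_for` and the "peering"/"transit" labels are hypothetical names for illustration.

```python
import ipaddress

covering = ipaddress.ip_network("1:2400::/24")
more_specific = ipaddress.ip_network("1:2400::/34")

# The injected prefix must sit inside the covering prefix for the split to apply.
assert more_specific.subnet_of(covering)


def egress_for(addr: str) -> str:
    # Longest-prefix match: the /34 override wins for the addresses it covers.
    ip = ipaddress.ip_address(addr)
    if ip in more_specific:
        return "transit"  # injected /34, LocalPref 50000, nexthop 201.2.4.12
    if ip in covering:
        return "peering"  # original /24 best path, LocalPref 500
    return "no route"


print(egress_for("1:2400::1"))
print(egress_for("1:24ff::1"))
```

Only the portion of the prefix that falls inside the more-specific route is detoured, which allows moving part of a prefix's traffic rather than all of it.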
SYSTEM ARCHITECTURE
Inputs to the Controller:
Interface Info (SNMP)
Traffic Rates (Netflow/Sflow)
BGP Routes (BMP)
Policy & Config
Topology Info (FBNet)
Controller output: Route Overrides ("prefix via v.x.y.z"), delivered by the BGP Injector to the Peering Routers
Audits to make it more robust: BMP Audit, Netflow Audit, Injector Audit, Route Audit
[Figure: 1-day plot of total egress capacity at PoP and total traffic at PoP, with capacity for iface@PoP, demand for iface@PoP, and traffic on iface@PoP w/ Edge Fabric]
Avoid packet drops while maintaining high link utilization
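To make the goal concrete, here is a simple greedy sketch in Python of how route overrides could be planned from the controller's inputs. It is an illustration under stated assumptions, not the algorithm Edge Fabric uses: `plan_overrides`, the 95% `THRESHOLD`, and the example numbers are all hypothetical.

```python
from typing import Dict, List

THRESHOLD = 0.95  # assumed utilization limit, not taken from the slides


def plan_overrides(
    capacity: Dict[str, float],    # interface -> capacity (Gbps)
    demand: Dict[str, float],      # prefix -> projected demand (Gbps)
    routes: Dict[str, List[str]],  # prefix -> interfaces, BGP-preferred first
) -> Dict[str, str]:
    # Project load as if every prefix used its BGP-preferred interface.
    load = {iface: 0.0 for iface in capacity}
    for prefix, rate in demand.items():
        load[routes[prefix][0]] += rate

    overrides: Dict[str, str] = {}
    for iface in capacity:
        limit = THRESHOLD * capacity[iface]
        # Consider the largest prefixes on this interface first.
        candidates = sorted(
            (p for p in demand if routes[p][0] == iface),
            key=lambda p: demand[p],
            reverse=True,
        )
        for prefix in candidates:
            if load[iface] <= limit:
                break  # interface is back under the limit
            for alt in routes[prefix][1:]:
                # Only detour if the alternate stays under its own limit.
                if load[alt] + demand[prefix] <= THRESHOLD * capacity[alt]:
                    load[iface] -= demand[prefix]
                    load[alt] += demand[prefix]
                    overrides[prefix] = alt
                    break
    return overrides


capacity = {"pni": 10.0, "transit": 100.0}
demand = {"a": 6.0, "b": 5.0, "c": 2.0}
routes = {p: ["pni", "transit"] for p in demand}
print(plan_overrides(capacity, demand, routes))
```

In this example the PNI would carry 13 Gbps against a 9.5 Gbps limit, so the sketch detours the largest prefix and leaves the others on the preferred link, keeping that link highly utilized without exceeding it.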
Looking beyond Facebook's network
[Figure: the BGP (static) vs. reality (dynamic) diagram shown again, with Facebook's network and the best BGP path from the PoP; conditions beyond Facebook's network are marked "?"]
Performance Routing: Alternative Path Measurements
[Figure: racks sending traffic to Network 1 over PNI, PX, and Transit links]
Collect TCP stats for transactions (RTT, packet loss, throughput)
This lets us monitor performance only on the primary path
Send a very small portion of traffic over alternate paths
Mark random flows with special DSCP values
Configure alternate routing tables per DSCP value
APM CONTROLLER: insert routes into the alternate routing tables
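The flow-marking step can be sketched as follows. This is a hedged Python illustration: the DSCP values in `ALTERNATE_DSCP`, the 1% `SAMPLE_FRACTION`, and the function names are assumptions, since the slides do not give the actual values or sampling rate.

```python
import random

# Hypothetical DSCP values, one per alternate routing table.
ALTERNATE_DSCP = [12, 13]
SAMPLE_FRACTION = 0.01  # assumed size of the "very small portion" of flows


def pick_dscp(rng: random.Random) -> int:
    # Most flows keep DSCP 0 and follow the primary path.
    if rng.random() >= SAMPLE_FRACTION:
        return 0
    # A sampled flow is tagged for one of the alternate routing tables.
    return rng.choice(ALTERNATE_DSCP)


def tos_byte(dscp: int) -> int:
    # DSCP occupies the upper six bits of the IPv4 TOS / IPv6 Traffic Class byte.
    return dscp << 2


rng = random.Random(7)
marks = [pick_dscp(rng) for _ in range(100_000)]
measured = sum(1 for m in marks if m != 0) / len(marks)
print(f"fraction of flows on alternate paths: {measured:.4f}")
```

On platforms that expose it, a host could apply the mark to a socket with `sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos_byte(dscp))`; the routers then match on the DSCP value to select the alternate routing table.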
Interesting Examples
Temporary congestion of the primary path
[Figure: throughput over 1 day for the primary path, alternate path 1, and alternate path 2]
Public Exchange Performance problem
[Figure: AS 32934 peering with AS 100, AS 200, AS 300, and AS 400 over a PX]
Peer’s capacity is unknown
Path Performance Monitoring Service: computes the peer’s effective capacity on the PX
Components: HTTP TCP Stats, BGP Routes, Traffic Rates, Stats Aggregator, Capacity limit computation
Infer how much traffic to send without overwhelming the peer
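The slides do not say how the capacity limit is computed. One simple possibility, shown here purely as an assumption, is to find the highest traffic rate at which performance stayed healthy before degradation was first observed. `infer_capacity`, `LOSS_LIMIT`, and the sample data are hypothetical.

```python
from typing import List, Optional, Tuple

LOSS_LIMIT = 0.01  # assumed acceptable loss rate


def infer_capacity(samples: List[Tuple[float, float]]) -> Optional[float]:
    """samples: (traffic rate to the peer in Gbps, observed loss rate)."""
    degraded = [rate for rate, loss in samples if loss > LOSS_LIMIT]
    if not degraded:
        return None  # never saw degradation, so no limit can be inferred
    knee = min(degraded)  # lowest rate at which loss exceeded the limit
    healthy_below = [
        rate for rate, loss in samples if loss <= LOSS_LIMIT and rate < knee
    ]
    # Highest rate that was still healthy below the knee, if any.
    return max(healthy_below) if healthy_below else None


samples = [(2.0, 0.001), (4.0, 0.002), (6.0, 0.003), (7.5, 0.04), (8.0, 0.09)]
print(infer_capacity(samples))
```

A limit obtained this way could be treated like an interface capacity for the peer, even though the peer's real port speed on the exchange is not visible.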
ENHANCE EDGE FABRIC W/ PERFORMANCE
Inputs to the Controller:
Interface Info (SNMP)
Traffic Rates (Netflow/Sflow)
BGP Routes (BMP)
Policy & Config
Topology Info (FBNet)
Performance Limits
Controller output: Route Overrides ("prefix via v.x.y.z"), delivered by the BGP Injector to the Peering Routers
Audits: BMP Audit, Netflow Audit, Injector Audit, Route Audit
Thanks