Overlay Networking with VXLAN
Frankie Lim @ Arista.com
Needs for an Overlay Networks Logical Network (aka “Overlay” Network) § Network Virtualization (SDN)
§ Abstracts the virtualized environment form the physical topology
§ Constructs Layer 2 tunnels across the physical infrastructure
§ Tunnels provide connectivity between physical and virtual end-points
Physical Network (aka “Underlay” Network) § Transparent to the overlay technology
§ Allows the building of L3 infrastructure – No L2
§ Physical provide the bandwidth and scale for the communication
§ Removes the scaling constraints of the physical from the virtual
Physical Infrastructure
Overlay Networks
Introducing VXLAN (RFC 7348) Virtual eXtensible LAN (VXLAN RFC 7348)
§ IETF framework proposal, co-authored by Arista, Broadcom, Cisco, Citrix Red Hat & VMware
Provides Layer 2 “Overlay Networks” on top of a Layer 3 network
§ “MAC in IP” Encapsulation
§ Layer 2 multi-point tunneling over IP UDP
Tunnel End-Points (VTEPs) perform encapsulation/decapsulation
§ In Software e.g. Hypervisor vSwitch
§ In Hardware e.g. Leaf Switches
Enables Layer 2 interconnection across Layer 3 boundaries
§ Transparent to the physical IP network
§ Provides Layer 2 scale across the Layer 3 IP fabric
§ Abstracts the Virtual connectivity from the physical IP infrastructure
§ e.g. Enables VMotion, L2 clusters etc. across standards based IP fabrics
VM-1 10.10.10.1/24
VM-2 20.20.20.1/24
VM-3 10.10.10.2/24
VM-4 20.20.20.2/24
ESX host ESX host
Subnet A
Layer 2 (e.g. for VM mobility, storage access, clustering etc.)
Across Layer 3 subnets
NAS 20.20.20.324
Load Balancer 10.10.10.3/24
Subnet B
VXLAN Terminology Virtual Tunnel End-point (VTEP)
§ Performs for VXLAN encapsulation & decapsulation of the native frame
§ Adds the the appropriate VXLAN header.
§ Can be implemented on software virtual switch or a physical switch.
Virtual Tunnel Identifier (VTI)
§ An IP interface used as the Source IP address for the encapsulated VXLAN traffic
§ The destination IP address for VXLAN encapsulated traffic
Virtual Network Identifier (VNI)
§ A 24-bit field added within the VXLAN header.
§ Identifies the Layer 2 segment of the encapsulated Ethernet frame
VXLAN Header
§ The IP/UDP VXLAN header added by the VTEP
§ Uses a UDP source port based on a hash of the inner frame to create entropy for ECMP
Software VTEP
Hardware���VTEPs
VTEP
IP address: x.x.x.x
VTI-A
VTI-B
VTI-C
VTEP
VTEP
IP address: z.z.z.z
IP address: y.y.y.y
VXLAN + IP/UDP header SRC IP VTI-A; DST IP VTI-C
Logical Layer 2 Network
VNI n.n
VXLAN Encapsulated Frame Format
§ Ethernet header uses local VTEP MAC and default router MAC (14 bytes plus 4 optional 802.1Q header)
§ The VXLAN encapsulation source/destination IP addresses are those of local/remote VTI (20 bytes) § UDP header, with SRC port hash of the inner Ethernets header, destination port IANA defined (8
bytes) • Allows for ECMP load-balancing across the network core which is VXLAN unaware.
§ 24-bit VNI to scale up to 16 million for the Layer 2 domain or “Virtual Wires” (8 bytes)
Src. ���MAC addr.
Dest. ���MAC addr. 802.1Q. Dest. IP Src. IP UDP
VNI (24 bits) Payload FCS
Src. ���MAC addr.
Dest. ���MAC addr.
Optional ���802.1Q.
Original Ethernet Payload (including any IP headers etc.)
VXLAN (IP-MAC) Encapsulation
Ethernet Frame
VXLAN Overlay Networks Fixed Configuration,
Active-Active Layer 3 design for scale, using well known
management tools/protocols
Flexible VTEP Edge, Mobile, agile, for flexible provisioning via Cloud
Management Platforms (CMP)
VXLAN Overlay Architecture configuration/flexibility at the edge, and transparency and fix configuration in the IP fabric
VXLAN VNI 10
VTEP VTEP
VXLAN VNI 20
VTEP
VLXAN VTEP within the Hypervisor vSwitch
§ VXLAN encapsulation & de-capsulation performed by the vSwitch • Encapsulation performed prior to packet hitting the “physical interface”
• Physical network is unaware of the encapsulated content
- Sees only IP headers
§ External routing via decapsulation ���on the software switch - Based on VNI to VLAN mapping
128.218.11.x 128.218.10.x
10.10.1.4 10.10.1.5 10.10.1.6
Locally Switched Traffic is done without
encap/decap
vSwitch is responsible for encapsulation &
decapsulation of VXLAN traffic between hosts
Software Router Responsible for external routing
Physical Infrastructure
Virtual Switch (VTEP)
SW VTEP: VNI to VLAN ���
translation Virtual Switch
(VTEP)
Switch based VXLAN Gateway Architecture
UDP 4729 VTI 1 10.10.1.1 VTEP
VNI 200 VNI 2000 VNI 20000
VLAN 100 VLAN 200 VLAN 300 VLAN 400 VLAN 500
Ethernet Ports���Port Channels
Ethernet Ports���Port Channels
Ethernet Ports���Port Channels
Ethernet Ports���Port Channels
Ethernet Ports���Port Channels
Local Devices Local Devices Local Devices Local Devices Local Devices
Ethernet Ports Port Channels Spine/Leaf Switch
Ports Ports
Point & Multi-Point Tunnel Service
UDP 4729
1 2.2.2.3
VTEP
VTEP Devices Ports
VLAN 100
Devices Ports VLAN 500
VTI 2.2.2..1
Devices Ports VLAN 200
Devices Ports VLAN 300
Devices Ports VLAN 400
VTEP Ports
Ports
VTI 1 2.2.2.2
Ports
Ports
Ports
UDP 4729
Devices
Devices
Devices
Devices
Devices
VNI 2000
VNI 200
VNI 2000
VNI 20000
VNI 200
VNI 2000
Ports
VNI 2000
UDP 4729
VLAN to VNI mappings are local to switch – inbuilt support for VLAN translation
VLAN 100
VLAN 500
VLAN 200
VLAN 300
VLAN 400
VLAN 400
Ports
Devices Devices Devices
VLAN 500
Ports
VLAN 300
Devices
Ports
VLAN 200
Devices
Ports
VLAN 100
Ports
Devices
Switch Switch
Switch
VXLAN – Control Plane
VXLAN Control Plane Options
§ SDN Controller or Controller-less § The VXLAN control plane is used for MAC learning and packet flooding
• Mechanism to discover hosts residing behind remote VTEPs
• How to discover VTEPs and their VNI membership
• The mechanism used to forward Broadcast and multicast traffic within the Layer 2 segment (VNI)
IP Multicast Control Plane
• VTEP join an associated IP multicast group (s) for the VNI(s)
• Unknown unicasts forwarded to VTEPs in the VNIs via IP multicast
• Support for Third-party VTEP(s)
• Flood and learn and requires IP multicast support – limited deployments
HeadEnd Replication (HER)
• BUM traffic replicated to each remote VTEPs in the VNIs
• Replication carried out on the ingress VTEP.
• Support for Third-party VTEP(s)
• MAC learning still via flood and learn but no requirement for IP multicast
HER with Controller • Local learnt MACs and VNI binding published to Controller
• Controller dynamically distributes state to remote VTEPs
• Support for Third-party VTEP(s)
• Dynamic MAC distribution, automated flood-list provisioning
• HA Cluster support for resiliency
eVPN Model • BGP used to distribute local MAC to IP bindings between VTEPs
• Broadcast traffic handled via IP multicast or HER models
• Dynamic MAC distribution and VNI learning, configuration can be BGP intensive
• Support for Third-party VTEP(s)
VXLAN BUM Forwarding and Learning…
§ The RFC Model • Remote VM MAC ßàVTEP association learnt via IP multicast
• VTEP with a given VNI joins associated (*,G) group
• Broadcast, Unknown & Multicast traffic for a VNI sent to the IP multicast group
• Local VTEP “learns” MAC to remote VTEP IP bonding
• Once bonded traffic is unicast via standard Layer 3 protocol
VM4@VNI10@VTEP-B VM5@VNI20@VTEP-B VM6@VNI30@VTEP-B VM7@VNI10@VTEP-C���VM8@VNI20@VTEP-C VM9@VNI30@VTEP-C
Multicast (*,G) tree for VNI 10 Multicast (*,G) tree for VNI 20 Multicast (*,G) tree for VNI 20
VM1 VNI10
VM2 VNI20
VM3 VNI300
VM4 VNI10
VM5 VNI20
VM6 VNI30
VM7 VNI10
VM8 VNI20
VM9 VNI30
VTEP-A
VTEP-B
VTEP-C
Requires an IP Multicast ���Enabled Physical Network!
Note: Arista supports single (*,G) group + HER. All other platforms use HER
Unicast to VTEP-4
VTEP flood list on VTEP-‐1 VNI 2000 à VTEP-‐3 VNI 2000 à VTEP-‐4
VTEP flood list on VTEP-‐3 VNI 2000 à VTEP-‐1 VNI 2000 à VTEP-‐4
VTEP flood list on VTEP-‐4 VNI 2000 à VTEP-‐1 VNI 2000 à VTEP-‐3
VTEP creates a unicast frame for each VTEP in
the flood-list of the specific VNI
BUM traffic
VTEP flood list manually configured on each VTEP for each VNI
BUM traffic received ���locally on VTEP
VTEP learns inner MAC and maps to the outer SRC IP (remote VTEP)
Separate unicast on the wire for each VTEP in the VNI
1
2
3
4
VTEP2
VTEP3
VTEP4
VNI 2000
VXLAN Head End Replication
VTEP1
Unicast to VTEP-3
Eliminates the need for an IP Multicast Enabled Physical Network!
VLAN 200 Eth 2 -‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐ VTEP Config Source-‐IP 1.1.1.2/32 VLAN 500 à VNI 2000
Overlay Network
§ VLAN to VNI mapping of a VTI is only locally significant • Local 802.1Q VLAN Tag is stripped prior to VXLAN encapsulation
• Allows for a single VLAN tag to be mapped to different VNIs on different switches • Providing VLAN translation across a VNI and scale beyond the traditional 4k+ VNIs
VLAN 20 Eth 2, Eth3 -‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐ VTEP config VLAN 20 à VNI 2000
VLAN 20 VLAN 500
VNI 2000
VLANs (1-3K) POD significant)
VLANs (1-3K) POD significant)
VLANs (1-3K) POD significant)
VNIs Mapping 5k-8K
VNIs Mapping 9k-12K
VNIs Mapping 12k-15K
Scaling beyond 4K VLANs
Across POD DC wide VNIs VLAN 3k +
VLAN Translation between VTEPs
VTEP VTEP
Eth 2 Eth 2 Eth 3
SDN Controllers for VXLAN
CVX + NSX • Centralized database of
physical infrastructure collected on CVX
• CVX state (MAC, VNIs, HW VTEPs) shared with NSX
• Centralized provisioning and controller via the NSX controller
• Solution for scalable dynamic DCs with HW to SW VTEP automation
• Advantages within an ESXi estate
CloudVision
eXchange
CVX + Nuage • Centralized database of
physical infrastructure collected on CVX
• CVX state (MAC, VNIs, HW VTEPs) shared with the VSC
• Centralized provisioning and controller via the VSC controller
• Solution for scalable dynamic DCs with HW to SW VTEP automation
• Targeted for a Zen, KVM estate
CVX + OpenStack • Centralized database of
physical infrastructure collected on CVX
• ML2 plugin for communication between CVX and OpenStack
• Provisioning of the physical infrastructure from OpenStack
• Solution for small to medium DCs with VTEP automation
• Targeted for a Zen, KVM estate
CloudVision
eXchange CloudVision
eXchange
OVSDB OVSDB ML2 plugin
VXLAN – Layer 2 Services
VXLAN Bridging
§ Provides layer 2 connectivity (P2P, P2M) over the layer 3 spine/leaf network § Allowing any-to-any Layer 2 connectivity between DCs, racks, servers, devices, VMs § Layer 2 connectivity provided by VXLAN encapsulation at the leaf nodes – VXLAN
VTEP(s)
Subnet/VLAN A
Subnet/VLAN B
Spine
Subnet/VLAN A
Subnet/VLAN B
VXLAN VNI – Layer 2
VXLAN VNI – Layer 2
Leaf
VTEP
VTEP VTEP VTEP
VXLAN Bridging Operation
§ Standard local switching via the VLAN configuration on the VTEP § Extend the Layer 2 domain by mapping the VLAN ID to the VXLAN VNI § VLAN to VNI mapping is only locally significant, VLAN tag is not carried in the
VXLAN frame
§ Host learnt on the remote VTEP, VXLAN encapsulated by the VTEP and routed to the remote VTEP
VLAN 10
MAC-1
MAC 2
Leaf-1
Serv-1 MAC-1
VLAN 10 à VNI 1010
802.1Q VLAN 10
L3 Backbone
VNI 1010
VNI 1010 -> VLAN 20
Serv-2 MAC-2
Leaf-2
Inner Eth Frame
VNI 1010
2.2.2.2
2.2.2.1
VLAN 20
MAC-1
MAC 2
802.1Q VLAN 20
Layer 2 Domain (eg, 193.10.10.0/24)
VTEP 2.2.2.2
Eth-49 Eth-1 VTEP
2.2.2.1
Eth-1 Eth-49
Active/Active Dual-
homing Rack-1
VXLAN Bridging – Resiliency with dual-homing
§ For host resiliency single Logical VTEP can be created across the Active/Active Dual-homing domain
§ Providing active-active VXLAN encap and decap across the two physical switches
VTI
VTI Eth-1
Eth-1 VTI
VTI Eth-1
Eth-1
L3 Backbone
VNI 1010
Inner Eth Frame
VNI 1010
2.2.2.2
2.2.2.1
Rack-2
VLAN 10
MAC-1
MAC 2
VLAN 20
MAC-1
MAC 2
VLAN 10 à VNI 1010 VNI 1010 -> VLAN 20
Serv-1 MAC-1
Serv-2 MAC-2
Layer 2 Domain
VTEP
2.
2.2.
2 VTEP 2.2.2.1
Eth-49
Eth-49
Eth-49
Eth-49
Leaf-11
Leaf-12
Leaf-21
Leaf-22 Active/Active Dual-
homing
S-VLAN to VNI mapping • Mapping of outer S-Tag to single
VNI • Inner C-Tag are transported within a
single VNI • The inner VLAN ID are carried on
VXLAN encap frame • Ability to transport all customer
VLANs across a single VXLAN point to point link
Switchport mode dot1q-tunnel
VXLAN Bridging - VLAN to VNI Service Mapping
VLAN to VNI mapping • One to One mapping between VLAN ID
and the VNI
• Mapping is only locally significant,
• VLAN ID not carried on VXLAN encap frame
• Allows VLAN translation between remote VTEPs
Port + VLAN to VNI mapping • Mapping traffic to a VNI based on a
combination of the ingress port and it VLAN-ID
• The VLAN ID is not carried on VXLAN encap frame
• Provides support for overlapping VLANs within a single VTEP to be mapped to different VNIs
Leaf-1 VNI 1020
VNI 1010
VLAN 10
VLAN 20
VLAN 10 -> VNI 1010
VLAN 20 à VNI 1020
Leaf-1 VTEP
VNI 1030 C-tag 10,20
VLAN 10,20
S-VLAN 30 -> VNI 1030
Leaf-1 VTEP
VLAN 10
Eth-1 VLAN 10 -> VNI 1010 Eth-2 VLAN 10 à VNI 1020
Eth-1
VLAN 10
Eth-2
VNI 1020
VNI 1010
VTEP Eth-1
VLAN 30
VXLAN Bridging – STP Behavior
§ STP BPDU’s are not transported across the VXLAN tunnel § Creating Separate STP domains within the local ports of each VTEP
Leaf-1
Serv-1
802.1Q VLAN 10
L3 Backbone
VNI 1010
Serv-2
Leaf-2
802.1Q VLAN 10
Layer 2 Domain
Spanning Tree Domain 1
STP BPDU
Root Bridge leaf 1
Cost 0 VLAN 10 à VNI
1010
VNI 1010 à VLAN 10
STP BPDU
Root Bridge leaf 2
Cost 0
Spanning Tree Domain 2
VTEP 2.2.2.2
VTEP 2.2.2.1
Eth-1 Eth-49 Eth-49 Eth-1
VXLAN Bridging – Quality of Service
§ Standard ingress policy used to define DSCP of outer frame § Trusted or Untrusted configuration of ingress interface used to derive outer CoS/DSCP
value § Any re-write action applied to only the inner frame NOT the outer frame
§ Outer CoS value derived from the Traffic Class map
Leaf-1
Eth-1 Eth-49
DSCP Trusted Interface
CS1 (8)
DSCP to TC mapping : CS1 à TC 0
CS1 (8)
Outer
CS1 (8)
inner
Leaf-1
Eth-1 Eth-49
DSCP Untrusted Interface (with Re-write)
CS4 (32)
DSCP to TC mapping: CS3 à TC 3 TC to DSCP Rewrite : TC 3 à AF21 (18)
CS3 (24)
Outer
AF21 (18)
inner
Default interface CoS = CS3 (24)
VTEP VTEP
VXLAN Bridging – Use Case 1
Interconnect Islands within the DC or across geographically disperse sites • Providing VM workload mobility within DC and inter DCs
• Workload migration, VM bursting (eg hybrid cloud), business continuity across DCs
DCI to provide Layer 2 connectivity between geographically disperse sites
Server migration POD interconnect for connectivity between DC’s PODs
Layer 2 Domain Layer 2 Domain
VNI
VNI 802.1Q
VTEP
802.1Q
VTEP
VXLAN Bridging - Use case 2
VXLAN as a Layer 2 Service within a Leaf Spine • Interconnect disperse subnets with Layer 3 to 7 services – NFV service chaining
• Providing a logical multi-tiered network regardless of physical location
Server Leaf Server Leaf Tenant L3 Node NFV Services Leaf
Firewall
Load-balancer Firewall
VNI 1010 VNI 1020
VNI 1030
Tenants logical C
onnectivity
VNI
Layer 2 VNI Layer 2
VNI
Spine
VTEP VTEP VTEP
VTEP
VXLAN – Layer 3 Services
VXLAN Routing VXLAN Bridging Model
§ Routing achieved via a centralized node
§ Requiring a dedicated routing node within the leaf-spine fabric
§ Sub-optimal traffic forwarding to traffic tromboning
VXLAN Routing model § Routing achieved at the leaf Layer VTEP nodes
§ No additional external routing nodes required
§ Optimized routing with the reducing of traffic tromboning
§ Not supported by MPLS VLL/VPLS
Server Leaf Dedicated L3 Node
VNI 1010 VNI 1030
Server Leaf Server Leaf
Spine
Server Leaf
VNI 1010
VNI 1020 Route directly at the leaf
Server Leaf Server Leaf
Dedicated Router, sub-optimal forwarding Routing at the leaf, providing optimal forwarding
VTEP VTEP VTEP VTEP VTEP
Spine
VTEP VTEP VTEP VTEP
What is VXLAN Routing?
§ SVI configured on the VLAN which is VXLAN enabled § SVI can be placed in a non-default VRF to support overlapping IPs and multi-
tenancy § Note VXLAN routing support is required on the platform even when next-hop
host(s) are local
Serv-1 10.10.10.100
GW 10.10.10.1
SVI VLAN 10 10.10.10.1
802.1Q VLAN 10
SVI VLAN 20 10.10.20.1 Serv-2
10.10.20.100 GW 10.10.20.1
VNI 1020
VXLAN Bridging
Routing + VXLAN Encap
802.1Q VLAN 20 VTEP
2.2.2.2 VTEP
2.2.2.1
Leaf-1 Leaf-2
VXLAN Routing - Operation
10.10.10.100
10.10.20.100
VLAN 10
MAC-1
MAC -3
VNI 1020
10.10.10.100
10.10.20.100
VLAN 10
MAC-4
MAC-2
VNI 1020
2.2.2.2
2.2.2.1
10.10.10.100
10.10.20.100
VLAN 20
MAC-4
MAC -2
1. SVI 10 Gateway for Serv-1. Routes packet into subnet 10.10.20.0, resulting in a Src MAC of MAC-4 and Dest MAC of MAC-2
10.10.10.100
10.10.20.100
VLAN 20
MAC-4
MAC-2
2. VTEP-1 learns Dest MAC (MAC-2) via remote VTEP=2 (2.2.2.2). VXLAN encaps the frame with a Dest-IP of 2.2.2.2
3. VTEP-2 maps VNI 1020 to VLAN 20. MAC lookup of MAC-2 points to Eth-6. VXLAN header removed and forwarded to Serv-2
4. Packet forward to Serv-2 tagged based on the Local VLAN to VNI mapping
Serv-1 10.10.10.100
GW 10.10.10.1 MAC-1
Serv-2 10.10.20.100
GW 10.10.20.1 MAC-2
802.1Q VLAN 20
802.1Q VLAN 10
SVI VLAN 10 10.10.10.1
MAC-3
SVI VLAN 20 10.10.20.1
MAC-4
VNI 1020
VNI 1020 à VLAN 20
VXLAN Bridging
VTEP-1 2.2.2.1
VTEP-2 2.2.2.2
VXLAN Routing - Forwarding models for Trident2 platform
§ Single re-circulation required. • 1st pass of ASIC to route frame
• 2nd pass of ASIC for VXLAN encapsulation
VXLAN Routing – Route and VXLAN encapsulation Local host to a remote host
VLAN 10 à VLAN 20 à VNI 1020
VLAN 10
VXLAN Routing – VXLAN de-encapsulate and route Remote host routed to a local host and switch is the DFG for the remote host
VXLAN Routing – VXLAN de-encapsulate, route and VXLAN encapsulate Switch is the DFG for two remote hosts on different subnets
§ Two re-circulations required. • 1st pass of ASIC for VXLAN de-capsulation • 2nd pass of ASIC to route of inner frame
• 3rd pass of ASIC for VXLAN encapsulation
§ Single re-circulation required. • 1st pass of ASIC for VXLAN de-
capsulation • 2nd pass of ASIC to route of inner frame
VLAN 10 ß VLAN 20ß VNI 1020
VLAN 10
VNI 1010 à VLAN 10 à VLAN 20 à VNI 1020
VNI 1010
VNI 1020
VNI 1020
VNI 1020
VXLAN Routing – Forwarding Models
Intel Fulcrum Alta platforms • All VXLAN routing functionality is achieved in a single pass • No need for recirculation ports
Broadcom Trident2, Tomahawk • All VXLAN routing functionality is achieved in mixed single and double passes • Need for recirculation ports
Broadcom Trident2+, ARAD, Jericho platforms • All VXLAN routing functionality is achieved in a single pass • No need for recirculation ports
Summary VXLAN
§ Open standard RFC 7348 – multivendor support on software or hardware § L2 extension over L3 network
• More reliable & scalable than L2 only QinQ, TRILL and PBB
§ L2 over L3 services using switching TCO vs router MPLS TCO § VXLAN VTEP at host, VM, spine/leaf switches, load balancer – flexibility for
users and service providers § Preference on hardware based VXLAN - performance § Use cases
• L2 extension over L3 routing network. MPLS not needed.
• Data Center Interconnect (DCI) for active-active DC • Multi tender services chaining in hosted DC
Thank-you
Frankie Lim @ Arista.com