Scaling bridge forwarding database Roopa Prabhu, Nikolay Aleksandrov


Page 1: Scaling bridge forwarding database - Linux kernel (vger.kernel.org/lpc_net2018_talks/scaling_bridge_fdb_database_slidesV2.pdf)

Scaling bridge forwarding database

Roopa Prabhu, Nikolay Aleksandrov

Page 2

Agenda

● Linux bridge forwarding database: quick overview

● Linux bridge deployments at scale: focus on multihoming

● Scaling bridge database: challenges and solutions

Page 3

Bridge fdb entries

[Figure: switch1 runs a bridge with ports swp1 and swp2; host H1 <M1> is on swp1 and host H2 <M2> is on swp2]

FDB:
<M1> dev swp1 vlan 10
<M2> dev swp2 vlan 10

• Entry format: [<mac> <vlan> <dst_port>]

• Flood and learn (most basic case)

• End point orchestrator/provisioning controller based FDB programming

• Control plane learning:
  ▪ Local or distributed
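The flood and learn case can be sketched as a small model. This is only an illustration of the [<mac> <vlan> <dst_port>] entry format, not the kernel implementation; the `Bridge` class, its `receive` method and the third port `swp3` are hypothetical names introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Bridge:
    """Toy learning bridge; the fdb maps (mac, vlan) to a destination port."""
    ports: list
    fdb: dict = field(default_factory=dict)

    def receive(self, in_port, src_mac, dst_mac, vlan):
        """Learn the source mac, then return the list of egress ports."""
        # Learning: the source mac is reachable through the ingress port.
        self.fdb[(src_mac, vlan)] = in_port
        out_port = self.fdb.get((dst_mac, vlan))
        if out_port is None:
            # Unknown unicast: flood to every port except the ingress port.
            return [p for p in self.ports if p != in_port]
        if out_port == in_port:
            return []  # destination sits behind the ingress port
        return [out_port]


# swp3 is a hypothetical extra port so that flooding is visible.
br = Bridge(ports=["swp1", "swp2", "swp3"])
first = br.receive("swp1", "M1", "M2", 10)  # M2 not learned yet: flooded
reply = br.receive("swp2", "M2", "M1", 10)  # M1 was learned on swp1
again = br.receive("swp1", "M1", "M2", 10)  # M2 is now known
```

After the three frames the fdb holds the same two entries as the figure: M1 on swp1 and M2 on swp2, both in vlan 10.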

Page 4

Bridge fdb entries: network virtualization (overlay: eg vxlan)

● Overlay macs point to overlay termination end-points
● Eg Vxlan tunnel termination endpoints (VTEPs)
  ○ Vxlan fdb extends bridge fdb
  ○ Vxlan fdb carries remote dst info
  ○ [ <mac> <vni> <remote_dst list> ]
    ■ Where remote_dst_list = remote overlay endpoint IPs
    ■ Pkt is replicated to list of remote_dsts
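A minimal sketch of the [<mac> <vni> <remote_dst list>] entry and the replication step follows. The function name `replicate`, the mac `M9` and the address 27.0.0.9 are hypothetical and only serve the illustration.

```python
def replicate(vxlan_fdb, mac, vni, payload):
    """Return one (remote_ip, vni, payload) copy per remote dst of the entry."""
    remote_dsts = vxlan_fdb.get((mac, vni), [])
    return [(remote_ip, vni, payload) for remote_ip in remote_dsts]


# [<mac> <vni> <remote_dst list>]; M9 and 27.0.0.9 are hypothetical.
vxlan_fdb = {
    ("M3", 10): ["27.0.0.8"],
    ("M9", 10): ["27.0.0.8", "27.0.0.9"],
}
copies = replicate(vxlan_fdb, "M9", 10, b"frame")
```

An entry with a single remote dst yields one copy; an entry with several yields one copy per remote end-point.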

Page 5

Bridge fdb entries: overlay example

[Figure: switch1 (27.0.0.7) and switch2 (27.0.0.8) each run a bridge with ports swp1, swp2 and vxlan-10, connected through the Vxlan overlay. switch1 hosts: H1 <M1>, H2 <M2>. switch2 hosts: H1 <M3>, H2 <M4>]

switch1 (27.0.0.7)

FDB:
<M1> dev swp1 vlan 10
<M2> dev swp2 vlan 10
<M3> dev vxlan-10 vlan 10
<M4> dev vxlan-10 vlan 10

Vxlan FDB:
<M3> dev vxlan-10 dst 27.0.0.8
<M4> dev vxlan-10 dst 27.0.0.8

switch2 (27.0.0.8)

FDB:
<M3> dev swp1 vlan 10
<M4> dev swp2 vlan 10
<M1> dev vxlan-10 vlan 10
<M2> dev vxlan-10 vlan 10

Vxlan FDB:
<M1> dev vxlan-10 dst 27.0.0.7
<M2> dev vxlan-10 dst 27.0.0.7

● switch1: M1 and M2 are local macs. M3 and M4 are remote macs
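The two tables on switch1 can be read as a two-stage lookup: the bridge fdb picks the port, and for the vxlan port the vxlan fdb picks the remote end-point. The sketch below models that with plain dictionaries using the entries from this slide; the `resolve` function and its return shape are assumptions made for the illustration.

```python
# switch1 bridge fdb: (mac, vlan) -> port
BRIDGE_FDB = {
    ("M1", 10): "swp1",
    ("M2", 10): "swp2",
    ("M3", 10): "vxlan-10",
    ("M4", 10): "vxlan-10",
}
# switch1 vxlan fdb: (mac, vxlan device) -> remote end-point ip
VXLAN_FDB = {
    ("M3", "vxlan-10"): "27.0.0.8",
    ("M4", "vxlan-10"): "27.0.0.8",
}


def resolve(mac, vlan):
    """Return (kind, port, remote_ip) for a destination mac on switch1."""
    port = BRIDGE_FDB.get((mac, vlan))
    if port is None:
        return ("flood", None, None)
    remote_ip = VXLAN_FDB.get((mac, port))
    if remote_ip is None:
        return ("local", port, None)
    return ("remote", port, remote_ip)


local = resolve("M1", 10)
remote = resolve("M3", 10)
unknown = resolve("M7", 10)  # M7 is a hypothetical unlearned mac
```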

Page 6

Bridge fdb database scale

Page 7

Bridging scale on a data center switch

• Layer-2 gateway

• Bridging accelerated by hardware
  ▪ Learning in hardware
  ▪ Flooding in hardware and software

• IGMP snooping + optimized multicast forwarding

• Bridging larger L2 domains with overlays (eg vxlan)

• Multihoming: Bridging with distributed state

Page 8

Layer-2 gateway in a datacenter architecture

[Figure: datacenter topology with SPINE and LEAF (TOR) tiers; the leaf tier is labeled as the Layer2-3 boundary and the Layer-2 gateway]

Page 9

Bridge fdb performance parameters at scale

• Learning
• Adding, deleting and updating fdb entries
• Reduce flooding
• Optimized Broadcast-Multicast-Unknown unicast handling
• Convergence and failure handling

Page 10

Multihoming

Page 11

Multihoming

• Multihoming is the practice of connecting a host or a network to more than one network (device)
  ▪ To increase reliability and performance

• For the purpose of this discussion, let’s just say it’s a “Cluster of switches running Linux” providing redundancy to hosts

Page 12

Common functions of a multihoming solution

• Provide redundant paths to multihomed end-points

• Faster network convergence in event of failures:
  ▪ Establish alternate redundant paths and move to them faster

• Distributed state:
  ▪ Reduce flooding of unknown unicast, broadcast and multicast traffic regardless of which switch is active:
    • By keeping forwarding database in sync between peers
    • By keeping multicast forwarding database in sync between peers

Page 13

Multihoming: dedicated link

[Figure: switch1 and switch2 connected by a peerlink, with Host1 and Host2 below them]

● Dedicated physical link (peerlink) between switches to sync multihoming state

● Hosts are connected to both switches

Page 14

Multihoming: bridge: dedicated link

[Figure: switch1 and switch2 each run a bridge with ports swp1, swp2 and peerlink; hosts H1 (M1) and H2 (M2) each attach through bond0 over eth0 and eth1]

● Peerlink is a bridge port

● Fdb entries to host point to host port: <M1> dev swp1

● On swp1 failure, the fdb entry is moved to peerlink: <M1> dev peerlink

Page 15

Network convergence during failures

• Multihoming control plane reprograms the fdb database:
  ▪ Update fdb entries to point to peer switch link
  ▪ Uses bridge fdb replace
  ▪ Restore when network failure is fixed

• Problems:
  ▪ Too many fdb updates and netlink notifications
  ▪ Affects convergence
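To make the cost concrete, here is a rough model of the per-entry reprogramming. It assumes one replace operation and one notification per fdb entry on the failed port; the `fail_over` function and the entry count of 1000 are hypothetical.

```python
def fail_over(fdb, failed_port, peer_port):
    """Repoint every entry on failed_port to peer_port; return the update count."""
    updates = 0
    for key, port in fdb.items():
        if port == failed_port:
            fdb[key] = peer_port  # models one "bridge fdb replace" per entry
            updates += 1          # and one netlink notification per entry
    return updates


# 1000 hypothetical macs learned on swp1, one on swp2.
fdb = {(f"mac{i}", 10): "swp1" for i in range(1000)}
fdb[("macX", 10)] = "swp2"
n_updates = fail_over(fdb, "swp1", "peerlink")
```

The number of updates grows linearly with the number of macs behind the failed port, and the same work is repeated when the link is restored.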

Page 16

Bridge port backup port

• For faster network convergence:
  ▪ Peer link is the static backup port for all host bridge ports
  ▪ Make peer link the backup port at config time:
    • bridge seamlessly redirects traffic to backup port
  ▪ Patch [1] does just that

Page 17

Per Bridge backup port [1]

Before:

$ bridge fdb show
mac1 dev swp1

/* On swp1 link failure event, control plane updates each fdb entry to point to peerlink */

$ bridge fdb show
mac1 dev peerlink

After:

Bridge port swp1 has peerlink as backup port:

$ ip link set dev swp1 type bridge_slave backup_port peerlink

$ bridge fdb show
mac1 dev swp1

/* On swp1 link failure event, kernel implicitly forwards traffic to backup port peerlink. No change to fdb entry */

$ bridge fdb show
mac1 dev swp1
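The "After" behaviour can be modelled as a check at forwarding time rather than a change to the table. This is a simplified sketch, not the kernel code; the `Port` class and the `egress` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Port:
    name: str
    up: bool = True
    backup: Optional["Port"] = None  # models the backup_port setting


def egress(fdb, mac):
    """Pick the egress port name for a mac without touching the fdb."""
    port = fdb[mac]
    if not port.up and port.backup is not None:
        return port.backup.name  # redirect at forwarding time
    return port.name


peerlink = Port("peerlink")
swp1 = Port("swp1", backup=peerlink)
fdb = {"mac1": swp1}

before_failure = egress(fdb, "mac1")
swp1.up = False  # link failure on swp1
after_failure = egress(fdb, "mac1")
```

The fdb entry still points at swp1 after the failure, so recovery needs no fdb writes either: once the port is up again, traffic follows the original entry.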

Page 18

Future enhancements

Debuggability:

• Fdb dumps to carry indication that backup port is active

Page 19

Multihoming: network overlay

[Figure: switch0, switch1 and switch2 with attached hosts host1 and host2, connected through the overlay]

Page 20

Multihoming with network virtualization

• No dedicated link between the clustered switches in a multihomed environment

• Dedicated switch peer-link is now replaced by the overlay
  • Eg a vxlan tunnel port in a vxlan environment

• More than 2 switches in a cluster

• In the active-active case, more than one remote dst in the underlay:
  • mac <remote-end-point-underlay-ip-list>
  • Requires mac ECMP
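One way to picture mac ECMP is a per-flow hash that selects a single remote dst from the list, so a flow stays on one path. The sketch below is an assumption about how such a selection could look; the `ecmp_pick` function, the CRC32 hash and the address 27.0.0.9 are not from the slides.

```python
import zlib


def ecmp_pick(remote_dsts, src_mac, dst_mac):
    """Pick one remote dst per flow; the same flow always maps to the same dst."""
    flow_key = f"{src_mac}>{dst_mac}".encode()
    return remote_dsts[zlib.crc32(flow_key) % len(remote_dsts)]


# mac <remote-end-point-underlay-ip-list>; 27.0.0.9 is a hypothetical second peer.
remote_dsts = ["27.0.0.8", "27.0.0.9"]
choice = ecmp_pick(remote_dsts, "M1", "M3")
```

Unlike the replication case, only one copy of the packet is sent, which is why the active-active case needs a distinct ECMP behaviour.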

Page 21

Multihoming: network overlay

[Figure: switch1, switch2 and switch3 each run a bridge with ports swp1, swp2 and vxlan0; hosts H1 (M1), H2 (M2) and H3 (M3) each attach through bond0 over eth0 and eth1. Other labels in the figure: switch0, overlay, swp3]

Page 22

Control plane strategies for faster convergence

• Designated forwarder: avoid duplicating pkts [2,3]
• Split horizon checks [4]
• Aliasing: instead of distributing all macs and withdrawing during failures, infer from membership advertisements [5]

Page 23

Forwarding database changes for faster convergence

• Backup port: to redirect traffic to network overlay on failure [1]

• Mac dst groups:
  ▪ where dst is an overlay end-point
  ▪ Allow faster updates to mac dst groups (next slide)

Page 24

MAC dst groups

• At this scale, we start thinking of MACs as routes
• Mac points to dst group
• Dst groups can be ECMP or replication groups
• Ability to update macs and dst groups separately is a huge win
  ▪ Similar to recent updates to the routing API [6]

Page 25

New way to look at overlay FDB entry: dst groups

Current vxlan fwding database

Eg: Vxlan fdb entry:

mac, vni -> remote vni, remote_ip
            remote vni, remote_ip
            remote vni, remote_ip

New proposed vxlan fwding database

Eg: Vxlan fdb entry:

mac, vni -> dst_grp_id

Dst group db:

dst_grp (id) -> remote vni, remote_ip
                remote vni, remote_ip
                remote vni, remote_ip
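The benefit of the indirection can be sketched with two dictionaries: fdb entries hold only a group id, and the group holds the remote dsts. This is a toy model under stated assumptions; the names `dst_groups`, `lookup` and `withdraw`, the group id 100, the address 27.0.0.9 and the count of 1000 macs are all hypothetical.

```python
# Dst group db: dst_grp (id) -> mode and list of (remote vni, remote_ip)
dst_groups = {
    100: {"mode": "ecmp", "dsts": [(10, "27.0.0.8"), (10, "27.0.0.9")]},
}
# Vxlan fdb: (mac, vni) -> dst_grp_id
fdb = {(f"mac{i}", 10): 100 for i in range(1000)}


def lookup(mac, vni):
    """Resolve a mac through its dst group to the list of remote dsts."""
    return dst_groups[fdb[(mac, vni)]]["dsts"]


def withdraw(grp_id, remote_ip):
    """Remove one remote end-point from a group; return the number of updates."""
    grp = dst_groups[grp_id]
    grp["dsts"] = [dst for dst in grp["dsts"] if dst[1] != remote_ip]
    return 1  # a single group update, independent of how many macs use it


n_updates = withdraw(100, "27.0.0.9")
```

With the current per-entry remote dst list, the same withdrawal would touch all 1000 fdb entries; here the entries are left unchanged and every mac picks up the new group contents on its next lookup.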

Page 26

Fdb database API update

New fdb netlink attribute to link an fdb entry to a dst group:

• NDA_DST_GRP

Page 27

New dst group API

To create/delete/update a dst group: RTM_NEW_DSTGRP/RTM_DEL_DSTGRP/RTM_GET_DSTGRP

/* Attributes of a dst group */
enum {
	NDA_DST_GROUP_UNSPEC,
	NDA_DST_GROUP_ID,
	NDA_DST_GROUP_FLAGS,
	NDA_DST_GROUP_ENTRY,
	__NDA_DST_GROUP_MAX,
};

#define NDA_DST_GROUP_MAX (__NDA_DST_GROUP_MAX - 1)

/* Attributes of one dst entry inside a group */
enum {
	NDA_DST_UNSPEC,
	NDA_DST_IP,
	NDA_DST_IFINDEX,
	NDA_DST_VNI,
	NDA_DST_PORT,
	__NDA_DST_MAX,
};

#define NDA_DST_MAX (__NDA_DST_MAX - 1)

/* Dst group flags */
#define NTF_DST_GROUP_REPLICATION 0x01
#define NTF_DST_GROUP_ECMP 0x02
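As an illustration of how these attributes might be laid out on the wire, the sketch below packs a dst group using the standard netlink attribute framing (16-bit length, 16-bit type, payload padded to 4 bytes, nested attributes flagged with NLA_F_NESTED). The attribute numbers follow the enum order above. The payload types (u32 id, u32 flags, IPv4 address, u32 vni), the nesting of NDA_DST_* inside NDA_DST_GROUP_ENTRY and the helper names are assumptions, since the slides describe a proposal and do not define them. Nothing is sent to the kernel.

```python
import socket
import struct

NLA_F_NESTED = 1 << 15  # nested-attribute flag from linux/netlink.h

# Values implied by the enum order on the slide (a proposal, not a merged uAPI).
NDA_DST_GROUP_ID, NDA_DST_GROUP_FLAGS, NDA_DST_GROUP_ENTRY = 1, 2, 3
NDA_DST_IP, NDA_DST_IFINDEX, NDA_DST_VNI, NDA_DST_PORT = 1, 2, 3, 4
NTF_DST_GROUP_ECMP = 0x02


def nla(attr_type, payload):
    """Pack one netlink attribute: u16 len, u16 type, payload, pad to 4 bytes."""
    length = 4 + len(payload)
    pad = (4 - length % 4) % 4
    return struct.pack("=HH", length, attr_type) + payload + b"\0" * pad


def dst_entry(remote_ip, vni):
    """Pack one group member as a nested attribute (assumed layout)."""
    inner = nla(NDA_DST_IP, socket.inet_aton(remote_ip))
    inner += nla(NDA_DST_VNI, struct.pack("=I", vni))
    return nla(NDA_DST_GROUP_ENTRY | NLA_F_NESTED, inner)


def dst_group(group_id, flags, entries):
    """Pack the attribute section of a hypothetical RTM_NEW_DSTGRP request."""
    body = nla(NDA_DST_GROUP_ID, struct.pack("=I", group_id))
    body += nla(NDA_DST_GROUP_FLAGS, struct.pack("=I", flags))
    for remote_ip, vni in entries:
        body += dst_entry(remote_ip, vni)
    return body


# Group id 7 and 27.0.0.9 are hypothetical values.
attrs = dst_group(7, NTF_DST_GROUP_ECMP, [("27.0.0.8", 10), ("27.0.0.9", 10)])
```

An fdb entry would then reference the group by carrying only the group id in NDA_DST_GRP, rather than repeating the list of remote dsts.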

Page 28

Other considerations for the dstgrp api

• Investigating possible re-use of route nexthop API [6]

Page 29

Acknowledgements

We would like to thank Wilson Kok, Anuradha Karuppiah, Vivek Venkataraman and Balki Ramakrishnan for discussion, knowledge and requirements for building better Multihoming solutions on Linux.

Page 30

References

[1] net: bridge: add support for backup port: https://patchwork.ozlabs.org/cover/947461/

[2] E-VPN Multihoming: https://tools.ietf.org/html/rfc7432#section-8

[3] E-VPN Multihoming: Fast convergence: https://tools.ietf.org/html/rfc7432#section-8.2

[4] E-VPN multihoming split horizon: https://tools.ietf.org/html/rfc7432#section-8.3

[5] E-VPN Aliasing and Backup Path: https://tools.ietf.org/html/rfc7432#section-8.4

[6] Nexthop groups: https://lwn.net/Articles/763950/

Page 31

Thank you