vlans in the linux kernel

VLANs in Linux kernel how simple things might get quite complicated Jiří Pírko <[email protected]>

Upload: kernel-tlv

Post on 14-Jan-2017


Page 1: VLANs in the Linux Kernel

VLANs in Linux kernel
how simple things might get quite complicated

Jiří Pírko

Page 2: VLANs in the Linux Kernel

Who am I?
● A Linux kernel developer/network developer
● First patch accepted to Linux kernel in October 2008 - book name fix in documentation :-)
● Author of a bonding driver replacement - team driver and libteam (http://libteam.org)
● Founder of automated and portable network testing framework called LNST (http://lnst-project.org)
● Started a “true open switch” initiative called switchdev
● Co-author of rocker qemu switch implementation and rocker driver
● Co-author of mlxsw - driver for Mellanox SwitchX-2 and Spectrum ASICs

Page 3: VLANs in the Linux Kernel

VLAN use-case - problem

[Diagram: two switches, each with Port 1 to Port 4; Coca-Cola and Pepsi are attached to both.]

Page 4: VLANs in the Linux Kernel

VLAN use-case - solution

[Diagram: the same two switches, each with Port 1 to Port 4; Coca-Cola and Pepsi are attached to both.]

VLAN ID 100 - Coca-Cola
VLAN ID 200 - Pepsi

Page 5: VLANs in the Linux Kernel

802.1Q VLAN packets

Packet format:

  Untagged: Destination MAC | Source MAC | EtherType/Size | Payload
  Tagged:   Destination MAC | Source MAC | 802.1Q header | EtherType/Size | Payload

802.1Q header format:

  TPID (16 bits) | TCI: PCP (3 bits) | DEI (1 bit) | VID (12 bits)

● TPID (Tag protocol identifier): In the same position as EtherType/Size. It is set to 0x8100, which identifies an 802.1Q tagged packet and distinguishes it from untagged packets
● TCI (Tag control information)
  ○ PCP (Priority code point): Priority according to 802.1p, 7 is highest. Used for QoS
  ○ DEI (Drop eligible indicator): Formerly CFI. Indicates whether the packet is suitable for being dropped in case of congestion
  ○ VID (VLAN identifier): Specifies the VLAN to which the packet belongs. Values are in range 0-4094 (4095 is reserved). Value 0 has a special meaning: it indicates that the packet does not belong to any VLAN. The purpose of that is to allow PCP to be used for non-VLAN packets
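The header layout above can be illustrated with a small userspace sketch. This is not kernel code: the struct vlan_tag type and the tci_decode(), tci_encode() and frame_get_vlan() helpers are hypothetical names made up for this example, and the frame is assumed to be a plain byte buffer that starts at the destination MAC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TPID_8021Q 0x8100u

/* Hypothetical container for the three TCI fields. */
struct vlan_tag {
    uint8_t  pcp; /* priority code point, 3 bits */
    uint8_t  dei; /* drop eligible indicator, 1 bit */
    uint16_t vid; /* VLAN identifier, 12 bits */
};

/* Split a 16-bit TCI (host byte order) into PCP, DEI and VID. */
struct vlan_tag tci_decode(uint16_t tci)
{
    struct vlan_tag t;

    t.pcp = (uint8_t)(tci >> 13);
    t.dei = (uint8_t)((tci >> 12) & 0x1);
    t.vid = (uint16_t)(tci & 0x0fff);
    return t;
}

/* Pack PCP, DEI and VID back into a 16-bit TCI. */
uint16_t tci_encode(struct vlan_tag t)
{
    assert(t.pcp <= 7 && t.dei <= 1 && t.vid <= 0x0fff);
    return (uint16_t)((t.pcp << 13) | (t.dei << 12) | t.vid);
}

/*
 * Return 1 and fill *out if bytes 12-13 of the frame hold TPID 0x8100,
 * return 0 for untagged or too-short frames.
 */
int frame_get_vlan(const uint8_t *frame, size_t len, struct vlan_tag *out)
{
    uint16_t tpid;

    if (len < 18) /* 2 x MAC + 802.1Q header + EtherType */
        return 0;
    tpid = (uint16_t)((frame[12] << 8) | frame[13]);
    if (tpid != TPID_8021Q)
        return 0;
    *out = tci_decode((uint16_t)((frame[14] << 8) | frame[15]));
    return 1;
}
```

For example, a TCI of 0xA064 decodes to PCP 5, DEI 0 and VID 100.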

Page 6: VLANs in the Linux Kernel

Used terms and colors
● struct net_device *dev
  ○ Referred to as dev, skb->dev
  ○ One instance for each network device
● struct sk_buff *skb
  ○ Referred to as skb
  ○ One instance for every incoming and outgoing packet
● struct net_device_ops *ops
  ○ Referred to as ops, dev->ops, ndos (net_device ops)
  ○ Set of callbacks that each driver defines for core to call
● Vlan data path - red
● Vlan accelerated data path - pink

Page 7: VLANs in the Linux Kernel

VLAN userspace interfaces in Linux kernel
● Ioctl-based
  ○ Introduced along with the initial VLAN implementation in 2002
  ○ Userspace tool is called vconfig:

    # vconfig add eth0 100
    Added VLAN with VID == 100 to IF -:eth0:-
    # ip address add 192.168.0.1/24 dev eth0.100

● Netlink-based
  ○ Introduced by the following commit:

    commit 07b5b17e157b7018d0ca40ca0d1581a23096fb45
    Author: Patrick McHardy
    Date: Wed Jun 13 12:07:54 2007 -0700

        [VLAN]: Use rtnl_link API

  ○ Extends use of the ip tool (a part of the iproute2 package):

    # ip link add link eth0 name eth0.100 type vlan id 100
    # ip address add 192.168.0.1/24 dev eth0.100

Page 8: VLANs in the Linux Kernel

Simplified RX path of packet in Linux kernel

[Diagram: NIC driver (eth0): RX ring buffer desc -> create skb -> RX queue enqueue. Network core: RX queue dequeue -> packet type “all” taps (packet socket, e.g. tcpdump) -> hooks (rx_handler: bridge, bonding, team, macvlan, openvswitch, ...) -> packet type handlers (ARP, IPv4, IPv6, ...).]

Page 9: VLANs in the Linux Kernel

Simplified TX path of packet in Linux kernel

[Diagram: Network core: ARP, IPv4, IPv6, ... and packet sockets create skb -> dev_queue_xmit() -> enqueue/schedule -> dev_queue_xmit_nit() (packet socket, e.g. tcpdump) -> ndo_start_xmit. NIC driver (eth0): TX ring buffer desc.]

Page 10: VLANs in the Linux Kernel

Initial VLAN implementation
● Merged in February 2002
● Author: Ben Greear
● One net_device per VID
  ○ eth0 - real device
  ○ eth0.100 - vlan device for VID 100
  ○ eth0.200 - vlan device for VID 200
● On RX:
  ○ Hook on ETH_P_8021Q (0x8100) packet type with dev_add_pack()
  ○ Look up the vlan net_device and adjust skb->dev accordingly
  ○ Reinject to RX path
● On TX:
  ○ Implement ops->ndo_start_xmit (was dev->hard_start_xmit at that time)
  ○ Get real device and set it to skb->dev
  ○ Reinject to TX path
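The RX steps above (match type 0x8100, look up the vlan device for the VID, pop the header, switch the device) can be modelled in userspace roughly as follows. This is only a sketch under simplifying assumptions: struct model_dev, struct model_pkt, vlan_find() and vlan_rx() are hypothetical stand-ins for struct net_device, struct sk_buff and the kernel's handler, and the vlan devices live in a flat array instead of the kernel's own lookup structures.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct net_device. */
struct model_dev {
    const char *name;
    uint16_t    vid; /* 0 for the real device */
};

/* Hypothetical stand-in for struct sk_buff. */
struct model_pkt {
    uint8_t                 data[64];
    size_t                  len;
    const struct model_dev *dev;
};

/* Find the vlan device configured for a VID, or NULL if there is none. */
const struct model_dev *vlan_find(const struct model_dev *vlans, size_t n,
                                  uint16_t vid)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (vlans[i].vid == vid)
            return &vlans[i];
    return NULL;
}

/*
 * Model of the 0x8100 packet type handler: pop the 4-byte vlan header and
 * point the packet at the vlan device. Returns 1 if the packet was claimed
 * (the caller would then reinject it into the RX path), 0 otherwise.
 */
int vlan_rx(struct model_pkt *p, const struct model_dev *vlans, size_t n)
{
    const struct model_dev *vdev;
    uint16_t tci;

    if (p->len < 18 || p->data[12] != 0x81 || p->data[13] != 0x00)
        return 0;
    tci = (uint16_t)((p->data[14] << 8) | p->data[15]);
    vdev = vlan_find(vlans, n, tci & 0x0fff);
    if (!vdev)
        return 0; /* no vlan device for this VID */
    memmove(p->data + 12, p->data + 16, p->len - 16);
    p->len -= 4;
    p->dev = vdev;
    return 1;
}
```

After a successful vlan_rx() the packet looks as if it had arrived untagged on eth0.100 or eth0.200, which is why the rest of the stack needs no vlan awareness in this model.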

Page 11: VLANs in the Linux Kernel

Initial VLAN implementation - RX path

[Diagram: the simplified RX path, extended with Vlan code (eth0.100): a packet type handler for type 0x8100 (802.1Q) pops the vlan header, changes skb->dev to the vlan dev and reinjects the packet.]

Page 12: VLANs in the Linux Kernel

Initial VLAN implementation - TX path

[Diagram: the simplified TX path, extended with Vlan code (eth0.100): its ndo_start_xmit pushes the vlan header, changes skb->dev to the real dev and reinjects the packet.]

Page 13: VLANs in the Linux Kernel

VLAN tagging/stripping HW acceleration
● Merged in March 2002
● Author: David S. Miller
● Went in together with a significant code change
● NIC does vlan header pop and push in HW
● On RX:
  ○ Driver gets the info about vlan tagging from HW
  ○ Injects the packet into the RX path differently. It uses the vlan_hwaccel_rx function
● On TX:
  ○ During vlan device creation, the accelerated path is selected if the real device has the NETIF_F_HW_VLAN_TX feature on
  ○ Vlan code puts TCI info including VID into the skb->cb cookie and sets skb->dev to the real device. Later this was moved from skb->cb to the dedicated skb->vlan_tci.
  ○ Reinject to TX path
  ○ Driver gets the info with vlan_tx_tag_get() and passes it to HW along with the packet

Page 14: VLANs in the Linux Kernel

VLAN tagging/stripping HW acceleration - RX path

[Diagram: the initial VLAN RX path plus a “vlan hwaccel RX” entry from the NIC driver into the Vlan code (eth0.100). The packet type handler for type 0x8100 (802.1Q), which pops the vlan header, changes skb->dev to the vlan dev and reinjects, is still present.]

Page 15: VLANs in the Linux Kernel

VLAN tagging/stripping HW acceleration - TX path

[Diagram: the simplified TX path with two entries in the Vlan code (eth0.100). ndo_start_xmit: push vlan header, change skb->dev to real dev, reinject. ndo_start_hwaccel_xmit: set vlan skb->cb cookie, change skb->dev to real dev, reinject.]

Page 16: VLANs in the Linux Kernel

VLAN filtering offload
● Merged in March 2002
● Author: David S. Miller
● Unknown vlan packets are filtered out in HW
● Driver advertises filtering abilities with the NETIF_F_HW_VLAN_FILTER feature bit
● Driver implements vlan_rx_register, vlan_rx_add_vid and vlan_rx_kill_vid ops
  ○ vlan_rx_register pushes down struct vlan_group, which is internal to vlan code. This turned out to be quite pointless but spread to a lot of drivers.

Page 17: VLANs in the Linux Kernel

VLAN story is starting to get a bit sad
● In the meantime, GRO support was added
● Lots of functions that drivers may call under various circumstances to get a vlan packet down to the networking core
  ○ __vlan_hwaccel_rx
  ○ vlan_gro_receive
  ○ vlan_gro_frags
● vlan_hwaccel_do_receive(), which sets skb->dev, is split out from __vlan_hwaccel_rx():

  commit 9b22ea560957de1484e6b3e8538f7eef202e3596
  Author: Patrick McHardy
  Date: Tue Nov 4 14:49:57 2008 -0800

      net: fix packet socket delivery in rx irq handler

      The changes to deliver hardware accelerated VLAN packets to packet
      sockets (commit bc1d0411) caused a warning for non-NAPI drivers.
      The __vlan_hwaccel_rx() function is called directly from the drivers
      RX function, for non-NAPI drivers that means its still in RX IRQ
      Context.

      ....

● Bonding gets in the way. More later on.

Page 18: VLANs in the Linux Kernel

VLAN model centralization
● Let the driver set skb->vlan_tci using __vlan_hwaccel_put_tag() and push the packet down to the networking core in the same way as non-vlan packets
● The vlan handling code is called from the middle of RX processing (after packet type all taps)
● Patchset finishes with patch:

  commit 3701e51382a026cba10c60b03efabe534fba4ca4
  Author: Jesse Gross
  Date: Wed Oct 20 13:56:06 2010 +0000

      vlan: Centralize handling of hardware acceleration.

      Currently each driver that is capable of vlan hardware acceleration
      must be aware of the vlan groups that are configured and then pass
      the stripped tag to a specialized receive function. This is
      different from other types of hardware offload in that it places a
      significant amount of knowledge in the driver itself rather keeping
      it in the networking core.

      ....

Page 19: VLANs in the Linux Kernel

VLAN model centralization - RX path

[Diagram: the NIC driver (eth0) fills in skb->vlan_tci and enqueues the packet like any other. In the network core, after the packet type “all” taps, the Vlan code (eth0.100) processes skb->vlan_tci and reinjects. The packet type handler for type 0x8100 (802.1Q), which pops the vlan header, changes skb->dev to the vlan dev and reinjects, is still present.]

Page 20: VLANs in the Linux Kernel

VLAN model centralization - TX path

[Diagram: the simplified TX path with Vlan code (eth0.100): its ndo_start_xmit sets skb->vlan_tci, changes skb->dev to the real dev and reinjects. Before the real device's ndo_start_xmit, the network core checks if the dev supports vlan accel and, if not, pushes the header.]
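The TX decision in the centralized model (carry the tag out of band, and only push it into the packet data when the device cannot insert it) can be sketched in userspace as below. The names struct model_dev, struct model_pkt, vlan_dev_xmit() and real_dev_prepare_xmit() are hypothetical and chosen for this example; the hw_vlan_tx flag stands in for the real device's vlan TX acceleration feature bit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the real device. */
struct model_dev {
    int hw_vlan_tx; /* nonzero if the NIC inserts the tag itself */
};

/* Hypothetical stand-in for struct sk_buff. */
struct model_pkt {
    uint8_t  data[128];
    size_t   len;
    uint16_t vlan_tci;     /* out-of-band tag, like skb->vlan_tci */
    int      vlan_present; /* nonzero while the tag is out of band */
};

/* Vlan device transmit: record the tag out of band, data stays untagged. */
void vlan_dev_xmit(struct model_pkt *p, uint16_t tci)
{
    p->vlan_tci = tci;
    p->vlan_present = 1;
}

/*
 * Core-side check before the real device's transmit: if the device has no
 * vlan TX acceleration, push the 802.1Q header into the packet data.
 */
void real_dev_prepare_xmit(const struct model_dev *d, struct model_pkt *p)
{
    if (!p->vlan_present || d->hw_vlan_tx)
        return; /* nothing to do, or the NIC will insert the tag */
    assert(p->len >= 14 && p->len + 4 <= sizeof(p->data));
    memmove(p->data + 16, p->data + 12, p->len - 12);
    p->data[12] = 0x81;
    p->data[13] = 0x00;
    p->data[14] = (uint8_t)(p->vlan_tci >> 8);
    p->data[15] = (uint8_t)(p->vlan_tci & 0xff);
    p->len += 4;
    p->vlan_present = 0;
}
```

The point of this arrangement is that the vlan device has a single transmit path, and the accelerated versus non-accelerated choice is made in one place for every driver.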

Page 21: VLANs in the Linux Kernel

Accel and non-accel unification
● For RX path only, as TX part was taken care of in “centralization” patchset
● The idea is to “emulate” VLAN HW acceleration
● Untag VLAN header for non-accelerated path early in network core and set skb->vlan_tci. Let the rest of the processing be the same as for the accelerated path.

  commit bcc6d47903612c3861201cc3a866fb604f26b8b2
  Author: Jiri Pirko
  Date: Thu Apr 7 19:48:33 2011 +0000

      net: vlan: make non-hw-accel rx path similar to hw-accel

      Now there are 2 paths for rx vlan frames. When rx-vlan-hw-accel is
      enabled, skb is untagged by NIC, vlan_tci is set and the skb gets into
      vlan code in __netif_receive_skb - vlan_hwaccel_do_receive.

      For non-rx-vlan-hw-accel however, tagged skb goes thru whole
      __netif_receive_skb, it's untagged in ptype_base hander and reinjected

      This incosistency is fixed by this patch. Vlan untagging happens early in
      __netif_receive_skb so the rest of code (ptype_all handlers, rx_handlers)
      see the skb like it was untagged by hw.

Page 22: VLANs in the Linux Kernel

Accel and non-accel unification - RX path

[Diagram: the NIC driver (eth0) either fills in skb->vlan_tci (accelerated) or passes the tagged packet as is. Early in the network core, the non-accelerated packet gets its vlan header popped and skb->vlan_tci set. Later in the RX path, the Vlan code (eth0.100) processes skb->vlan_tci and reinjects. The separate 0x8100 packet type handler is gone.]

Page 23: VLANs in the Linux Kernel

Stacked network devices
● Also called master-slave devices or upper-lower devices
● Bonding, Bridge, Team, Macvlan, Open vSwitch, …
● Master device is attached to slave device
  ○ On RX, master attaches an rx_handler to the slave and steals incoming packets
  ○ On TX, master calls dev_queue_xmit() on the slave
● Forms a hierarchy, for example:

eth0 eth1 eth2

bond0

br0 192.168.0.1/24

Page 24: VLANs in the Linux Kernel

VLAN issues in combination with stacked devices
● Ordering for RX
  ○ Does the vlan device get priority over the master device?
  ○ Does the master device get priority over the vlan device?
  ○ More on next slide
● Vlan filter
  ○ Master has to propagate ndo_vlan_rx_add_vid/ndo_vlan_rx_kill_vid down to its slaves
  ○ Master has to replay the filter setup if add_vid was called before enslavement
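The two vlan filter obligations of a master device (propagate add/kill down, and replay earlier adds when a slave joins later) can be sketched with a per-device VID bitmap. This is a userspace model, not the kernel implementation: struct model_port, struct model_master and the master_*() and vid_*() helpers are hypothetical names, and a fixed-size slave array is assumed for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VLAN_N_VID 4096
#define MAX_SLAVES 8

/* Hypothetical slave device: one filter bit per VID. */
struct model_port {
    uint8_t filter[VLAN_N_VID / 8];
};

/* Hypothetical master device: remembers every VID it was asked to add. */
struct model_master {
    uint8_t            vids[VLAN_N_VID / 8];
    struct model_port *slaves[MAX_SLAVES];
    size_t             n_slaves;
};

void vid_set(uint8_t *map, uint16_t vid)
{
    assert(vid < VLAN_N_VID);
    map[vid / 8] |= (uint8_t)(1u << (vid % 8));
}

void vid_clear(uint8_t *map, uint16_t vid)
{
    assert(vid < VLAN_N_VID);
    map[vid / 8] &= (uint8_t)~(1u << (vid % 8));
}

int vid_test(const uint8_t *map, uint16_t vid)
{
    assert(vid < VLAN_N_VID);
    return (map[vid / 8] >> (vid % 8)) & 1;
}

/* add_vid on the master: remember it and propagate to current slaves. */
void master_add_vid(struct model_master *m, uint16_t vid)
{
    size_t i;

    vid_set(m->vids, vid);
    for (i = 0; i < m->n_slaves; i++)
        vid_set(m->slaves[i]->filter, vid);
}

/* kill_vid on the master: forget it and propagate to current slaves. */
void master_kill_vid(struct model_master *m, uint16_t vid)
{
    size_t i;

    vid_clear(m->vids, vid);
    for (i = 0; i < m->n_slaves; i++)
        vid_clear(m->slaves[i]->filter, vid);
}

/* Enslave a port and replay every VID added before it joined. */
int master_enslave(struct model_master *m, struct model_port *p)
{
    unsigned int vid;

    if (m->n_slaves == MAX_SLAVES)
        return -1;
    for (vid = 0; vid < VLAN_N_VID; vid++)
        if (vid_test(m->vids, (uint16_t)vid))
            vid_set(p->filter, (uint16_t)vid);
    m->slaves[m->n_slaves++] = p;
    return 0;
}
```

Without the replay in master_enslave(), a slave that joins after the vlan device was created would keep filtering that VID out in hardware.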

Page 25: VLANs in the Linux Kernel

Stacked device with VLAN ordering fix
● For RX path only
● Changes the order so the vlan hook is called before rx_handler

  commit 2425717b27eb92b175335ca4ff0bb218cbe0cb64
  Author: John Fastabend
  Date: Mon Oct 10 09:16:41 2011 +0000

      net: allow vlan traffic to be received under bond

      The following configuration used to work as I expected. At least
      we could use the fcoe interfaces to do MPIO and the bond0 iface
      to do load balancing or failover.

      ....

      This worked because of a change we added to allow inactive slaves
      to rx 'exact' matches. This functionality was kept intact with the
      rx_handler mechanism. However now the vlan interface attached to the
      active slave never receives traffic because the bonding rx_handler
      updates the skb->dev and goto's another_round. Previously, the
      vlan_do_receive() logic was called before the bonding rx_handler.

      ....

Page 26: VLANs in the Linux Kernel

Stacked device with VLAN ordering fix - RX path

[Diagram: the same elements as the unification RX path (driver fills in skb->vlan_tci, or the core pops the vlan header and sets skb->vlan_tci; the Vlan code (eth0.100) processes skb->vlan_tci and reinjects), with the vlan hook called before the rx_handler hooks (bridge, bonding, team, macvlan, openvswitch, ...).]

Page 27: VLANs in the Linux Kernel

VLAN Linux kernel implementation summary
● 14 years of development
● Over 500 commits
● Over 3500 lines of code (net/8021q/, include/linux/if_vlan.h)
● Lots of upset end-users and developers

Page 28: VLANs in the Linux Kernel

Alternative VLAN implementation - in Linux bridge
● Merged in February 2013
● Author: Vlad Yasevich
● Implements vlan filtering in bridge
● Simple example that allows packets with VID 100 to be forwarded between eth0 and eth1:

  # ip link add name br0 type bridge
  # ip link set dev br0 type bridge vlan_filtering 1
  # ip link set eth0 master br0
  # ip link set eth1 master br0
  # bridge vlan add vid 100 dev eth0
  # bridge vlan add vid 100 dev eth1
  # bridge vlan show dev eth0
  port    vlan ids
  eth0     1 PVID Egress Untagged
           100

● To set PVID and Egress Untagged:

  # bridge vlan add vid 100 dev eth0 untagged
  # bridge vlan add vid 100 dev eth0 pvid
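What the PVID and Egress Untagged flags mean for a port can be illustrated with a small userspace model. This is a sketch of the usual per-port semantics as I understand them, not the bridge code itself: struct br_port, port_vlan_add(), port_ingress(), port_egress() and the F_PVID/F_UNTAGGED flags are hypothetical names, and the model assumes that untagged (or VID 0) frames are classified into the PVID and that re-adding a VID replaces its flags.

```c
#include <assert.h>
#include <stdint.h>

#define VLAN_N_VID 4096
#define F_PVID     0x1 /* untagged ingress frames get this VID */
#define F_UNTAGGED 0x2 /* frames of this VID leave without a tag */

enum egress_verdict { EGRESS_DROP, EGRESS_TAGGED, EGRESS_UNTAGGED };

/* Hypothetical per-port vlan state. */
struct br_port {
    uint8_t  member[VLAN_N_VID / 8];
    uint8_t  untagged[VLAN_N_VID / 8];
    uint16_t pvid; /* 0 means no PVID configured */
};

static int bit_test(const uint8_t *map, uint16_t vid)
{
    return (map[vid / 8] >> (vid % 8)) & 1;
}

static void bit_assign(uint8_t *map, uint16_t vid, int on)
{
    if (on)
        map[vid / 8] |= (uint8_t)(1u << (vid % 8));
    else
        map[vid / 8] &= (uint8_t)~(1u << (vid % 8));
}

/* Model of "bridge vlan add vid VID dev PORT [pvid] [untagged]". */
void port_vlan_add(struct br_port *p, uint16_t vid, int flags)
{
    assert(vid >= 1 && vid <= 4094);
    bit_assign(p->member, vid, 1);
    bit_assign(p->untagged, vid, flags & F_UNTAGGED);
    if (flags & F_PVID)
        p->pvid = vid;
    else if (p->pvid == vid)
        p->pvid = 0;
}

/* Ingress: return the VID the frame is classified into, or -1 to drop. */
int port_ingress(const struct br_port *p, int tagged, uint16_t vid)
{
    if (!tagged || vid == 0) {
        if (p->pvid == 0)
            return -1;
        vid = p->pvid;
    }
    if (vid >= VLAN_N_VID || !bit_test(p->member, vid))
        return -1;
    return (int)vid;
}

/* Egress: drop non-members, otherwise send with or without the tag. */
enum egress_verdict port_egress(const struct br_port *p, uint16_t vid)
{
    if (vid >= VLAN_N_VID || !bit_test(p->member, vid))
        return EGRESS_DROP;
    return bit_test(p->untagged, vid) ? EGRESS_UNTAGGED : EGRESS_TAGGED;
}
```

With VID 1 added as PVID and untagged and VID 100 added plain, as in the listing above, untagged frames entering the port land in VLAN 1, while VLAN 100 traffic must arrive tagged and leaves tagged.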

Page 29: VLANs in the Linux Kernel

Alternative VLAN implementation - in Open vSwitch
● OVS is an OpenFlow-motivated switch implementation
● Vlan support merged in October 2011 as a part of the Open vSwitch kernel datapath introduction:

  commit ccb1352e76cff0524e7ccb2074826a092dd13016
  Author: Jesse Gross
  Date: Tue Oct 25 19:26:31 2011 -0700

      net: Add Open vSwitch kernel components.

● It is possible to add flows that match packets based on the VID - “vlan flow key”
● There are vlan POP and vlan PUSH actions that can be chained to the flow match

  recirc_id(0),in_port(2),eth(src=e4:1d:2d:a5:f3:9d,dst=e4:11:22:33:44:52),eth_type(0x8100), \
      vlan(vid=53,pcp=0),encap(eth_type(0x0800),ipv4(frag=no)), packets:34, bytes:3468, used:0.260s, actions:pop_vlan,5
  recirc_id(0),in_port(5),eth(src=e4:11:22:33:44:52,dst=e4:1d:2d:a5:f3:9d),eth_type(0x0800), \
      ipv4(frag=no), packets:35, bytes:3438, used:0.260s, actions:push_vlan(vid=53,pcp=0),2

● Some of the code is reused from the vlan code, some of it is implemented separately on top
● Fixed by:

  commit 93515d53b133d66f01aec7b231fa3e40e3d2fd9a
  Author: Jiri Pirko
  Date: Wed Nov 19 14:05:02 2014 +0100

      net: move vlan pop/push functions into common code

Page 30: VLANs in the Linux Kernel

Alternative VLAN implementation - in TC
● Implemented as a part of the Classifier-Action subsystem of TC (traffic control)
  ○ Classifiers are used to match on packets: cls_u32, cls_flower, cls_bpf, many others
  ○ Actions are executed on a successfully matched packet: act_gact, act_mirred, act_skbedit, act_bpf
  ○ Nice presentation about TC CA from Netdev 0.1: https://www.netdev01.org/sessions/21
● act_vlan was added to allow pushing and popping the vlan header:

  commit c7e2b9689ef81362a8091592da6cb6a7723f377a
  Author: Jiri Pirko
  Date: Wed Nov 19 14:05:03 2014 +0100

      sched: introduce vlan action

● Simple example:

  # tc filter add dev eth0 parent ffff: protocol all u32 match u32 0 0 \
        action vlan push id 100 \
        action mirred egress redirect dev eth1
  # tc filter add dev eth1 parent ffff: protocol all u32 match u32 0 0 \
        action vlan pop \
        action mirred egress redirect dev eth0

● There is a plan to extend cls_flower to allow matching on vlan headers

Page 31: VLANs in the Linux Kernel

Alternative VLAN implementation - in BPF
● BPF - Berkeley Packet Filter
  ○ Implemented as a VM with a specific instruction set and set of registers
  ○ The kernel interprets the BPF program inserted by the user
  ○ Originally served as a filter program attached to a socket, now used as a “universal in-kernel VM”
  ○ JIT support for many CPU architectures
  ○ Extension is called eBPF - more registers, added maps, etc.
● Vlan header info getter and header push and pop support introduced by:

  commit c24973957975403521ca76a776c2dfd12fbe9add
  Author: Alexei Starovoitov
  Date: Mon Mar 16 18:06:02 2015 -0700

      bpf: allow BPF programs access 'protocol' and 'vlan_tci' fields

  commit 4e10df9a60d96ced321dd2af71da558c6b750078
  Author: Alexei Starovoitov
  Date: Mon Jul 20 20:34:18 2015 -0700

      bpf: introduce bpf_skb_vlan_push/pop() helpers

Page 32: VLANs in the Linux Kernel

BPF usage for networking purposes
● TC clsact support added to iproute2 by:

  commit 8f9afdd531560c1534be44424669add2e19deeec
  Author: Daniel Borkmann
  Date: Tue Jan 12 01:42:20 2016 +0100

      tc, clsact: add clsact frontend

      Add the tc part for the kernel commit 1f211a1b929c ("net, sched: add
      clsact qdisc"). Quoting example usage from that commit description:

        Example, adding qdisc:

         # tc qdisc add dev foo clsact
         # tc qdisc show dev foo
         qdisc mq 0: root
         qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
         qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
         qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
         qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
         qdisc clsact ffff: parent ffff:fff1

        Adding filters (deleting, etc works analogous by specifying ingress/egress):

         # tc filter add dev foo ingress bpf da obj bar.o sec ingress
         # tc filter add dev foo egress bpf da obj bar.o sec egress
         # tc filter show dev foo ingress
         filter protocol all pref 49152 bpf
         filter protocol all pref 49152 bpf handle 0x1 bar.o:[ingress] direct-action
         # tc filter show dev foo egress
         filter protocol all pref 49152 bpf
         filter protocol all pref 49152 bpf handle 0x1 bar.o:[egress] direct-action

        The ingress parent alias can also be used with ingress qdisc.


Page 33: VLANs in the Linux Kernel

Questions?

Link to slides: