
Page 1: Supercharged PlanetLab Platform, Control Overview

Supercharged PlanetLab Platform, Control Overview

Fred Kuhns ([email protected])

Applied Research Laboratory

Washington University in St. Louis

Page 2: Supercharged PlanetLab Platform, Control Overview

Prototype Organization

• One NP blade (with RTM) implements the Line Card
  – separate ingress/egress pipelines
• A second NP hosts multiple slice fast paths
  – multiple static code options for diverse slices
  – configurable filters and queues
• GPEs run the standard PlanetLab OS with vServers

[Figure: block diagram of the two NP blades. The Line Card NP sits between the external and switch interfaces with separate ingress (ExtRx, Key Extract, Lookup, Hdr Format, Queue Manager, IntTx) and egress (IntRx, Key Extract, Lookup, Rate Monitor, Hdr Format, Queue Manager, ExtTx) microengine pipelines; the NPE runs an Rx, Key Extract, Lookup, Hdr Format, Queue Manager, Tx pipeline. Each pipeline uses TCAM, SRAM and DRAM.]

Page 3: Supercharged PlanetLab Platform, Control Overview

Connecting an SPP

[Figure: an SPP node (GPEs, NPEs, Line Card(s), internal switch) and its Control Processor (CP), attached via an Ethernet switch or point-to-point links to local/regional, East Coast and West Coast routers (R), reaching other PlanetLab/SPP nodes and end hosts. ARP is used for endstations and intermediate routers.]

Page 4: Supercharged PlanetLab Platform, Control Overview

System Block Diagram

[Figure: SPP node block diagram. A base Ethernet switch (1 Gbps, control) and a fabric Ethernet switch (10 Gbps, data path) connect the Control Processor (CP), standalone GPEs, the NPEs (each with NPU-A and NPU-B xscales, SPI, TCAM, PCI, GE, and a Substrate Control Daemon), and the Line Card with its RTM (10 x 1GbE external interfaces). The shelf manager is reached over I2C (IPMI). The CP runs the Substrate Control Daemon (SCD), Boot and Configuration Manager (BCM), System Resource Manager (SRM), System Node Manager (SNM), sshd*, httpd*, routed*, tftp/dhcpd, and holds the route DB, Resource DB, Slivers DB, user info, flow stats (netflow), boot files and nodeconf.xml. Each GPE runs user slivers, vnet, pl_netflow, an SCD, the LNM and the LRM. The Line Card holds the ARP table, FIB, and NAT & tunnel filters (in/out).]

Page 5: Supercharged PlanetLab Platform, Control Overview

Software Components

• Utilities: parts of the BCM that generate config and distribution files
  – node configuration and management: generate config files, dhcp, tftp, ramdisk
  – Boot CD and distribution file management (images, RPM and tar files) for GPEs and CP
• Control Processor (CP):
  – Boot and Configuration Manager (BCM)
  – System Resource Manager (SRM)
  – System Node Manager (SNM)
  – user authentication and ssh forwarding daemon
  – http daemon providing a node-specific interface to netflow data (planetflow)
  – routing protocol daemon (BGP/OSPF/RIP) maintaining the FIB in the Line Card
• General Purpose Element (GPE):
  – Local Boot Manager (LBM): modified BootManager running on the GPEs
  – Local Resource Manager (LRM)
  – Local Node Manager (LNM): the required changes to the existing Node Manager software
• Network Processor Element (NPE):
  – Substrate Control Daemon (SCD, formerly known as wuserv)
  – kernel module to read/write memory locations (wumod)
  – command interpreter for configuring NPU memory (wucmd)
  – modified Radisys and Intel source; ramdisk; Linux kernel
• Line Card:
  – ARP: protocol and error notifications. Lookup table entries hold either the next-hop IP or an Ethernet address.
    • Sliver packets that cannot be mapped to an Ethernet address must receive error notifications.
  – netflow-like statistics collection, reported to the CP for display on the web and download by PLC
  – FIB in the lookup table, maintained by the SRM
  – NAT lookup entries for unregistered traffic originating from the GPEs or CP

Page 6: Supercharged PlanetLab Platform, Control Overview

Boot and Configuration Management

• Read the config file and allocate IP subnets and addresses for the substrate
• Initialize the Hub (delegated to the SRM)
  – base and fabric switches
  – initialize any switches not within the chassis
• Create the dhcp configuration file and start the daemon (see the sketch below)
  – assigns control IP subnets and addresses
  – assigns the internal substrate IP subnet on the fabric Ethernet
• Initialize the Line Card to forward all traffic to the CP
  – use the control interface, base or front panel (base is only connected to NPU-A)
  – all ingress traffic is sent to the CP
  – What about egress traffic when we are multi-homed, either through different physical ports or one port with more than one next hop?
    • We could assume only one physical port and one next hop.
    • This is a general issue; the general solution is to run routing protocols on the CP and keep the Line Card's TCAM up to date.
• Start the remaining system-level services (i.e. daemons)
  – wuarl daemons
  – system daemons: sshd*, httpd, routed*
• The System Node Manager maintains user login information for ssh forwarding

Page 7: Supercharged PlanetLab Platform, Control Overview

Boot and Configuration Management

• Assist the GPEs in booting:
  – Download from PLC the SPP-specific versions of the BootManager and NodeManager tar/rpm distributions.
  – Download/maintain the PlanetLab bootstrap distribution.
• Updated BootCD
  – The boot CD contains an SPP config file with the CP address, spp_config.
  – No modifications to the initial boot scripts; they contact the BCM over the fabric interface (using the substrate IP subnet) and download the next stage.
• GPEs obtain distribution files from the BCM on the CP:
  – SPP changes are confined to the BootManager and NodeManager sources (that is the plan).
  – The PLC database is updated to place all SPP nodes in the "SPP" node group; we use this to trigger additional "special" processing.
  – Modified BootManager scripts configure the control interface (base) and 2 fabric interfaces (2 per Hub).
  – Creates/updates the spp_config file on the GPE node.
  – Installs the BootStrap source, then overwrites the NodeManager with our modified version.

Page 8: Supercharged PlanetLab Platform, Control Overview

Default Traffic Configurations

• Default: traffic is forwarded to the CP over the 10 Gbps Ethernet switch (aka fabric).
• Control messages are sent over an isolated base Ethernet switch, for isolation and security.
• The Line Card performs a NAT-like function for traffic from vServers.

[Figure: the SPP node (Line Card, NPEs, GPEs with mux/LRM/LNM, CP with MP, SNM, SRM, user login info, sliver table and Resource DB) plus external interfaces to the fabric and base switches for additional GPEs; default data traffic crosses the 10GbE fabric to the CP, while control crosses the 1GbE base switch toward PLC.]

Page 9: Supercharged PlanetLab Platform, Control Overview

Logging Into a Slice

• An ssh connection is directed to the CP for user authentication.
• Once authenticated, the session is forwarded to the appropriate GPE and vServer.

[Figure: same node diagram, with an external host's ssh session entering via the Line Card, reaching the ssh forwarder and sliver table on the CP, then being forwarded to the target GPE's vServer.]

Page 10: Supercharged PlanetLab Platform, Control Overview

System Node Manager

• Logically the top half of the PlanetLab Node Manager
• PLC API method GetSlivers():
  – periodically call PLC for the current list of slices assigned to this node
  – assign system slivers to each GPE, then split application slivers across the available GPEs
  – keep persistent tables to handle daemon crashes or local device reboots
• Local GetSlivers() (xmlrpc interface) to the GPEs (see the sketch below)
  – gives each GPE's local node manager its list of allocated slivers along with other node-specific data: {timestamp, list of configuration files, node id, node groups, network addresses, assigned slivers}
• Resource management across GPEs
  – Manage Pool and VM RSpec assignment for each GPE:
    • opportunity to extend RSpecs to account for distributed resources
  – Perform "top-half" processing of the per-GPE LNM API (exported to slivers on this node only). Calls on one GPE may impact resource assignments or sliver status on a different GPE: {Ticket(), GetXIDs(), GetSSHKeys(), Create(), Destroy(), Start(), Stop(), GetEffectiveRSpec(), GetRSpec(), GetLoans(), validate_loans(), SetLoans()}
• Currently the node manager uses CA certs and SSH keys when communicating with PLC; we will need to do the same, but we can relax security between the SNM and the LNMs.
• Tightly coupled with the System Resource Manager
  – Maintain a node-unique Sliver ID, which corresponds to what we call the meta-router ID, and make it available to the SRM when enabling fast-path processing (VLANs, UDP port numbers, etc.).
  – Must request/maintain the list of available GPEs and the resource availability on each; used for allocating slivers to GPEs and handling RSpecs.
  – The SRM may delegate GPE management to the SNM.
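A minimal sketch of the local GetSlivers() idea: the SNM exposes an XML-RPC endpoint on the CP that each LNM polls for its share of the slivers. The method name comes from the slide; the record layout, the port number, and the naive round-robin split are assumptions for illustration only.

    import time
    from xmlrpc.server import SimpleXMLRPCServer

    GPES = ["gpe1", "gpe2"]                               # hypothetical GPE ids
    ALL_SLIVERS = ["princeton_s1", "wustl_s2", "mit_s3"]  # as fetched from PLC

    def get_slivers(gpe_id):
        """Return node-specific data plus the slivers assigned to this GPE."""
        mine = [s for i, s in enumerate(sorted(ALL_SLIVERS))
                if GPES[i % len(GPES)] == gpe_id]  # naive round-robin split
        return {"timestamp": time.time(),
                "node_id": "spp-node-1",
                "node_groups": ["SPP"],
                "slivers": mine}

    if __name__ == "__main__":
        srv = SimpleXMLRPCServer(("0.0.0.0", 8001), allow_none=True)
        srv.register_function(get_slivers, "GetSlivers")
        srv.serve_forever()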

Page 11: Supercharged PlanetLab Platform, Control Overview

SNM: Questions

• Robustness -- not contemplated for this version
  – If a GPE goes down, do we migrate its slivers to the remaining GPEs?
  – If a GPE is added, do we migrate some slivers to the new GPE to load balance?
• Do we need to intercept any of the API calls made against the PLC?
• What about the boot manager API calls and the uploading of boot log files (alpina boot logs)?
• Implementation of the remote reboot command and console logging.

Page 12: Supercharged PlanetLab Platform, Control Overview

Local Node Manager

• The "bottom half" of the existing Node Manager
• Modify GetSlivers() to call the System Node Manager (see the sketch below)
  – use the base interface and different security (currently they wrap xmlrpc calls with a curl command which includes the PLC's certified public key)
• Forward GPE-oriented sliver resource operations to the SNM: see the API list in the SNM description
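The companion LNM-side sketch, under the same assumptions as the SNM sketch above: GetSlivers() is redirected from PLC to the SNM on the CP. The URL and GPE id are placeholders; the security wrapper mentioned above is omitted.

    import xmlrpc.client

    SNM_URL = "http://10.0.0.1:8001/"   # CP address from spp_config (illustrative)

    def get_slivers(gpe_id="gpe1"):
        """Fetch this GPE's sliver list from the SNM instead of from PLC."""
        snm = xmlrpc.client.ServerProxy(SNM_URL, allow_none=True)
        return snm.GetSlivers(gpe_id)

    if __name__ == "__main__":
        # Assumes the SNM sketch above is listening on the CP.
        print(get_slivers())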

Page 13: Supercharged PlanetLab Platform, Control Overview

Update Local Slice Definitions

• The SNM retrieves/updates slice descriptions from PLC.
• It updates the local database and allocates slice instances (slivers) to GPE nodes.

[Figure: node diagram; PLC delivers slice descriptions over the base switch to the SNM on the CP, which records them in the sliver table and Resource DB and distributes per-GPE sliver lists to each LNM.]

Page 14: Supercharged PlanetLab Platform, Control Overview

Creating Local Slice Instance

• Slice descriptions are retrieved/updated from PLC as before.
• A new slice (vServer) is created on the assigned GPE.

[Figure: same node diagram, highlighting the "create new slice" step on the target GPE after the SNM assigns the sliver.]

Page 15: Supercharged PlanetLab Platform, Control Overview

System Resource Manager

[Figure: SRM context diagram. The SRM on the CP, with its Resource DB, manages the primary Hub (logical slot 1, channel 1) and an alternate Hub (logical slot 2, channel 2), each with base and fabric switches (SFP/XFP ports) and snmpd; node components not in a hub (switches, GPEs, development hosts); NPEs (SCD, SRAM, TCAM, fast paths FPk); GPEs (LRM, LNM, MUX, PlanetLab OS root context); and the Line Card (SCD, TCAM).]

Page 16: Supercharged PlanetLab Platform, Control Overview

System Resource Manager

• Maintains a table describing the system hardware components and their attributes (sketched below)
  – NPEs and their code options
  – GPEs and HW attributes
• Sliver attributes corresponding to internal representations and control mechanisms:
  – unique Sliver ID (aka meta-router ID)
  – global port space across the assigned IP addresses
  – fast-path VLAN assignment and corresponding IP subnets
• Manage the fabric Ethernet switches (including any used external to the chassis or in a multi-chassis scenario)
• Manage Line Card table entries
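One way to picture the SRM's bookkeeping is as a pair of records, sketched below. The field names are assumptions drawn from the attributes listed above, not the actual SRM schema.

    from dataclasses import dataclass, field

    @dataclass
    class NPEEntry:
        npe_id: int
        code_options: list          # fast-path code options this NPE can host
        free_slots: set             # unassigned code-option instance slots

    @dataclass
    class SliverEntry:
        sliver_id: int              # aka meta-router ID, unique within the node
        vlan_id: int                # fast-path VLAN (1-to-1 with sliver_id)
        ports: dict = field(default_factory=dict)  # (ip, proto) -> global ports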

Page 17: Supercharged PlanetLab Platform, Control Overview

System Resource Management

• Allocate global port space (see the sketch below)
  – input: Slice ID, [Global IP address=0, proto=UDP, Port=0]
  – actions: allocate port
  – output: {IP Address, Port, Proto} or 0 [can't allocate]
• Allocate Sliver ID
  – input: slice name
  – actions:
    • allocate a unique Sliver ID and assign it to the slice
    • allocate a VLAN ID (1-to-1 map of sliver ID to VLAN)
  – output: {Sliver ID, VLAN ID}
• Allocate NPE code option (internal)
  – input: Sliver ID, code option id
  – action: assign an NPE 'slot' to the slice
    • allocate a code-option instance from an eligible NPE: {NPE, instance ID}
    • allocate a memory block for the instance (the instance ID is just an index into an array of preallocated memory blocks)
  – output: NPE Instance = {NPE ID, Slot Number}
• Allocate Stats Index
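A minimal sketch of the first two allocators, following the input/output signatures above. The port range, the in-memory bookkeeping, and the counter-based ID assignment are illustrative assumptions.

    import itertools

    _ports = {}                      # (ip, proto) -> set of allocated ports
    _next_sliver_id = itertools.count(1)

    def alloc_port(slice_id, ip="0.0.0.0", proto="UDP", port=0):
        """Allocate a global {IP, Port, Proto}; return 0 if it can't be done."""
        used = _ports.setdefault((ip, proto), set())
        if port == 0:                # caller let us pick: scan a fixed range
            port = next((p for p in range(50000, 60000) if p not in used), 0)
        if port == 0 or port in used:
            return 0
        used.add(port)
        return {"ip": ip, "port": port, "proto": proto}

    def alloc_sliver_id(slice_name):
        """Allocate a node-unique Sliver ID and its 1-to-1 VLAN ID."""
        sid = next(_next_sliver_id)
        return {"sliver_id": sid, "vlan_id": sid}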

Page 18: Supercharged PlanetLab Platform, Control Overview

System Resource Manager

• Add tunnel (aka meta-interface) to an NPE instance (see the sketch below):
  – input: Sliver ID, NPE Instance, {IP Address, UDP Port}
  – actions:
    • add the mapping to the NPE demux table [VLAN:IP Addr:UDP Port <-> Instance ID]
    • update the instance's attribute block: {tunnel fields, exception/local delivery, QID, physical port, Ethernet addr for NPE/LC}
    • update the next-hop table (result index maps to the next-hop tunnel)
    • set default QM weights, number of queues, thresholds
    • update Line Card ingress and egress lookup tables: tunnel, NPE Ethernet address, physical port, QIDs, etc.??
    • update LC ingress and egress queue attributes for the tunnel??
• Create NPE sliver instance:
  – input: Slice ID; {IP address, UDP Port}; {Interface ID, Physical Port}; {SRAM block; # filter table entries; # of queues; # of packet buffers; code option; amount of SRAM required; total reserved bandwidth}
  – actions:
    • allocate the NPE code option
    • add the tunnel to the NPE instance
    • enable the sliver's VLAN on the associated fabric interface ports
    • delegate to the LRM: configure the GPE vnet module (via the LRM) to accept the sliver's VLAN traffic; open UDP ports for data and control in the root context and pass them back to the client
  – output: (NPE code option) instance number
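The demux step is the heart of this. Here is a sketch with a plain dict standing in for the TCAM/SRAM tables the SCD would actually program; the names are illustrative.

    _demux = {}   # (vlan, ip, udp_port) -> fast-path instance id

    def add_tunnel(vlan, ip, udp_port, instance_id):
        """Map one meta-interface (tunnel endpoint) to its fast-path instance."""
        key = (vlan, ip, udp_port)
        if key in _demux:
            raise ValueError("tunnel already mapped: %r" % (key,))
        _demux[key] = instance_id

    def classify(vlan, ip, udp_port):
        """What the NPE lookup stage does per packet, conceptually: None
        means no fast path matched (exception / local-delivery traffic)."""
        return _demux.get((vlan, ip, udp_port))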

Page 19: Supercharged PlanetLab Platform, Control Overview

Local Resource Manager

• Acts as the intermediary between client virtual machines and the node control infrastructure
  – all exported interfaces are implemented by the LRM (see the sketch below)
• Managing the life cycle of an NPE code instance
• Accessing instance data and memory locations:
  – read/write the code-option instance's memory block
  – get/set queue attributes {threshold, weight}
  – get/add/remove/update lookup table entries (i.e. TCAM filters)
  – get/clear pre/post queue counters for a given stats index
    • one-time or periodic get
  – get packet/byte counters for a tunnel at the Line Card
  – allocate/release a local port
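A thin in-process sketch of that exported surface; a real LRM would forward each call to the SCD or SRM rather than keep local state. The method names follow the bullets above; the storage is illustrative.

    class LocalResourceManager:
        def __init__(self):
            self._queues = {}     # qid -> {"threshold": ..., "weight": ...}
            self._counters = {}   # stats_index -> {"pre": 0, "post": 0}

        def set_queue(self, qid, threshold, weight):
            # In the real system this would be relayed to the SCD on the NPE.
            self._queues[qid] = {"threshold": threshold, "weight": weight}

        def get_queue(self, qid):
            return self._queues.get(qid)

        def get_counters(self, stats_index, clear=False):
            val = dict(self._counters.get(stats_index, {"pre": 0, "post": 0}))
            if clear:
                self._counters[stats_index] = {"pre": 0, "post": 0}
            return val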

Page 20: Supercharged PlanetLab Platform, Control Overview

Allocating NPE (Creating Meta-Router)

1. A sliver asks its LRM to allocate an NPE sliver {code option, SRAM, interfaces/ports, etc.}.
2. The LRM forwards the request to the System Resource Manager.
3. The SRM allocates shared NPE resources and associates them with the new slice fast path {SRAM block; # filter table entries; # of queues; # of packet buffers; code option; amount of SRAM required; total reserved bandwidth}.
4. It allocates and enables a VLAN (VLANk) to isolate internal slice traffic.
5. It allocates global UDP ports for the requested interface(s) and configures the Line Card.
6. The GPE opens a local socket for exception and local-delivery traffic and returns it to the client vServer; status and the assigned global port number are returned.

[Figure: node diagram showing these steps across the GPE (mux, LRM), the CP (SNM, SRM, user login info, sliver table, Resource DB) and the NPE (fast path FPk, meta-interface MI1, VLANk, control interface, lookup table, SRAM/TCAM). FP = fast path.]

Page 21: Supercharged PlanetLab Platform, Control Overview

Managing the Data Path

• Allocate or delete an NPE slice instance
• Add, remove or alter filters
  – each slice is allocated a portion of the NPE's TCAM
• Read or write the per-slice memory blocks in SRAM
  – each slice is allocated a block of SRAM
• Read counters
  – one-time or periodic
• Set queue rate or threshold
• Get queue lengths

[Figure: the GPE's LRM relays these operations over the base switch to the SCD on the NPE, which programs the TCAM, SRAM and fast paths (FPk, DPl). FP = fast path.]

Page 22: Supercharged PlanetLab Platform, Control Overview

Other LC Functions

• Line Card table maintenance
  – a multi-homed SPP node must be able to send packets to the correct next-hop router/endsystem
  – random traffic from/to the GPE must be handled correctly
  – tunnels represent point-to-point connections, so it may be acceptable to explicitly indicate which of possibly several interfaces, and which next (Ethernet) hop device, a tunnel should be bound to
  – alternatively, if we are running the routing protocols, we could provide the user with the output port as a utility program
  – but there are problems with running routing protocols: we could forward all route updates to the CP, but standard implementations assume the interfaces are physically connected to the endsystem
  – we could play tricks as VINI does
  – or we assume that there is only one interface connected to one Ethernet device
• NAT functions (see the sketch below)
  – traffic originating from within the SPP
  – may also want to selectively map global proto/port numbers to specific GPEs?
• ARP and FIB on the Line Card
  – the route daemon runs on the CP and keeps the FIB up to date
  – ARP runs on the xscale and maps FIB next-hop entries to their corresponding Ethernet destination addresses
• netflow
  – flow-based statistics collection
  – the SRM collects it periodically and posts it via the web
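A sketch of the NAT-like rewrite for unregistered traffic from the GPEs or CP: map an internal (address, port) to a global port on egress and invert the mapping on ingress. Pure illustration; the real table would live in the Line Card's lookup tables, and the addresses here are placeholders.

    _out = {}          # (int_ip, int_port) -> global_port
    _in = {}           # global_port -> (int_ip, int_port)
    _next_port = 40000
    GLOBAL_IP = "192.0.2.1"   # the node's external address (illustrative)

    def egress(int_ip, int_port):
        """Rewrite an internal source to the node's global address."""
        global _next_port
        key = (int_ip, int_port)
        if key not in _out:
            _out[key] = _next_port
            _in[_next_port] = key
            _next_port += 1
        return GLOBAL_IP, _out[key]

    def ingress(global_port):
        """Invert the mapping for a reply arriving at the global address."""
        return _in.get(global_port)   # None -> unmapped; drop or send to CP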

Page 23: Supercharged PlanetLab Platform, Control Overview

Other Functions

• vnet
  – isolation based on VLAN IDs
  – support port reservations
• ssh forwarding (see the sketch below)
  – maintain user login information on the CP
  – modify the ssh daemon (or add a wrapper) to forward user logins to the correct GPE
• Rebooting the node (SPP), even when the Line Card fails??
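A sketch of the forwarding decision inside that wrapper: after the CP authenticates the user, look up which GPE hosts the sliver. The table contents and the handoff mechanism are assumptions for illustration.

    # login name -> (GPE host, port) for the vServer hosting that sliver
    LOGIN_TABLE = {"wustl_s2": ("gpe1", 22)}   # illustrative entries

    def forward_target(login):
        """Pick the GPE to forward an authenticated session to."""
        if login not in LOGIN_TABLE:
            raise KeyError("no sliver for login %r" % login)
        return LOGIN_TABLE[login]

    if __name__ == "__main__":
        host, port = forward_target("wustl_s2")
        print("forwarding session to %s:%d" % (host, port))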