2018 Lenovo Internal. All rights reserved.
Cloud on steroids
Accelerating your cloud via cyborg
Jinghua Gao, Zhenghao Wang (Staff Researcher, Lenovo Research)
2018-05-23
OpenStack Vancouver Summit, May 2018
Agenda
01 Necessity of Acceleration Management
02 Cyborg Introduction
03 Lenovo’s Contribution to Cyborg
04 Demo
05 Summary
1. Necessity of Acceleration Management
Prevalence of Accelerations
1. Virtual Networking Offloading
2. Dynamic Optimization of Packet Flow Routing
3. Load Balancing and NAT
4. Open vSwitch, HTTPS offloading
…
1. NVMe Over Fabric Enabled Acceleration
2. High Performance Persistent Memory
…
1. vBRAS, HQoS, Multicast Offloading
2. vRAN, Cipher/Decipher Offloading
3. SBC, Media Codec Offloading
4. TensorFlow, Model Training Acceleration
5. Cryptocurrency Mining Acceleration
6. Next Generation Firewall (NGFW) Acceleration
…
VM/App layer: Compute Acceleration, Storage Acceleration, Network Acceleration
Infrastructure layer: provides accelerators
Hardware Accelerators: ASIC, GPU, FPGA
Software Accelerators: DPDK/SPDK
Usage Scenarios: AI, NFV, Blockchain, Genetic Sequencing, Big Data, …
Challenges
• Difficult to standardize various acceleration technologies
– Software accelerators: DPDK, SPDK.
– Multi-vendor hardware accelerators with different architectures, such as GPU, ASIC, FPGA, etc.
• Complex
– Different contexts and usage scenarios.
– Different forms: virtualized, time-shared, pass-through, etc.
• Expensive
– Non-trivial management effort.
– High price of hardware.
Cyborg Project
Need a unified acceleration management framework to enable acceleration as a service
2. Cyborg Introduction
Timeline and Definition
• Feb 2016: Nomad repo established
• Apr 2016: First BOF session at Austin
• Oct 2016: First design session in Barcelona
• Feb 2017: Pike PTG, rename to cyborg
• Sep 2017: Queens PTG, becomes official project
• Feb 2018: Queens Release (API-DB, Conductor-Agent, Generic Driver)
• Sep 2018: Rocky Release (os-acc, Xilinx FPGA driver, pythonclient)
• General management framework
– Software accelerators: DPDK/SPDK, PMEM, XDP/eBPF, ...
– Hardware accelerators: FPGA, GPU, QAT, NVMe SSD, CCIX-based caches, ...
• Lifecycle management of accelerators
– Discovery, Program, Attach, Detach, Remove
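The five lifecycle stages (Discovery, Program, Attach, Detach, Remove) can be read as a small state machine. The sketch below is only an illustration of that reading: the class names, the three states, and the rule that a device must be detached before it is removed are assumptions, not taken from Cyborg's code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    AVAILABLE = "available"  # discovered and free to use
    ATTACHED = "attached"    # in use by an instance
    REMOVED = "removed"      # no longer managed


class LifecycleError(Exception):
    """Raised when an operation is not valid in the current state."""


@dataclass
class Accelerator:
    name: str
    state: State = State.AVAILABLE   # discovery creates the record
    function: Optional[str] = None   # what the device is programmed to do
    instance: Optional[str] = None   # instance it is attached to, if any

    def _require(self, expected: State, action: str) -> None:
        if self.state is not expected:
            raise LifecycleError(f"cannot {action} while {self.state.value}")

    def program(self, function: str) -> None:
        self._require(State.AVAILABLE, "program")
        self.function = function

    def attach(self, instance: str) -> None:
        self._require(State.AVAILABLE, "attach")
        self.instance = instance
        self.state = State.ATTACHED

    def detach(self) -> None:
        self._require(State.ATTACHED, "detach")
        self.instance = None
        self.state = State.AVAILABLE

    def remove(self) -> None:
        # Assumed rule: only a free device may be removed.
        self._require(State.AVAILABLE, "remove")
        self.state = State.REMOVED


def discover(name: str) -> Accelerator:
    """Discovery: create a record for a newly found device."""
    return Accelerator(name=name)
```

Usage: `acc = discover("fpga-0"); acc.program("crypto"); acc.attach("vm-1")` leaves the device attached, and calling `acc.remove()` at that point raises `LifecycleError` until `acc.detach()` has run.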
Architecture
controller-node: cyborg-api, cyborg-conductor, cyborg-db
compute-node: cyborg-agent
– fpga-driver: vendor-a-fpga-driver, vendor-b-fpga-driver
– gpu-driver: vendor-c-gpu-driver
– spdk-driver, …
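A minimal sketch of how the pieces in this architecture could fit together: per-vendor drivers behind one interface, an agent on the compute node that runs them, and a conductor that stores the result. All class and method names here are hypothetical stand-ins, the PCI addresses are made-up sample data, and none of this is Cyborg's actual driver API.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class AcceleratorDriver(ABC):
    """Hypothetical driver interface; not Cyborg's real base class."""

    kind: str  # accelerator type handled by the driver, e.g. "fpga"

    @abstractmethod
    def discover(self) -> List[Dict[str, str]]:
        """Return one dict per device found on this compute node."""


class VendorAFpgaDriver(AcceleratorDriver):
    kind = "fpga"

    def discover(self) -> List[Dict[str, str]]:
        # Static sample data; a real driver would query the hardware.
        return [{"vendor": "vendor-a", "address": "0000:5e:00.0"}]


class VendorCGpuDriver(AcceleratorDriver):
    kind = "gpu"

    def discover(self) -> List[Dict[str, str]]:
        return [{"vendor": "vendor-c", "address": "0000:3b:00.0"}]


class Agent:
    """Stand-in for cyborg-agent: runs every driver and builds a report."""

    def __init__(self, host: str, drivers: List[AcceleratorDriver]):
        self.host = host
        self.drivers = drivers

    def report(self) -> List[Dict[str, str]]:
        devices = []
        for driver in self.drivers:
            for dev in driver.discover():
                devices.append({"host": self.host, "type": driver.kind, **dev})
        return devices


class Conductor:
    """Stand-in for cyborg-conductor: stores agent reports in a dict 'db'."""

    def __init__(self) -> None:
        self.db: Dict[str, Dict[str, str]] = {}

    def update(self, devices: List[Dict[str, str]]) -> None:
        for dev in devices:
            self.db[f"{dev['host']}/{dev['address']}"] = dev
```

With this layout, supporting a new vendor means adding one driver class; the agent and conductor stay unchanged.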
Interaction with Other Projects
Two main use case groups
• Attached to the VM where the workload demands acceleration.
• Used by infrastructure, and then utilized via the appropriate service.
Other projects: Nova, Nova & Glance
Accelerator examples: FPGA (Intel & Xilinx), GPU, QAT…, DPDK/SPDK
Interaction with Nova
• Work with Nova through three steps:
1. Representation at discovery
2. Instance placement/scheduling
3. Attaching accelerators to instances
Upstream:
controller: nova-api, nova-conductor, nova-scheduler, nova-placement-api, cyborg-api, cyborg-conductor, cyborg-db
compute: nova-compute, hypervisor, cyborg-agent (Driver A, Driver B, Driver C), accelerators
Flow labels: update, filter/weigher, os-acc
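For the first step, representation at discovery, one plausible shape for the "update" is an inventory record per compute node. The sketch below builds a body modeled on the placement API's inventories format, where custom resource class names carry a `CUSTOM_` prefix. The slides do not say which resource classes are used, so the names derived here (and the helper functions themselves) are assumptions.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Custom resource class names use upper-case letters, digits and "_".
_INVALID = re.compile(r"[^A-Z0-9_]")


def resource_class(acc_type: str) -> str:
    """Map an accelerator type such as "fpga" to a custom class name."""
    return "CUSTOM_" + _INVALID.sub("_", acc_type.upper())


def build_inventories(acc_types: Iterable[str], generation: int) -> Dict:
    """Count discovered accelerators per type and wrap them as inventories."""
    counts = Counter(resource_class(t) for t in acc_types)
    return {
        "resource_provider_generation": generation,
        "inventories": {
            rc: {
                "total": n,
                "reserved": 0,
                "min_unit": 1,
                "max_unit": n,
                "step_size": 1,
                "allocation_ratio": 1.0,  # accelerators are not overcommitted here
            }
            for rc, n in sorted(counts.items())
        },
    }
```

Usage: `build_inventories(["fpga", "gpu", "fpga"], generation=3)` yields one `CUSTOM_FPGA` entry with a total of 2 and one `CUSTOM_GPU` entry with a total of 1.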
3. Lenovo’s Contribution to Cyborg
Real World Requirements
Workloads: AI, NFV, Blockchain, Big Data
Accelerators: GPU, FPGA, NVMe SSD, Netronome SmartNIC, Cavium SmartNIC, Intel QAT
Stack: Hypervisor, DPDK, OpenStack (Nova, Neutron), cyborg (API, Conductor, Agent, Driver), ...
NFV: VNF (vRAN, vBRAS, SBC…) / Infrastructure (NGFW, OVS…)
• High performance: 10~100 Gbps and up
• High reliability: uptime of 99.999%
• Low latency: usually less than 100 ms
Lenovo’s Efforts on Cyborg
• Integrate with nova.
– Provide an acceleration solution without nova-placement.
– Provide the accelerator during VM boot time or via a separate attach/detach action.
• Extend drivers.
– Use the upstream FPGA driver.
– Add GPU, Netronome drivers, etc.
• Some production deployments predate the Newton release and do not have nova-placement.
• To use accelerators dynamically.
• To accelerate different workloads.
Cyborg Use Case: GPU 1/2
Boot Time Attachment
Resource updating at discovery
– Periodically update to cyborg-db.
Instance scheduling
1. Create VMs with specific image properties.
2. Schedule using acc_filter.
3. Cyborg returns the list of compute nodes.
Attaching accelerators to instances
1. Call cyborg to claim the required GPU resource.
2. Define the XML with the GPU pci_address.
3. Run the VM. If it fails, call cyborg to release the allocated GPU resource.
controller: nova-api, nova-conductor, nova-scheduler (acc_filter), cyborg-api, cyborg-conductor, cyborg-db
compute: nova-compute, Hypervisor, cyborg-agent (Driver A, Driver B, Driver C), Accelerators
Flow labels: image_properties, periodically retrieve, claim resources
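The scheduling and attaching steps above can be sketched with an in-memory stand-in for cyborg. This is an illustration only: `CyborgStub`, the `accelerator_type` image property, and the function signatures are hypothetical, and the real acc_filter is not shown in the slides. The point is the claim, run, release-on-failure ordering.

```python
from typing import Callable, Dict, List, Optional


class CyborgStub:
    """In-memory stand-in for the cyborg service; not the real API."""

    def __init__(self, inventory: Dict[str, List[str]]):
        # host -> free GPU PCI addresses, as the agent would report them
        self.free = {host: list(addrs) for host, addrs in inventory.items()}

    def hosts_with(self, count: int) -> List[str]:
        return [h for h, addrs in self.free.items() if len(addrs) >= count]

    def claim(self, host: str) -> str:
        return self.free[host].pop(0)

    def release(self, host: str, address: str) -> None:
        self.free[host].append(address)


def acc_filter(candidates: List[str], image_props: Dict[str, str],
               cyborg: CyborgStub) -> List[str]:
    """Keep only hosts that can satisfy the GPU request in the image."""
    # "accelerator_type" is a hypothetical image property name.
    if image_props.get("accelerator_type") != "GPU":
        return candidates
    capable = set(cyborg.hosts_with(1))
    return [h for h in candidates if h in capable]


def boot_with_gpu(host: str, cyborg: CyborgStub,
                  start_vm: Callable[[str], bool]) -> Optional[str]:
    """Claim a GPU, start the VM with it, and release the claim on failure."""
    address = cyborg.claim(host)
    if start_vm(address):
        return address
    cyborg.release(host, address)
    return None
```

Claiming before the VM starts keeps two concurrent boots from being handed the same device; the release on failure returns the GPU to the free pool.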
Cyborg Use Case: GPU 2/2
Run-time Attachment (Hot-plug)
Command:
nova accelerator-attach instance_id --type GPU
Difference from boot time attachment:
1. Query nova-db to get the instance location.
2. Call cyborg to get the accelerator list.
3. Add a new XML file and attach it to the VM.
controller: nova-api, cyborg-api, cyborg-conductor, cyborg-db
compute: nova-compute, Hypervisor, cyborg-agent (Driver A, Driver B, Driver C), Accelerators
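The "new XML file" in the last step is presumably a device definition for the hypervisor. Assuming a libvirt-based hypervisor, a PCI pass-through device is described by a `<hostdev>` element; the sketch below builds one from a PCI address. The helper name `hostdev_xml` is hypothetical, and the slides do not show the XML that is actually generated.

```python
import re
import xml.etree.ElementTree as ET

# PCI address in domain:bus:slot.function form, e.g. 0000:3b:00.0
_PCI = re.compile(
    r"^([0-9a-fA-F]{4}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])$"
)


def hostdev_xml(pci_address: str) -> str:
    """Build a libvirt <hostdev> element for a PCI pass-through device."""
    match = _PCI.match(pci_address)
    if match is None:
        raise ValueError(f"not a PCI address: {pci_address!r}")
    domain, bus, slot, function = match.groups()
    hostdev = ET.Element(
        "hostdev", {"mode": "subsystem", "type": "pci", "managed": "yes"}
    )
    source = ET.SubElement(hostdev, "source")
    ET.SubElement(
        source,
        "address",
        {
            "domain": f"0x{domain}",
            "bus": f"0x{bus}",
            "slot": f"0x{slot}",
            "function": f"0x{function}",
        },
    )
    return ET.tostring(hostdev, encoding="unicode")
```

With the libvirt Python bindings, a string like this could be passed to a running domain's `attachDeviceFlags` call for hot-plug; whether the deck's implementation does so is not stated.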
Cyborg Use Case: FPGA
Request-time Programming
1. Use image properties to define the accelerator type and FPGA function.
2. Use the existing glance table for FPGA bitstreams.
Difference from the GPU attachment workflow:
1. Nova-compute calls cyborg and periodically checks the status of bitstream programming.
2. Cyborg gets the bitstream from glance, then programs it to the FPGA.
3. Change the “type” of the FPGA pf/vf.
The type of the pf/vf is changed because the resource to be attached may be different at the hypervisor level. For example, if the FPGA pf/vf is programmed with a NIC bitstream, cyborg should change the type from fpga to smartnic.
controller: nova-api, nova-conductor, nova-scheduler, glance, cyborg-api, cyborg-conductor, cyborg-db (type)
compute: nova-compute, Hypervisor, cyborg-agent (Driver A, Driver B, Driver C), Accelerators
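The polling in step 1 and the type change in step 3 can be sketched as two small helpers. The status strings ("programming", "done", "failed"), the retry limits, and the function-to-type mapping are all assumptions made for illustration; only the fpga to smartnic example comes from the slide.

```python
import time
from typing import Callable, Dict


def wait_until_programmed(get_status: Callable[[], str],
                          interval: float = 1.0,
                          attempts: int = 30,
                          sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll a status callback until programming finishes or we give up."""
    for _ in range(attempts):
        status = get_status()
        if status == "done":
            return True
        if status == "failed":
            return False
        sleep(interval)  # still programming; wait before asking again
    return False  # timed out


# Hypothetical mapping from the programmed function to the device type.
TYPE_AFTER_PROGRAMMING = {"nic": "smartnic"}


def retype(device: Dict[str, str], function: str) -> Dict[str, str]:
    """Return a copy of the device record with its post-programming type."""
    new_type = TYPE_AFTER_PROGRAMMING.get(function, device["type"])
    return {**device, "type": new_type, "function": function}
```

Usage: `retype({"address": "0000:5e:00.0", "type": "fpga"}, "nic")` returns a record whose type is `smartnic`, while a function with no mapping leaves the type as `fpga`.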
4. Demo
VM provisioning with GPU pass-through
Environment
• Lenovo ThinkCloud OpenStack 4.2
– 3 nodes: 1 controller node and 2 compute nodes.
– One compute node with an NVIDIA GPU.
• Demo: VM Provisioning with GPU Pass-through
Topology labels: node-4 (controller), node-5 (compute), node-6 (compute), GPU, Switch, Internet
5. Summary
Summary
• Achievements
– Use cyborg to manage different accelerators in Lenovo products.
– Integrate with nova to form a standard workflow for creating VMs with GPU/FPGA… pass-through.
• Future Work
– Support sharing accelerator hardware among VMs.
– Cyborg-driver support for discovering and storing shared accelerators.
– Application plugin mechanism for cyborg-api, etc.
Q&A
• Jinghua Gao
– Email: [email protected]
– Twitter: @Miss_Coco_Gao
– IRC: coco
– Network acceleration & Datacenter traffic analysis
• Zhenghao Wang
– Email: [email protected]
– IRC: wangzhh
– OpenStack Zun & Cyborg contributor
– Cloud computing researcher at Lenovo