b4m36esw: ef˙cient software - cw.fel.cvut.cz · vm-entry control elds vm-exit information elds...
TRANSCRIPT
Outline
1 Introduction
2 Hardware assisted virtualization
3 Example: Mini VMM with KVM
4 I/O virtualizationDevice emulationVirtioPCI pass-throughSingle-Root I/O VirtualizationInter-VM networking
5 Summary
2 / 27
Introduction
Outline
1 Introduction
2 Hardware assisted virtualization
3 Example: Mini VMM with KVM
4 I/O virtualizationDevice emulationVirtioPCI pass-throughSingle-Root I/O VirtualizationInter-VM networking
5 Summary
3 / 27
Introduction
Virtualization
Definition: Virtualization of the whole computing platform – theoperating system thinks it runs on real hardware, but the hardwareis largely emulated by hypervisor and/or virtual machine monitor(VMM).Virtual machine (VM) vs. Java VM
Java VM interprets Java byte code and interacts with an operatingsystemVM executes native (machine) code and interacts with a hypervisor.
Used since ’70, mostly on IBM mainframesPopek and Goldberg defined requirements for ISA virtualization intheir paper in 1974,x86 became fully virtualizable in 2005.
4 / 27
Introduction
Trap-and-emulate
Unprivileged (VM)
Privileged (HV)
Trap
Emulate
Basic principle of virtualizationPopek and Goldberg: “All sensitive instructions must be privilegedinstructions”
Sensitive instruction: Changes global state or behaves differentlydepending on global state (e.g. cli, pushf on x86)Privileged instruction: Unprivileged execution traps to theprivileged mode (hypervisor, CPU exception)on x86 popf, pushf and few other instructions were not privileged!
pushf stores all flags to stack (including “global” interrupt flag)popf sets IF in privileged mode and ignores it in unprivileged mode(does not trap)
Hypervisor (HV) can emulate the effect of sensitive instructionsdepending on the VM state (not the global state).
5 / 27
Introduction
Trap-and-emulate
Unprivileged (VM)
Privileged (HV)
Trap
Emulate
Basic principle of virtualizationPopek and Goldberg: “All sensitive instructions must be privilegedinstructions”
Sensitive instruction: Changes global state or behaves differentlydepending on global state (e.g. cli, pushf on x86)Privileged instruction: Unprivileged execution traps to theprivileged mode (hypervisor, CPU exception)on x86 popf, pushf and few other instructions were not privileged!
pushf stores all flags to stack (including “global” interrupt flag)popf sets IF in privileged mode and ignores it in unprivileged mode(does not trap)
Hypervisor (HV) can emulate the effect of sensitive instructionsdepending on the VM state (not the global state).
5 / 27
Introduction
Trap-and-emulate
Unprivileged (VM)
Privileged (HV)
Trap
Emulate
Basic principle of virtualizationPopek and Goldberg: “All sensitive instructions must be privilegedinstructions”
Sensitive instruction: Changes global state or behaves differentlydepending on global state (e.g. cli, pushf on x86)Privileged instruction: Unprivileged execution traps to theprivileged mode (hypervisor, CPU exception)on x86 popf, pushf and few other instructions were not privileged!
pushf stores all flags to stack (including “global” interrupt flag)popf sets IF in privileged mode and ignores it in unprivileged mode(does not trap)
Hypervisor (HV) can emulate the effect of sensitive instructionsdepending on the VM state (not the global state).
5 / 27
Introduction
Hypervisor
Privileged code that supervises execution of the VM, i.e. handles traps.Hypervisor types:
HW
Hypervisor
Guest 1 Guest 2
Guest apps Guest appsGuest OS kernel Guest OS kernel
HW
Host OS kernel
Guest
Guest apps
Host apps
Guest OS kernel
Hypervisor
Bare-metal hypervisor (Xen,VMware ESX, ...)
Hosted hypervisor (KVM,VirtualBox)
The boundary is blurry – many bare-metal hypervisors support native apps
6 / 27
Introduction
Virtual Machine Monitor (VMM)
HW
Hypervisor
Guest 1 Guest 2
Guest apps Guest appsGuest OS kernel Guest OS kernel
VMM
HW
Hypervisor
Guest 1 Guest 2
Guest apps Guest appsGuest OS kernel Guest OS kernel
VMM 2VMM 1
Software that emulates HW platform (network, graphics, storage, . . . )Often implemented inside hypervisor ⇒ people confuse VMM with hypervisorsToday’s platforms are complex (e.g. PC bears 40 years heritage)It is more secure to execute the VMM outside of privileged mode (example: KVM &qemu)It is also slower, but see NOVA microhypervisor (TU Dresden), which implementsthis faster.
7 / 27
Introduction
Questions
How many privilege levels we need to implement virtualization?
Two are sufficient, but then, every guest system call, page fault etc.traps from the guest app to the hypervisor, which then arrangesswitch to guest kernel – slow.Hardware assisted virtualization – introduces more privilege levelsand more – see later.
Why is virtualization needed at all? (My personal rant)To some extent because the design of mainstream operatingsystems is not up to the current needs.Current OSes do not offer sufficient isolation of applications andgroups of applications. Many things such as user permissions,apply implicitly to the whole system.Microkernel OSes, which solve this problem, were designed in thepast without much success.Now, people are adding “containers” to mainstream OSes, which ispainful and often with security problems.
8 / 27
Introduction
Questions
How many privilege levels we need to implement virtualization?Two are sufficient, but then, every guest system call, page fault etc.traps from the guest app to the hypervisor, which then arrangesswitch to guest kernel – slow.Hardware assisted virtualization – introduces more privilege levelsand more – see later.
Why is virtualization needed at all? (My personal rant)To some extent because the design of mainstream operatingsystems is not up to the current needs.Current OSes do not offer sufficient isolation of applications andgroups of applications. Many things such as user permissions,apply implicitly to the whole system.Microkernel OSes, which solve this problem, were designed in thepast without much success.Now, people are adding “containers” to mainstream OSes, which ispainful and often with security problems.
8 / 27
Introduction
Questions
How many privilege levels we need to implement virtualization?Two are sufficient, but then, every guest system call, page fault etc.traps from the guest app to the hypervisor, which then arrangesswitch to guest kernel – slow.Hardware assisted virtualization – introduces more privilege levelsand more – see later.
Why is virtualization needed at all? (My personal rant)
To some extent because the design of mainstream operatingsystems is not up to the current needs.Current OSes do not offer sufficient isolation of applications andgroups of applications. Many things such as user permissions,apply implicitly to the whole system.Microkernel OSes, which solve this problem, were designed in thepast without much success.Now, people are adding “containers” to mainstream OSes, which ispainful and often with security problems.
8 / 27
Introduction
Questions
How many privilege levels we need to implement virtualization?Two are sufficient, but then, every guest system call, page fault etc.traps from the guest app to the hypervisor, which then arrangesswitch to guest kernel – slow.Hardware assisted virtualization – introduces more privilege levelsand more – see later.
Why is virtualization needed at all? (My personal rant)To some extent because the design of mainstream operatingsystems is not up to the current needs.Current OSes do not offer sufficient isolation of applications andgroups of applications. Many things such as user permissions,apply implicitly to the whole system.Microkernel OSes, which solve this problem, were designed in thepast without much success.Now, people are adding “containers” to mainstream OSes, which ispainful and often with security problems.
8 / 27
Hardware assisted virtualization
Outline
1 Introduction
2 Hardware assisted virtualization
3 Example: Mini VMM with KVM
4 I/O virtualizationDevice emulationVirtioPCI pass-throughSingle-Root I/O VirtualizationInter-VM networking
5 Summary
9 / 27
Hardware assisted virtualization
Hardware assisted virtualization
Accelerates virtualized executionDifferences between vendors (Intel, AMD, ARM, . . . ), coreprinciples similar:
More privilege levels (x86 – root/non-root, ARMv8 EL0–3)Nested pagingIO virtualization
10 / 27
Hardware assisted virtualization
Intel VMXVMX root operation(host rings 0–3)VMX non-root operation(guest rings 0–3)root→non-root = VM Enter
instructions: vmlaunch, vmresumenon-root→root = VM Exit
instructions: vmresume, vmcallfaults (e.g. I/O)
Guest state
Host state
VM-execution control fields
VM-exit control fields
VM-entry control fields
VM-exit information fields
VMCS (up to 4 KiB – e.g. 1024 B)
VM Control Structure (VMCS)Data structure in memory that controls VMX execution(Re)stores host/guest state“Large structure” ⇒ VM Enter/Exit has overheadThe overhead depends on what is (re)stored from/to VMCS (configurable)
11 / 27
Hardware assisted virtualization
Nested paging & address spaces
Host physical
Guest virtualHost virtualGuest physical
Guest page tablesHost page tables
HW Hypervisor Guest
PDPT PD PT
12 / 27
Hardware assisted virtualization
Memory access overhead
TLB misses and page faults are more expensive in a VM!Page walk in a VM (worst case):
1 Translate PDPT (CR3) address using host page tables (3 memoryaccesses for 3-level page tables)
2 Translate PD address using host page tables (3 accesses)3 Translate PT address using host page tables (3 accesses)
Performance drop up to 15/38% (Intel/AMD)1
Tagged TLBsNo need to flush TLBs on process (or VM) switches (good)Applications share TLBs with hypervisor and VMM (bad)
Use huge pages if possible
1Ulrich Drepper, The Cost of Virtualization, ACM Queue, Vol. 6 No. 1 – 200813 / 27
Example: Mini VMM with KVM
Outline
1 Introduction
2 Hardware assisted virtualization
3 Example: Mini VMM with KVM
4 I/O virtualizationDevice emulationVirtioPCI pass-throughSingle-Root I/O VirtualizationInter-VM networking
5 Summary
14 / 27
Example: Mini VMM with KVM
KVM
Linux-based hosted hypervisorAbstracts hardware-assisted virtualization of differentarchitectures behind ioctl-based API
We will develop a miniature user-space VMMSimplest hardware to virtualize: serial port
1 Setup the VM’s memory2 Load the code to execute3 Run the VM4 Handle the VM Exits and emulate serial port5 Goto 3
See also https://lwn.net/Articles/658511/
15 / 27
Example: Mini VMM with KVM
KVM
Linux-based hosted hypervisorAbstracts hardware-assisted virtualization of differentarchitectures behind ioctl-based APIWe will develop a miniature user-space VMM
Simplest hardware to virtualize: serial port
1 Setup the VM’s memory2 Load the code to execute3 Run the VM4 Handle the VM Exits and emulate serial port5 Goto 3
See also https://lwn.net/Articles/658511/
15 / 27
I/O virtualization
Outline
1 Introduction
2 Hardware assisted virtualization
3 Example: Mini VMM with KVM
4 I/O virtualizationDevice emulationVirtioPCI pass-throughSingle-Root I/O VirtualizationInter-VM networking
5 Summary
16 / 27
I/O virtualization
Network Interface Card (NIC)NIC Registers
Buffer descriptors(in memory)
Packet data
TX headTX tailRX headRX tail
...
...
addrsize, flags
TX operation:1 Write packet data2 Fill in empty buffer descriptor3 Notify NIC by writing TX tail reg
17 / 27
I/O virtualization
Network Interface Card (NIC)NIC Registers
Buffer descriptors(in memory)
Packet data
TX headTX tailRX headRX tail
...
...
addrsize, flags
RX operation:1 Allocate packet buffers
and update buffer descriptors2 Update RX head/tail regs3 On packet RX, NIC generates an interrupt
17 / 27
I/O virtualization
Network Interface Card (NIC)NIC Registers
Buffer descriptors(in memory)
Headres &Packet data
TX headTX tailRX headRX tail
...
...
addr[2]size[2], flags
Scatter-Gather DMA:Final packet composed piecesscattered in memoryTypically header (from OS) anddata (from app) 17 / 27
I/O virtualization » Device emulation
NIC device emulation
Trap accesses to NIC registers (memory-mapped IO)Upon write to TX tail, VMM iterates over queued buffers and sendthem via real NIC (e.g. SOCK_RAW)Multiple packets can be sent with single VM ExitReception works similarly
Not all hardware is “that nice” to virtualizeSeveral VM Exits per TX or RXRegisters that must be trapped are intermixed with non-sensitive(e.g. read-only) registers in a single page
Unnecessary VM Exits for some register accessesVMM must emulate not only RX/TX, but also management
Link negotiation, configuration, . . .Much more complex compared to RX/TX
18 / 27
I/O virtualization » Device emulation
NIC device emulation
Trap accesses to NIC registers (memory-mapped IO)Upon write to TX tail, VMM iterates over queued buffers and sendthem via real NIC (e.g. SOCK_RAW)Multiple packets can be sent with single VM ExitReception works similarly
Not all hardware is “that nice” to virtualizeSeveral VM Exits per TX or RXRegisters that must be trapped are intermixed with non-sensitive(e.g. read-only) registers in a single page
Unnecessary VM Exits for some register accessesVMM must emulate not only RX/TX, but also management
Link negotiation, configuration, . . .Much more complex compared to RX/TX
18 / 27
I/O virtualization » Virtio
Virtio
It is neither easy on necessary to emulate real NICTX, RX and simple configuration (e.g. MAC address) is sufficientWhy to implement different ring-buffer formats?
Virtio2
Universal ring-buffer-based communication between VM and HVUsed for network, storage, serial line, . . .PCI-based probing & configuration – VMs can easily discover virtiodevices
2R. Russell, virtio: Towards a De-Facto Standard For Virtual I/O Devices, ACMSIGOPS Operating Systems Review, 2008
19 / 27
I/O virtualization » Virtio
Virtio
It is neither easy on necessary to emulate real NICTX, RX and simple configuration (e.g. MAC address) is sufficientWhy to implement different ring-buffer formats?
Virtio2
Universal ring-buffer-based communication between VM and HVUsed for network, storage, serial line, . . .PCI-based probing & configuration – VMs can easily discover virtiodevices
2R. Russell, virtio: Towards a De-Facto Standard For Virtual I/O Devices, ACMSIGOPS Operating Systems Review, 2008
19 / 27
I/O virtualization » PCI pass-through
PCI pass-through
Even virtio needs one VM Exit per (a batch of) TX operation(s)If we don’t want VM Exits, we may want to give a VM exclusiveaccess to the NICFew problems to solve. . .
20 / 27
I/O virtualization » PCI pass-through
PCI pass-through graphically
Host physical
Guest virtual
Guest page tablesHost page tables
Hypervisor Guest
PDPT PD PT
Chipset/PCI RootComplex
CPUCPU
interconnect
MemoryMemory bus
DevicePCIe
Host virtualGuest physical
NIC regs
NIC regs
NIC buffs
NIC buffs
HVDevice page tables
Device
Device
NIC regs
NIC buffs
21 / 27
I/O virtualization » PCI pass-through
PCI pass-through graphically
Host physical
Guest virtual
Guest page tablesHost page tables
Hypervisor Guest
PDPT PD PT
Chipset/PCI RootComplex
CPUCPU
interconnect
MemoryMemory bus
DevicePCIe
Host virtualGuest physical
NIC regs
NIC regs
NIC buffs
NIC buffs
HVDevice page tables
Device
Device
NIC regs
NIC buffs
21 / 27
I/O virtualization » PCI pass-through
PCI pass-through graphically
Host physical
Guest virtual
Guest page tablesHost page tables
Hypervisor Guest
PDPT PD PT
Chipset/PCI RootComplex
CPUCPU
interconnect
MemoryMemory bus
DevicePCIe
Host virtualGuest physical
NIC regs
NIC regs
NIC buffs
NIC buffs
HVDevice page tables
Device
Device
NIC regs
NIC buffs
21 / 27
I/O virtualization » PCI pass-through
PCI pass-through graphically
Host physical
Guest virtual
Guest page tablesHost page tables
Hypervisor Guest
PDPT PD PT
Chipset/PCI RootComplex
CPUCPU
interconnect
MemoryMemory bus
DevicePCIe
Host virtualGuest physical
NIC regs
NIC regs
NIC buffs
NIC buffs
HVDevice page tables
Device
Device
NIC regs
NIC buffs
21 / 27
I/O virtualization » PCI pass-through
PCI pass-through graphically
Host physical
Guest virtual
Guest page tablesHost page tables
Hypervisor Guest
PDPT PD PT
Chipset/PCI RootComplex
CPUCPU
interconnect
MemoryMemory bus
DevicePCIe
Host virtualGuest physical
NIC regs
NIC regs
NIC buffs
NIC buffs
HVDevice page tables
Device
Device
NIC regs
NIC buffs
21 / 27
I/O virtualization » PCI pass-through
PCI pass-through graphically
Host physical
Guest virtual
Guest page tablesHost page tables
Hypervisor Guest
PDPT PD PT
Chipset/PCI RootComplex
CPUCPU
interconnect
MemoryMemory bus
DevicePCIe
Host virtualGuest physical
NIC regs
NIC regs
NIC buffs
NIC buffs
HV
Device page tables
Device
Device
NIC regs
NIC buffs
21 / 27
I/O virtualization » PCI pass-through
PCI pass-through graphically
Host physical
Guest virtual
Guest page tablesHost page tables
Hypervisor Guest
PDPT PD PT
Chipset/PCI RootComplex
CPUCPU
interconnect
MemoryMemory bus
DevicePCIe
Host virtualGuest physical
NIC regs
NIC regs
NIC buffs
NIC buffs
HV
Device page tables
Device
Device
NIC regs
NIC buffs
21 / 27
I/O virtualization » PCI pass-through
PCI pass-through
Problems:1 Virtual address space (see previous slide)2 Device interrupts
Host does not know how to acknowledge (silence) the interrupt – ithas no driver for the deviceIt injects interrupt to the VM and returns from IRQ handlerHost is interrupted again, because VM didn’t have chance to run andack the interrupt
Solution: Hardware support for device use in VMsIOMMU (AMD), VT-d (Intel), SMMU (ARM)Mask individual sources of interrupts without understanding thedevice
Hard with PCI, where interrupt lines are shared between devicesPossible with Message Signaled Interrupts (MSI)
22 / 27
I/O virtualization » Single-Root I/O Virtualization
Single-Root I/O Virtualization (SR-IOV)PCI pass-through is nice, but I have more VMs that want to communicate. . .
Each VM has emulated NIC, VMM multiplexes the real NIC between VMs in software
. . . or perform the multiplexing in hardware
NICPF VF VF VF VF VF VF VF VF
Hypervisor
Guest 1 Guest 2
Guest apps Guest appsGuest OS kernel Guest OS kernel
VMM 2VMM 1Guest 3
Guest appsGuest OS kernel
VMM 3
SR-IOVBesides “classic” physical function (PF), NIC implements several virtual functions (VFs)Each VF provides simplified PCI interface and its own RX/TX ring buffers
23 / 27
I/O virtualization » Single-Root I/O Virtualization
Single-Root I/O Virtualization (SR-IOV)PCI pass-through is nice, but I have more VMs that want to communicate. . .
Each VM has emulated NIC, VMM multiplexes the real NIC between VMs in software. . . or perform the multiplexing in hardware
NICPF VF VF VF VF VF VF VF VF
Hypervisor
Guest 1 Guest 2
Guest apps Guest appsGuest OS kernel Guest OS kernel
VMM 2VMM 1Guest 3
Guest appsGuest OS kernel
VMM 3
SR-IOVBesides “classic” physical function (PF), NIC implements several virtual functions (VFs)Each VF provides simplified PCI interface and its own RX/TX ring buffers
23 / 27
I/O virtualization » Inter-VM networking
Inter-VM networking
Hypervisor
Guest 1 Guest 2
Guest apps Guest appsGuest OS kernel Guest OS kernel
VMM 2VMM 1
SW switchShared memwith VMMs
Packet stored in VM’s memoryVMM notified (VM Exit) e.g. via virtio’s kick()VMM notifies the SW switch via standard IPC mechanismSwitch does memcpy() of the packet from source VM to destination VM(into dest NIC ring buffer)Dest VMM notifies the VM (inject interrupt)
24 / 27
I/O virtualization » Inter-VM networking
Optimizations
OS networking stack is responsible for splitting application data topackets (e.g. TCP segmentation) and adding appropriate headersVMM sees many small packets and switch does many smallmemcpy()sReceiver’s networking stack strips packet headers and combinesthe payload to larger data chunks for application.
Segmentation is not necessary for Inter-VM communication(overhead)!Modern NICs support TCP Segmentation Offload (TSO)/LargeReceive Offload (LRO): Segmentation/reconstruction is done inhardware.If virtual NIC supports TSO/LRO, Inter-VM communication is muchfaster.
25 / 27
I/O virtualization » Inter-VM networking
Optimizations
OS networking stack is responsible for splitting application data topackets (e.g. TCP segmentation) and adding appropriate headersVMM sees many small packets and switch does many smallmemcpy()sReceiver’s networking stack strips packet headers and combinesthe payload to larger data chunks for application.
Segmentation is not necessary for Inter-VM communication(overhead)!Modern NICs support TCP Segmentation Offload (TSO)/LargeReceive Offload (LRO): Segmentation/reconstruction is done inhardware.If virtual NIC supports TSO/LRO, Inter-VM communication is muchfaster.
25 / 27
Summary
Outline
1 Introduction
2 Hardware assisted virtualization
3 Example: Mini VMM with KVM
4 I/O virtualizationDevice emulationVirtioPCI pass-throughSingle-Root I/O VirtualizationInter-VM networking
5 Summary
26 / 27