
B4M36ESW: Efficient software
Lecture 12: Virtualization

Michal Sojka

May 15, 2017


Outline

1 Introduction

2 Hardware assisted virtualization

3 Example: Mini VMM with KVM

4 I/O virtualization
  - Device emulation
  - Virtio
  - PCI pass-through
  - Single-Root I/O Virtualization
  - Inter-VM networking

5 Summary


Introduction

Virtualization

- Definition: Virtualization of the whole computing platform. The operating system thinks it runs on real hardware, but the hardware is largely emulated by a hypervisor and/or virtual machine monitor (VMM).
- Virtual machine (VM) vs. Java VM
  - A Java VM interprets Java byte code and interacts with an operating system.
  - A VM executes native (machine) code and interacts with a hypervisor.
- Used since the 1970s, mostly on IBM mainframes.
- Popek and Goldberg defined requirements for ISA virtualization in their paper in 1974.
- x86 became fully virtualizable in 2005.


Trap-and-emulate

[Figure: unprivileged code (VM) traps into privileged code (HV), which emulates the instruction.]

- Basic principle of virtualization
- Popek and Goldberg: “All sensitive instructions must be privileged instructions”
  - Sensitive instruction: changes global state or behaves differently depending on global state (e.g. cli, pushf on x86).
  - Privileged instruction: unprivileged execution traps to the privileged mode (hypervisor, CPU exception).
  - On x86, popf, pushf and a few other instructions were not privileged!
    - pushf stores all flags to the stack (including the “global” interrupt flag, IF).
    - popf sets IF in privileged mode and ignores it in unprivileged mode (does not trap).
- The hypervisor (HV) can emulate the effect of sensitive instructions depending on the VM state (not the global state).
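The last point can be illustrated with a toy model. This is only a sketch: it assumes that every sensitive instruction traps, and the names `VCpu`, `emulate` and `run` are made up for the illustration. The hypervisor applies cli, sti, pushf and popf to a per-VM copy of the flags and never touches the real interrupt flag.

```python
from dataclasses import dataclass, field

IF_BIT = 1 << 9  # position of the interrupt flag in x86 EFLAGS


@dataclass
class VCpu:
    """Per-VM state kept by the toy hypervisor (illustrative names)."""
    flags: int = IF_BIT                       # the guest's *virtual* flags
    stack: list = field(default_factory=list)


def emulate(vcpu: VCpu, insn: str) -> None:
    """Called on a trap: apply the instruction to the VM state only."""
    if insn == "cli":
        vcpu.flags &= ~IF_BIT
    elif insn == "sti":
        vcpu.flags |= IF_BIT
    elif insn == "pushf":
        vcpu.stack.append(vcpu.flags)
    elif insn == "popf":
        vcpu.flags = vcpu.stack.pop()
    else:
        raise ValueError(f"unexpected trap for {insn!r}")


def run(vcpu: VCpu, program: list) -> None:
    """Assumption: every sensitive instruction traps into emulate()."""
    for insn in program:
        emulate(vcpu, insn)


guest = VCpu()
run(guest, ["cli", "pushf", "sti", "popf"])
# popf restored the flags saved while interrupts were disabled,
# so the guest's virtual IF is clear again.
```

On pre-2005 x86 the popf step would not have trapped, so the hypervisor would never have seen the guest's attempt to change IF.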


Hypervisor

Privileged code that supervises execution of the VM, i.e. handles traps.

Hypervisor types:
- Bare-metal hypervisor (Xen, VMware ESX, ...)
  [Figure: HW at the bottom, the hypervisor above it, and Guest 1 and Guest 2 on top, each with its guest OS kernel and guest apps.]
- Hosted hypervisor (KVM, VirtualBox)
  [Figure: HW at the bottom, then the host OS kernel with the hypervisor; host apps and the guest (guest OS kernel, guest apps) run on top.]

The boundary is blurry: many bare-metal hypervisors support native apps.


Virtual Machine Monitor (VMM)

[Figure: two variants on top of HW and the hypervisor, each with Guest 1 and Guest 2 (guest OS kernel, guest apps): a single VMM, or a separate VMM per guest (VMM 1, VMM 2).]

- Software that emulates the HW platform (network, graphics, storage, ...)
- Often implemented inside the hypervisor ⇒ people confuse VMMs with hypervisors
- Today’s platforms are complex (e.g. the PC bears 40 years of heritage)
- It is more secure to execute the VMM outside of privileged mode (example: KVM & qemu)
- It is also slower, but see the NOVA microhypervisor (TU Dresden), which implements this faster.


Questions

How many privilege levels do we need to implement virtualization?

- Two are sufficient, but then every guest system call, page fault etc. traps from the guest app to the hypervisor, which then arranges a switch to the guest kernel: slow.
- Hardware assisted virtualization introduces more privilege levels and more (see later).

Why is virtualization needed at all? (My personal rant)

- To some extent because the design of mainstream operating systems is not up to the current needs.
- Current OSes do not offer sufficient isolation of applications and groups of applications. Many things, such as user permissions, apply implicitly to the whole system.
- Microkernel OSes, which solve this problem, were designed in the past without much success.
- Now, people are adding “containers” to mainstream OSes, which is painful and often comes with security problems.


Hardware assisted virtualization

- Accelerates virtualized execution
- Differences between vendors (Intel, AMD, ARM, ...), core principles similar:
  - More privilege levels (x86: root/non-root, ARMv8: EL0–3)
  - Nested paging
  - I/O virtualization


Intel VMX

- VMX root operation (host rings 0–3)
- VMX non-root operation (guest rings 0–3)
- root→non-root = VM Enter
  - instructions: vmlaunch, vmresume
- non-root→root = VM Exit
  - instructions: vmresume, vmcall
  - faults (e.g. I/O)

[Figure: VMCS (up to 4 KiB, e.g. 1024 B) containing guest state, host state, VM-execution control fields, VM-exit control fields, VM-entry control fields and VM-exit information fields.]

VM Control Structure (VMCS)
- Data structure in memory that controls VMX execution
- (Re)stores host/guest state
- “Large structure” ⇒ VM Enter/Exit has overhead
- The overhead depends on what is (re)stored from/to the VMCS (configurable)


Nested paging & address spaces

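[Figure: address spaces under nested paging. Guest page tables (PDPT, PD, PT) translate guest virtual to guest physical addresses; host page tables translate those (seen as host virtual) to host physical addresses. Layers: HW, hypervisor, guest.]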


Memory access overhead

- TLB misses and page faults are more expensive in a VM!
- Page walk in a VM (worst case):
  1. Translate PDPT (CR3) address using host page tables (3 memory accesses for 3-level page tables)
  2. Translate PD address using host page tables (3 accesses)
  3. Translate PT address using host page tables (3 accesses)
- Performance drop up to 15/38% (Intel/AMD) [1]
- Tagged TLBs
  - No need to flush TLBs on process (or VM) switches (good)
  - Applications share TLBs with hypervisor and VMM (bad)
- Use huge pages if possible
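The worst-case walk can be counted with a few lines of arithmetic. This sketch goes one step beyond the three steps listed above: besides the host walks for the PDPT, PD and PT addresses (9 accesses), it also counts reading the three guest entries themselves and one more host walk for the final data address. It assumes no TLB or paging-structure cache hits; the function names are made up for the illustration.

```python
def native_walk_accesses(levels: int) -> int:
    """Without virtualization: one memory access per page-table level."""
    return levels


def nested_walk_accesses(guest_levels: int, host_levels: int) -> int:
    # Addresses needing a full host walk: one per guest table (CR3/PDPT,
    # PD, PT, ...) plus the final guest-physical address of the data.
    host_walks = guest_levels + 1
    # After each table address is translated, the guest entry is read.
    guest_entry_reads = guest_levels
    return host_walks * host_levels + guest_entry_reads


print(native_walk_accesses(3), nested_walk_accesses(3, 3))  # prints: 3 15
```

With 4-level tables on both sides the same count gives 24 accesses instead of 4, which is one reason huge pages (fewer levels to walk, fewer TLB misses) help.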

[1] Ulrich Drepper, The Cost of Virtualization, ACM Queue, Vol. 6, No. 1, 2008

Example: Mini VMM with KVM


KVM

- Linux-based hosted hypervisor
- Abstracts hardware-assisted virtualization of different architectures behind an ioctl-based API
- We will develop a miniature user-space VMM
- Simplest hardware to virtualize: serial port
  1. Set up the VM’s memory
  2. Load the code to execute
  3. Run the VM
  4. Handle the VM Exits and emulate the serial port
  5. Go to 3

See also https://lwn.net/Articles/658511/
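The five steps can be sketched as a loop. The sketch below does not call KVM: the guest is a plain Python generator and the exit tuples are invented for the illustration. A real VMM would open /dev/kvm and use ioctls such as KVM_CREATE_VM, KVM_SET_USER_MEMORY_REGION, KVM_CREATE_VCPU and KVM_RUN, then dispatch on exit reasons such as KVM_EXIT_IO and KVM_EXIT_HLT.

```python
import io

SERIAL_PORT = 0x3F8  # conventional COM1 data port on a PC


def toy_guest(memory: bytearray):
    """Stand-in for guest code: sends the string in its memory to the serial port."""
    for byte in memory.rstrip(b"\x00"):
        yield ("io_out", SERIAL_PORT, byte)   # would be a VM Exit caused by `out`
    yield ("hlt",)                            # would be a VM Exit caused by `hlt`


def run_vmm(payload: bytes, serial: io.StringIO) -> int:
    """Runs the toy guest to completion and returns the number of exits handled."""
    memory = bytearray(4096)              # 1. set up the VM's memory
    memory[:len(payload)] = payload       # 2. load the "code" to execute
    vcpu = toy_guest(memory)
    exits = 0
    while True:
        reason = next(vcpu)               # 3. run the VM until the next exit
        exits += 1
        if reason[0] == "io_out" and reason[1] == SERIAL_PORT:
            serial.write(chr(reason[2]))  # 4. emulate the serial port
        elif reason[0] == "hlt":
            return exits
        else:
            raise RuntimeError(f"unhandled exit {reason!r}")
        # 5. loop back to step 3


console = io.StringIO()
exit_count = run_vmm(b"Hi\n", console)
```

Note that each output byte costs one exit here, which is exactly the kind of overhead the I/O virtualization part of the lecture tries to reduce.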


I/O virtualization


Network Interface Card (NIC)

[Figure: NIC registers (TX head, TX tail, RX head, RX tail) point into rings of buffer descriptors in memory; each descriptor holds addr, size and flags and refers to packet data. With scatter-gather, a descriptor holds addr[2] and size[2] and refers to headers and packet data separately.]

TX operation:
1. Write packet data
2. Fill in an empty buffer descriptor
3. Notify the NIC by writing the TX tail register

RX operation:
1. Allocate packet buffers and update buffer descriptors
2. Update RX head/tail registers
3. On packet RX, the NIC generates an interrupt

Scatter-Gather DMA:
- Final packet is composed of pieces scattered in memory
- Typically header (from OS) and data (from app)
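A toy model of the TX path with scatter-gather descriptors may make the register/descriptor interplay concrete. It is a sketch only: `ToyNic`, `Descriptor` and `driver_send` are invented names, the "DMA" is a slice of a bytearray, the header is assumed to fit in 256 bytes, and the driver never checks whether the ring is full.

```python
from dataclasses import dataclass


@dataclass
class Descriptor:
    fragments: list          # (addr, size) pairs; two entries model header + data


class ToyNic:
    def __init__(self, memory: bytearray, ring_size: int = 8):
        self.memory = memory
        self.ring = [None] * ring_size
        self.tx_head = 0     # next descriptor the NIC will send
        self.tx_tail = 0     # next free slot for the driver
        self.wire = []       # packets that left the NIC

    def write_tx_tail(self, new_tail: int) -> None:
        """Register write: the NIC sends everything between head and tail."""
        self.tx_tail = new_tail
        while self.tx_head != self.tx_tail:
            desc = self.ring[self.tx_head]
            # Scatter-gather DMA: concatenate fragments scattered in memory.
            packet = b"".join(bytes(self.memory[a:a + s]) for a, s in desc.fragments)
            self.wire.append(packet)
            self.tx_head = (self.tx_head + 1) % len(self.ring)


def driver_send(nic: ToyNic, header: bytes, data: bytes, base: int) -> None:
    mem = nic.memory
    data_addr = base + 256                         # assumes len(header) <= 256
    mem[base:base + len(header)] = header          # 1. write packet data
    mem[data_addr:data_addr + len(data)] = data
    slot = nic.tx_tail
    nic.ring[slot] = Descriptor([(base, len(header)),
                                 (data_addr, len(data))])  # 2. fill in a descriptor
    nic.write_tx_tail((slot + 1) % len(nic.ring))  # 3. notify the NIC


nic = ToyNic(bytearray(4096))
driver_send(nic, b"HDR", b"payload", base=0)
```

The header and payload are never copied into one contiguous buffer by the driver; the NIC gathers them while transmitting.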

I/O virtualization » Device emulation

NIC device emulation

- Trap accesses to NIC registers (memory-mapped I/O)
- Upon a write to TX tail, the VMM iterates over queued buffers and sends them via the real NIC (e.g. SOCK_RAW)
  - Multiple packets can be sent with a single VM Exit
- Reception works similarly
- Not all hardware is “that nice” to virtualize
  - Several VM Exits per TX or RX
  - Registers that must be trapped are intermixed with non-sensitive (e.g. read-only) registers in a single page
    - Unnecessary VM Exits for some register accesses
- VMM must emulate not only RX/TX, but also management
  - Link negotiation, configuration, ...
  - Much more complex compared to RX/TX
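Seen from the VMM, the TX path above might look like the following sketch. The register offset, the class and its methods are hypothetical, the ring does not wrap, and the backend is just a callable standing in for a raw socket. The point is only that queuing packets in guest memory is free, while each trapped register access costs one VM Exit.

```python
TX_TAIL_REG = 0x10  # hypothetical register offset in the emulated NIC's MMIO page


class EmulatedNic:
    def __init__(self, send):
        self.send = send          # backend, e.g. a function writing to a raw socket
        self.queue = []           # buffers the guest prepared in its own memory
        self.tx_head = 0
        self.vm_exits = 0

    def guest_queue_packet(self, packet: bytes) -> None:
        """Plain guest memory write: does not trap."""
        self.queue.append(packet)

    def mmio_write(self, offset: int, value: int) -> None:
        """Trapped access to the register page: costs one VM Exit."""
        self.vm_exits += 1
        if offset == TX_TAIL_REG:
            while self.tx_head < value:        # iterate over all queued buffers
                self.send(self.queue[self.tx_head])
                self.tx_head += 1


sent = []
emulated = EmulatedNic(sent.append)
for pkt in (b"a", b"b", b"c"):
    emulated.guest_queue_packet(pkt)
emulated.mmio_write(TX_TAIL_REG, 3)   # three packets leave with a single VM Exit
```

Writes to any other offset in the same page still increment `vm_exits` without doing useful work, which mirrors the "unnecessary VM Exits" point above.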


I/O virtualization » Virtio

Virtio

- It is neither easy nor necessary to emulate a real NIC
  - TX, RX and simple configuration (e.g. MAC address) are sufficient
  - Why implement different ring-buffer formats?
- Virtio [2]
  - Universal ring-buffer-based communication between VM and HV
  - Used for network, storage, serial line, ...
  - PCI-based probing & configuration: VMs can easily discover virtio devices
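A very reduced model of such a ring-buffer interface is sketched below. It is loosely modeled on virtio's split queue (descriptor table, available ring, used ring), but the Python lists, method names and the `host_tx` backend are simplifications invented here, not the virtio memory layout.

```python
class ToyVirtqueue:
    def __init__(self, size: int = 4):
        self.desc = [None] * size     # descriptor table: buffers owned by the guest
        self.avail = []               # indices the guest offers to the host
        self.used = []                # (index, written_len) pairs returned by the host
        self.free = list(range(size))
        self.kicks = 0

    # --- guest side ---
    def add_buf(self, data: bytes) -> int:
        idx = self.free.pop(0)
        self.desc[idx] = data
        self.avail.append(idx)
        return idx

    def kick(self, host) -> None:
        self.kicks += 1               # the only step that needs a VM Exit
        host(self)

    def reclaim(self) -> None:
        while self.used:
            idx, _ = self.used.pop(0)
            self.desc[idx] = None
            self.free.append(idx)


# --- host side ---
def host_tx(vq: ToyVirtqueue, wire: list) -> None:
    while vq.avail:
        idx = vq.avail.pop(0)
        wire.append(vq.desc[idx])
        vq.used.append((idx, 0))      # nothing written back for a TX buffer


wire = []
vq = ToyVirtqueue()
vq.add_buf(b"x")
vq.add_buf(b"y")
vq.kick(lambda q: host_tx(q, wire))   # one notification for a batch of two buffers
vq.reclaim()
```

Because the same queue shape serves network, storage and serial devices, the guest needs one ring implementation instead of one per emulated device model.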

[2] R. Russell, virtio: Towards a De-Facto Standard For Virtual I/O Devices, ACM SIGOPS Operating Systems Review, 2008


I/O virtualization » PCI pass-through

PCI pass-through

- Even virtio needs one VM Exit per (batch of) TX operation(s)
- If we don’t want VM Exits, we may want to give a VM exclusive access to the NIC
- A few problems to solve...


PCI pass-through graphically

[Figure: PCI pass-through. CPUs, memory (on the memory bus) and the chipset/PCI root complex are connected by the interconnect; the device is attached via PCIe. NIC registers and NIC buffers appear in the guest virtual, guest physical (host virtual) and host physical address spaces, linked by guest page tables (PDPT, PD, PT) and host page tables. Device accesses go through device page tables set up by the HV.]


PCI pass-through

Problems:
1. Virtual address space (see the figure above)
2. Device interrupts
   - Host does not know how to acknowledge (silence) the interrupt: it has no driver for the device
   - It injects the interrupt into the VM and returns from the IRQ handler
   - Host is interrupted again, because the VM didn’t have a chance to run and ack the interrupt

Solution: Hardware support for device use in VMs
- IOMMU (AMD), VT-d (Intel), SMMU (ARM)
- Mask individual sources of interrupts without understanding the device
  - Hard with PCI, where interrupt lines are shared between devices
  - Possible with Message Signaled Interrupts (MSI)
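For the first problem, the address-translation role of the IOMMU can be sketched as a per-device page table. The guest programs the NIC with guest-physical addresses, and the IOMMU turns them into host-physical ones or rejects the access. All names here (`ToyIommu`, `IommuFault`, the device id "nic0") are illustrative, and real IOMMUs use multi-level tables, not a dictionary.

```python
PAGE = 4096


class IommuFault(Exception):
    """Raised when a device touches memory that was not mapped for it."""


class ToyIommu:
    def __init__(self):
        self.tables = {}   # device id -> {guest-physical page -> host-physical page}

    def map(self, device: str, guest_page: int, host_page: int) -> None:
        """Set up by the hypervisor when it assigns the device to a VM."""
        self.tables.setdefault(device, {})[guest_page] = host_page

    def translate(self, device: str, addr: int) -> int:
        """Applied to every DMA address the device issues."""
        page, offset = divmod(addr, PAGE)
        try:
            return self.tables[device][page] * PAGE + offset
        except KeyError:
            raise IommuFault(f"{device}: DMA to unmapped address {addr:#x}") from None


iommu = ToyIommu()
iommu.map("nic0", guest_page=2, host_page=7)
host_addr = iommu.translate("nic0", 2 * PAGE + 5)   # lands in host page 7
```

The fault path is what keeps a guest with a passed-through device from using DMA to read or overwrite memory that belongs to the host or to other VMs.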


I/O virtualization » Single-Root I/O Virtualization

Single-Root I/O Virtualization (SR-IOV)

PCI pass-through is nice, but I have more VMs that want to communicate...
- Each VM has an emulated NIC, and the VMM multiplexes the real NIC between VMs in software
- ... or perform the multiplexing in hardware

[Figure: a NIC with one PF and eight VFs below the hypervisor; Guests 1–3 (guest OS kernel, guest apps), each with its own VMM (VMM 1–3).]

SR-IOV
- Besides the “classic” physical function (PF), the NIC implements several virtual functions (VFs)
- Each VF provides a simplified PCI interface and its own RX/TX ring buffers
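The hardware multiplexing can be pictured as the NIC sorting received frames into per-VF rings. The sketch below assumes sorting by destination MAC address, which is an assumption made here and not stated on the slide; `ToySriovNic` and its methods are invented names.

```python
class ToySriovNic:
    def __init__(self, num_vfs: int):
        self.rx_rings = [[] for _ in range(num_vfs)]   # one RX ring per VF
        self.mac_to_vf = {}

    def assign(self, vf: int, mac: str) -> None:
        """Configuration step, assumed to be done by the host through the PF."""
        self.mac_to_vf[mac] = vf

    def receive(self, dst_mac: str, frame: bytes) -> bool:
        """Demultiplex in 'hardware': no VMM code runs per packet."""
        vf = self.mac_to_vf.get(dst_mac)
        if vf is None:
            return False            # no VF owns this address; the toy NIC drops it
        self.rx_rings[vf].append(frame)
        return True


sriov = ToySriovNic(num_vfs=8)
sriov.assign(0, "52:54:00:00:00:01")
sriov.assign(1, "52:54:00:00:00:02")
sriov.receive("52:54:00:00:00:02", b"for guest 2")
```

Each VF can then be passed through to its guest like a separate PCI device, so the per-packet path avoids the software multiplexing described above.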


I/O virtualization » Inter-VM networking

Inter-VM networking

[Figure: Guest 1 and Guest 2 (guest OS kernel, guest apps) with VMM 1 and VMM 2 on top of the hypervisor; a SW switch shares memory with the VMMs.]

- Packet stored in the VM’s memory
- VMM notified (VM Exit), e.g. via virtio’s kick()
- VMM notifies the SW switch via a standard IPC mechanism
- Switch does memcpy() of the packet from source VM to destination VM (into the dest NIC ring buffer)
- Dest VMM notifies the VM (injects an interrupt)


Optimizations

- The OS networking stack is responsible for splitting application data into packets (e.g. TCP segmentation) and adding appropriate headers
- The VMM sees many small packets and the switch does many small memcpy()s
- The receiver’s networking stack strips packet headers and combines the payload into larger data chunks for the application.
- Segmentation is not necessary for inter-VM communication (overhead)!
- Modern NICs support TCP Segmentation Offload (TSO)/Large Receive Offload (LRO): segmentation/reconstruction is done in hardware.
- If the virtual NIC supports TSO/LRO, inter-VM communication is much faster.
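A small count shows what is saved. The sketch assumes a 64 KiB application write and a 1460-byte segment size (a typical TCP MSS on Ethernet); `segment` and `switch_copy` are illustrative helpers, and the toy switch performs one copy per packet it forwards.

```python
def segment(data: bytes, mss: int) -> list:
    """What the sender's TCP stack does when the NIC offers no TSO."""
    return [data[i:i + mss] for i in range(0, len(data), mss)]


def switch_copy(packets: list) -> tuple:
    """Toy software switch: one memcpy() per packet it forwards."""
    dest_ring = [bytes(p) for p in packets]
    return dest_ring, len(packets)


data = bytes(64 * 1024)
_, copies_segmented = switch_copy(segment(data, 1460))   # without TSO: 45 copies
_, copies_tso = switch_copy([data])                      # virtual NIC with TSO/LRO: 1 copy
```

The number of bytes copied is the same in both cases; the gain comes from doing the per-packet work (headers, notifications, descriptor handling, copy setup) once instead of 45 times.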


Summary

- Virtualization is just “another layer of indirection” and as such it adds overheads
- It is useful to know where the overheads are and how to mitigate them
