
Front. Comput. Sci., 2013, 7(1): 34–43

DOI 10.1007/s11704-012-2084-0

Design and verification of a lightweight reliable virtual machine monitor for a many-core architecture

Yuehua DAI, Yi SHI, Yong QI, Jianbao REN, Peijian WANG

School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an 710049, China

© Higher Education Press and Springer-Verlag Berlin Heidelberg 2012

Abstract Virtual machine monitors (VMMs) play a central role in cloud computing. Their reliability and availability are critical for cloud computing. Virtualization and device emulation make the VMM code base large and the interface between OS and VMM complex. This results in a code base whose security is very hard to verify. For example, a misuse of a VMM hyper-call by a malicious guest OS can corrupt the whole VMM. The complexity of the VMM also makes it hard to formally verify the correctness of the system’s behavior. In this paper a new VMM, operating system virtualization (OSV), is proposed. The multiprocessor boot interface and memory configuration interface are virtualized in OSV at boot time in the Linux kernel. After booting, only inter-processor interrupt operations are intercepted by OSV, which makes the interface between OSV and the OS simple. The interface is verified using formal model checking, which ensures a malicious OS cannot attack OSV through the interface. Currently, OSV is implemented based on the AMD Opteron multi-core server architecture. Evaluation results show that Linux running on OSV has a similar performance to native Linux. OSV has a performance improvement of 4%–13% over Xen.

Keywords virtual machine monitor, model, operating system, many core, formal verification

Received March 14, 2012; accepted June 16, 2012
E-mail: [email protected]

1 Introduction

Virtualization has been widely used in cloud computing. With the help of virtual machine monitors (VMMs), cloud computing can provide dynamic on-demand services for users at an attractive cost. In order to host multiple instances of an operating system on a resource-limited server, traditional VMMs have to virtualize the limited resources, which makes the system much more complex [1,2]. For example, Xen has about 200 k lines of code in the hypervisor itself, and over 1 M lines in the host OS. The I/O devices shared between the virtual machines are virtualized by the VMM, which greatly contributes to the size of the VMM code base. The large code base and complex hyper-call interfaces of the VMM make it hard to give a formal verification of the reliability of the VMM. The best formal verification techniques available today are only able to handle around 10 k lines of code [3]. Furthermore, the complexity of a VMM also makes it hard to ship bug-free code [4]. A malicious guest OS in the VMM can exploit these bugs to attack the VMM and gain control of the whole VMM. The potential risks make many companies hesitant about moving to the cloud [5].

On the other hand, the hardware infrastructure for cloud computing is very powerful. For example, servers with tens of processor cores and hundreds of gigabytes of RAM are common today. The trend indicates that computers with hundreds of cores will appear in the near future [6]. The computing resources in a computer will be abundant, and virtualizing these abundant resources is no longer essential.

In this paper, we present a VMM model called the governor. The governor removes the virtualization layer of current VMMs. By preallocating resources to operating systems, the performance overhead of the VMM can be reduced. An implementation of the governor model, OSV, is presented. Formal verification results show that the VMM is immune to attacks made by malicious guest OSes through the interfaces between OSV and the OS. In particular, we make three contributions in this paper.

First, a reliable micro-VMM model for many-core architectures is proposed. The model avoids virtualization of devices by using existing distributed protocols. This reduces the complexity of the interface between the VMM and the OS. With pre-allocated processor cores and memory on many-core platforms, the guest-OS can access them directly, which improves the overall performance of the guest-OS.

Second, we explore the implications of applying the model to a concrete VMM implementation and present an instance of the governor, OSV. Compared to other solutions, OSV is relatively small and portable, with only about 8 000 lines of code. Our evaluation results show that OSV has a performance improvement of 4%–13% over Xen.

Finally, a formal reliability verification of OSV is given. Due to the large code base and complex interfaces, traditional VMMs are rarely formally verified. By using the SPIN model checker, a formal model for OSV is constructed. The verification results show that OSV is secure and reliable.

In Section 2 we discuss related work on the reliability of VMMs and introduce our motivation. Our proposed governor model is discussed in Section 3. A formal verification of OSV is given in Section 4. The performance of OSV is also discussed in Section 4. Finally, we discuss the limitations and future work in Section 5 and conclude in Section 6.

2 Related work

Although based on a new viewpoint, the governor model is related to much previous work on both VMMs and operating systems.

In order to reduce the complexity of operating systems, micro-kernel operating systems such as Exokernel [7] and Corey [6] have been proposed. These systems can be tuned for performance and reliability. Multikernel [8] is another type of operating system, which treats a multi-core computer as a distributed system and uses message passing as its basic operation.

As VMMs become much more complex, some work has been done on simplifying the VMM. One example is SecVisor [9], a tiny VMM which can support a single guest-OS and is used for the protection of the operating system kernel. TrustVisor [10] is designed to protect the code and data of higher-layer applications. Like SecVisor, TrustVisor can only run a single guest-OS in a protection domain. NoHype [11] is similar to our work in that it removes the virtualization layer of a traditional VMM. With pre-allocated resources, it can host many instances of an operating system. But there are no interactions between NoHype and a running guest OS. It sacrifices the flexibility of the VMM, which is important for cloud computing. BitVisor [12] is a lightweight VMM which encrypts the data communication between the I/O devices and the guest-OS. In order to protect private data passing between the guest-OS and VMM, a nested VMM, CloudVisor [13], was proposed. CloudVisor runs at a higher privilege level than a traditional VMM, such as Xen. With the help of CloudVisor, the guest-OS can keep its data private and prevent it from being leaked to Xen. NOVA [14] is a micro-kernel-like VMM. There is an additional layer called the user-level VMM in NOVA, which is used to provide the services needed by other guest-OSes. The interfaces between NOVA and the guest-OS are as complex as those of a traditional VMM. Despite much work on micro-kernel VMMs, the interfaces in these models cannot be well verified.

There has been some work on formal verification of system software, such as seL4 [15]. seL4 is a micro-kernel operating system, which is well verified from design to implementation. Franklin et al. [16] built a formal model for SecVisor. This model focuses on whether the design principles can protect the operating system kernel. However, our work is more focused on the reliability of the VMM itself.

Micro-kernels and lightweight VMMs reduce the attackable surface and the trusted computing base (TCB). But the lightweight VMMs mentioned above lose some functionality of traditional VMMs. NoHype reduces the attackable surface by removing the interaction between the OS and VMM. It is based on Xen and needs some new hardware architecture support. Xen also contributes to the large TCB of NoHype. In this paper, we propose the OSV VMM based on current x86 processors. OSV removes the virtualization layer, allocates resources with non-uniform memory access (NUMA) awareness, and has a verified interface between the guest OS and OSV. This means OSV has a limited attackable surface and a small TCB.

3 Design and implementation

In this section we present our VMM architecture for multi-core machines, which we call the governor model. In a nutshell, we construct the VMM as a resource guidance tool that allocates resources to the OSes, guards access to these resources, and uses distributed protocols instead of virtualization to multiplex the devices. The design of the governor VMM model is guided by three design principles:

1) Make each OS access resources directly.

2) Affinity-aware allocation of resources to OSes.

3) Utilize existing distributed protocols for resource sharing between OSes.

These principles allow the VMM to benefit from a lightweight system and achieve safety and flexibility with a small code base. Furthermore, protocols and systems developed for existing distributed systems can also be reused in a VMM based on the governor model.

After discussing the principles in detail, we explore the implications of these principles by describing the implementation of OSV, which is a new VMM based on the governor model.

3.1 Application of design principles

3.1.1 Make each OS access resources directly

Within the governor VMM, all resources allocated to OSes can be accessed directly without interception by the VMM. No resource is shared between OSes running in the VMM, except for some interrupts and memory used by the OSes to communicate with each other. Having OSes access resources directly offloads virtualization work from the VMM, which simplifies the VMM.

Virtualization as used in a traditional VMM isolates the OSes from each other and shares the resources between OSes. In order to run as many instances of an OS as possible on a resource-limited machine, a traditional VMM virtualizes the resources so that several OSes can access the same resources at the same time. For example, the scheduler in Xen allows different OSes to run on the same processor core. The OSes access the resources in an exclusive manner, so the VMM must isolate them from each other. In order to protect the OSes from each other, the VMM must virtualize the advanced programmable interrupt controller (APIC) and machine-specific registers in a modern CPU. Therefore, the Xen VMM contains a virtual APIC. Modern processors take efficient virtualization into account: extra privilege levels and additional page table management have been added to the processors. These technologies can improve performance and reduce the complexity of a VMM. But virtualization of the APIC, devices, and memory is still needed, which complicates the VMM and increases its code base.

In order to reduce the complexity of the VMM, an OS should manage its resources by itself. This approach enables the OS to provide more reliable resource management. The resources allocated to an OS can only be accessed by that OS. This means that the applications running in one OS will not be affected by applications running in another OS. Furthermore, this approach makes the VMM more reliable. A traditional VMM intercepts some operations of the OS, such as the syscall instruction issued by an application for a system call. Attackers may compromise the VMM through these interfaces [11,17], by which they can gain control of all the OSes running in the VMM. Running the OS directly on the resources allocated to it reduces these attack interfaces and makes the system more reliable.

Finally, running the OS directly on hardware in the VMM promises good service for cloud computing [11]. Quality of service (QoS) is very important in cloud computing. Users can subscribe to a virtual machine from a cloud computing provider. OSes running in a traditional VMM will affect each other and reduce performance when they request the same resources from the VMM. This has a negative impact on QoS. A VMM based on direct resource access provides good performance isolation and QoS for OSes.

3.1.2 Affinity-aware allocation of resources to OSes

Processor cores and memory are critical to the performance of an OS. The communication latency between processor cores differs from the latency between processor cores and memory, depending on the topology of the system. Processor cores that share a cache have the lowest communication latency. Accessing memory connected to the processor core’s native memory controller is faster than remote access [6].

An affinity-aware governor VMM allocates resources to OSes accordingly. Processor cores allocated to a single OS must share a cache (contemporary processor cores in a die share the L3 cache). The memory allocated to one OS should lie in the same NUMA node. This has several important potential benefits.

Firstly, allocating processor cores and memory with location awareness will improve the communication performance of the OS and reduce the latency of accessing memory. In particular, inter-core communication can be more efficient. This is helpful for the performance of the OS.

Secondly, running an OS on processor cores sharing the same cache will improve the cache hit rate. The cache will not be polluted by applications from other OSes. In this way, the shared data in an OS are stored in the cache shared by the processor cores. If one processor modifies the data, the operation is performed in the shared cache. This avoids cache migrations. Frequent cache migrations will cause cache ping-pong, which is harmful to application performance.

Thirdly, this approach makes the VMM suitable for heterogeneous processor architectures, in which GPU or DSP cores are integrated into the processor die. Both GPU and CPU cores have their own local or shared memory for their own use. The memory allocation policy for these cores has the most significant impact on performance. With an affinity-aware allocation policy, the performance will be improved.
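To make the principle concrete, the following is a minimal sketch of what an affinity-aware allocator could look like. The data structures and function names are our own illustration and are not taken from the OSV source; they only show the idea of placing a guest's cores and memory on one NUMA node.

#include <stdint.h>

#define MAX_NODES 8

/* Illustrative per-node bookkeeping: which cores are free and how much
 * local memory is left. These types are hypothetical, for exposition only. */
struct numa_node {
    uint32_t free_core_mask;     /* bit i set => core i of this node is free */
    uint64_t free_mem_bytes;     /* unallocated memory local to this node    */
};

struct allocation {
    int      node;               /* NUMA node the guest is placed on, -1 on failure */
    uint32_t core_mask;          /* cores granted to the guest                      */
    uint64_t mem_bytes;          /* memory granted, all local to the chosen node    */
};

static int count_bits(uint32_t x) { int n = 0; while (x) { n += x & 1u; x >>= 1; } return n; }

/* Place a guest on a single node that can satisfy both the core and the
 * memory request, so the cores share a last-level cache and only touch
 * local memory. */
struct allocation alloc_affine(struct numa_node nodes[MAX_NODES],
                               int cores, uint64_t mem)
{
    struct allocation a = { -1, 0, 0 };
    for (int i = 0; i < MAX_NODES; i++) {
        if (count_bits(nodes[i].free_core_mask) < cores ||
            nodes[i].free_mem_bytes < mem)
            continue;                            /* this node cannot host the guest */
        a.node = i;
        for (int c = 0; c < 32 && cores > 0; c++) {
            if (nodes[i].free_core_mask & (1u << c)) {
                a.core_mask |= 1u << c;
                cores--;
            }
        }
        a.mem_bytes = mem;
        nodes[i].free_core_mask &= ~a.core_mask; /* mark the cores as taken */
        nodes[i].free_mem_bytes -= mem;          /* and the memory, too     */
        break;
    }
    return a;
}

Spilling over to a second node when no single node is large enough, as OSV does for processor cores, would be a straightforward extension of this loop.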

3.1.3 Utilize existing distributed protocols for resource sharing between OSes

Operating systems running in the VMM need to share resources such as disks, memory, and processors with each other. A traditional VMM provides a specific driver for each device, with which an OS can share the device. Such a device driver has two parts: a daemon server running in the privileged OS for processing requests, and driver clients in the guest-OSes for sending requests to the server, as in the Xen front-end and back-end driver model. This driver model increases the complexity of the VMM.

Distributed protocols are common in modern OSes for sharing resources in a distributed environment. These protocols are designed to work in networked systems. The governor can use these protocols for device sharing among the OSes by providing a virtual network interface card (NIC). In general, an OS can access the devices exported by another OS through such distributed protocols. The security and reliability of the distributed system can be inherited by the VMM.

Most of the devices in the machine can be exported by existing distributed protocols. Disks can be shared via the network file system (NFS). The keyboard, mouse, and displays can be shared through Virtual Network Computing (VNC). A protocol for accessing a remote system’s 3D graphics card is under development [18]. All these protocols are common in recent OSes, such as Linux and Windows.

The performance of devices accessed via distributed protocols can be optimized through the virtual NIC. The virtual NIC can achieve low latency and high bandwidth, and the protocols themselves can also be optimized. In traditional distributed environments, the network is not reliable; within the governor, however, the virtualized network is reliable, so the protocols can be simplified for better performance. By using existing protocols, the governor can be small and flexible. Furthermore, without the need for device drivers, the governor cannot be attacked through the device drivers. This makes the governor more reliable and secure.

3.1.4 Applying the model

Like all models, the governor, while theoretically elegant, takes an idealist position: no device should be virtualized and the governor should never intercept the OS. This has several implications for a real VMM.

As discussed previously, the OS manages the processor cores using the logical APIC ID and the physical ID. The physical ID is unique in the machine, but different OSes can use the same logical ID for different processor cores. When one OS sends an inter-processor interrupt (IPI) using a logical ID, other OSes will be confused and will misbehave. However, an idealized governor model does not intercept the OS.

In order to make the OS work correctly with logical IDs, the VMM must either intercept the OS operations that manage the processor core IDs, or the source code of the OS must be modified. The VMM must replace the logical ID with the physical ID in the IPI. For each OS, a list must be managed by the VMM that maps each logical ID to the corresponding physical ID. The logical ID of a CPU in an OS does not change once the system is initialized, so there is little overhead for managing the mapping list. On the other hand, when sending an IPI, the OS has to trap into the VMM to change the logical ID into the physical ID, which induces some latency for the IPI.
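As an illustration of this mapping step, the trap handler only has to perform a constant-time table lookup per IPI. The names below are hypothetical and not OSV's actual code; they sketch the translation under the assumption of a fixed per-guest table filled once at boot.

#include <stdint.h>

#define MAX_VCPUS 32

/* Hypothetical per-guest mapping table, filled once when the guest boots.
 * logical_to_physical[l] holds the machine-wide physical APIC ID that the
 * guest knows by the guest-local logical ID l. */
struct guest_ipi_map {
    uint8_t logical_to_physical[MAX_VCPUS];
};

/* Invoked from the IPI intercept: rewrite the destination before the
 * interrupt is actually delivered. Because the mapping never changes after
 * initialization, the only cost is the trap into the VMM itself. */
uint8_t translate_ipi_dest(const struct guest_ipi_map *m, uint8_t logical_id)
{
    return m->logical_to_physical[logical_id % MAX_VCPUS];
}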

From a research perspective, a legitimate question is to what extent a real implementation can adhere to the model, and the consequent effect on system performance and reliability. We are implementing OSV, a substantial prototype VMM structured according to the governor model. The goals for OSV are:

• To demonstrate the governor model approach on the current x86 architecture;

• To make the VMM lightweight, more reliable, and secure;

• To utilize existing software for sharing the devices without new drivers.

3.2 Implementation

The OSV VMM is implemented as a multithreaded program running on multi-core processors. The current implementation of OSV is based on AMD processors; a port to Intel processors is in progress. OSV differs from existing systems in that it pays more attention to resource isolation than to virtualizing these resources. For example, OSV does not contain any structure for virtualizing the processor cores and main memory. The code size of OSV is about 8 000 lines; it is easy to examine this code line by line. The overall architecture of OSV is shown in Fig. 1. The current implementation is based on AMD Opteron processors with SVM [19] technology support and can support multiple Linux operating system instances. The components used by OSV to host multiple operating systems are as follows:

• NUMA nodes  Each operating system can access the memory belonging to a NUMA node. The memory in other nodes is invisible to the OS. This is initialized in the E820 memory map.

• Paging  For the privileged operating system, paging works the same as on bare metal. For other OSes, OSV manages their memory using a nested page table (NPT). The NPT maps OS physical addresses to machine physical addresses. In the AMD processor, the NCR3 register is used to store the NPT address, and the processor computes the mapping automatically as in traditional page mapping. The NPT is initialized based on the physical memory allocated to the OS. Thus, the OS does not cause any NPT page fault while running. If the OS demands more memory, it must request it from OSV. OSV prepares the newly allocated pages in the NPT, and the OS accesses these pages using mmap.

• Multi-processor  The processor cores are allocated to an OS in a NUMA-node-aware fashion. If an OS demands more cores than a NUMA node provides, OSV allocates all cores in a NUMA node and then extra cores from another NUMA node. The memory allocated to an OS comes from the NUMA node that has the most cores allocated to the OS. This can reduce remote cache access and memory access latency.

• Interrupt  All I/O interrupts are delivered to the privileged OS. Other OSes access the I/O through distributed protocols.

• Timer  The external timer interrupt is dispatched by the privileged OS through the IPI.

• Network interface card  The privileged OS controls all the network cards. A virtual network card (VNIC) is provided to each OS. The VNIC is based on shared memory: OSV allocates a memory region, and each OS maps the memory region into its address space using mmap. OSV also constructs the mapping for these pages in the corresponding NPT. When transmitting data, the OS can read and write the memory region directly without trapping into OSV. The OS polls its data using a timer (a sketch of such a shared-memory ring is given after Fig. 1).

• Disks, etc.  These devices are exported as services by the privileged OS. Other OSes can access these services through standard distributed protocols.

Fig. 1 The architecture of the OSV VMM
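The shared-memory VNIC mentioned in the list above could be organized, for example, as a simple polled ring. The following sketch is an assumption-laden illustration of that idea rather than the actual OSV data structure: the layout, names, and slot count are hypothetical.

#include <stdint.h>
#include <string.h>

#define RING_SLOTS 64
#define MTU        1518

struct vnic_slot {
    volatile uint32_t len;          /* 0 means the slot is empty          */
    uint8_t  frame[MTU];
};

struct vnic_ring {                  /* lives in the shared memory region  */
    volatile uint32_t head;         /* next slot the sender will fill     */
    volatile uint32_t tail;         /* next slot the receiver will drain  */
    struct vnic_slot slot[RING_SLOTS];
};

/* Sender side: runs entirely in the guest, no trap into the VMM. */
int vnic_send(struct vnic_ring *r, const void *buf, uint32_t len)
{
    uint32_t h = r->head;
    if (len > MTU || r->slot[h % RING_SLOTS].len != 0)
        return -1;                  /* ring full or frame too large       */
    memcpy(r->slot[h % RING_SLOTS].frame, buf, len);
    r->slot[h % RING_SLOTS].len = len;
    r->head = h + 1;
    return 0;
}

/* Receiver side: called from a periodic timer, matching the polling
 * scheme described above. The buffer must hold at least MTU bytes. */
int vnic_poll(struct vnic_ring *r, void *buf)
{
    uint32_t t = r->tail;
    uint32_t len = r->slot[t % RING_SLOTS].len;
    if (len == 0)
        return 0;                   /* nothing pending                    */
    memcpy(buf, r->slot[t % RING_SLOTS].frame, len);
    r->slot[t % RING_SLOTS].len = 0;
    r->tail = t + 1;
    return (int)len;
}

In OSV the region itself would be established once by the VMM (mapped via mmap and entered into both guests' NPTs), after which send and poll never trap into the VMM.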

3.3 Booting procedure

Currently, OSV can run up to 32 Linux kernels concurrently on a 32-core server. OSV boots Linux using the 32-bit boot protocol defined by Linux. A boot_params structure is defined in Linux for the 32-bit boot protocol, and OSV provides a boot_params structure for each Linux instance. The boot_params structure includes the memory information and kernel information. OSV fills in the memory information in boot_params. Linux reads the configuration information of the virtual machine from boot_params and then initializes its internal structures. Linux kernels will only access memory based on the boot_params information.
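As a rough illustration of this step, the sketch below fills a simplified stand-in for boot_params with a single usable RAM range. The real structure and E820 entry layout are defined by the Linux boot protocol and contain many more fields; the "_sketch" types here are hypothetical.

#include <stdint.h>

struct e820_entry_sketch {
    uint64_t addr;                /* start of the range                    */
    uint64_t size;                /* length in bytes                       */
    uint32_t type;                /* 1 = usable RAM in the E820 convention */
};

struct boot_params_sketch {
    uint32_t e820_entries;
    struct e820_entry_sketch e820_map[8];
    uint64_t cmdline_addr;        /* kernel command line, also per guest   */
};

/* Describe exactly one usable RAM range: the region pre-allocated to this
 * guest. Memory belonging to other guests or to OSV is simply not listed,
 * so the kernel never touches it. */
void fill_boot_params(struct boot_params_sketch *bp,
                      uint64_t guest_ram_base, uint64_t guest_ram_size,
                      uint64_t cmdline_addr)
{
    bp->e820_entries     = 1;
    bp->e820_map[0].addr = guest_ram_base;
    bp->e820_map[0].size = guest_ram_size;
    bp->e820_map[0].type = 1;     /* usable RAM */
    bp->cmdline_addr     = cmdline_addr;
}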

Multi-processor support is standard in commodity operating systems. The Linux kernel gets the multi-processor information from either ACPI or the Intel MultiProcessor Specification tables. Multi-processor boot and initialization in Linux are based on this information. OSV provides a predefined multi-processor configuration table for each Linux instance and configures the table based on the CPU cores allocated to that Linux OS. The unallocated CPU cores are masked in this table, so the OS will only initialize the CPU cores assigned to it. The kernel can then boot and run applications.
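The core-masking idea can be illustrated as follows. The entry format here is a deliberately simplified stand-in, not the real Intel MP table layout.

#include <stdint.h>

#define MAX_CORES 32

struct mp_cpu_entry_sketch {
    uint8_t apic_id;
    uint8_t enabled;              /* 0 = masked out, the kernel ignores it */
};

/* Build the per-guest processor list: only cores allocated to this guest
 * appear as enabled, so the kernel never tries to boot the others. */
void build_mp_table(struct mp_cpu_entry_sketch table[MAX_CORES],
                    uint32_t allocated_core_mask)
{
    for (int i = 0; i < MAX_CORES; i++) {
        table[i].apic_id = (uint8_t)i;
        table[i].enabled = (allocated_core_mask >> i) & 1u;
    }
}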

Multi-processor boot on x86 is based on IPIs. The booting processor (BP) sends an INIT IPI to each application processor (AP). When an AP receives the INIT IPI, the processor is reset and starts executing from a dedicated address. If OSV directly used the INIT IPI, the reset would clear all the internal state initialized by OSV, and OSV would lose control of the processor. Thus, we intercept the INIT IPI in OSV: the INIT IPI is redirected into an exception. Once OSV catches the exception, it initializes the processor for Linux and sets the IP register to the address of the Linux multi-processor booting code. After these operations, the AP executes the Linux kernel.
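A hedged sketch of this redirection is given below; the names are hypothetical and the real handler is driven by the SVM intercept machinery, but the control flow mirrors the description above.

#include <stdint.h>

/* Hypothetical per-AP bookkeeping; the real state lives in OSV's SVM
 * control structures. */
struct ap_state {
    uint64_t guest_rip;          /* where the guest will resume execution */
    int      started;
};

/* Called when OSV catches the exception that the INIT IPI was redirected
 * into. No real reset has happened, so OSV's own per-core state survives;
 * only the guest-visible registers are prepared for Linux's secondary-boot
 * (trampoline) code. */
void handle_redirected_init(struct ap_state *ap, uint64_t linux_trampoline)
{
    if (ap->started)
        return;                  /* ignore a repeated INIT for a running AP */
    ap->guest_rip = linux_trampoline;
    ap->started   = 1;
    /* ...re-enter the guest on this core through the normal VM-entry path,
     * which loads guest_rip into the AP's instruction pointer... */
}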

4 Modeling and evaluation of OSV

In this section we present an overview of the model for OSV and detail our formal OSV model, including the state transition system and isolation properties.

4.1 Overview of the model

The memory and interrupt systems are the only two interfaces used in OSV for communication between OSes. The reliability of these two systems is critical for the reliability of the VMM. In this context, we first describe the relevant memory protection data structures, the entities that use and manipulate these data structures, and the functional relationships between these entities. For the interrupt system, the model focuses on the inter-processor interrupt (IPI). This is because, in the governor model, the IPI is the only interrupt source that can be dispatched by an OS running in OSV, whereas external interrupts are all dedicated to a specific OS. The IPI protection data structures and the functional relationships between these structures are also described.

4.2 Modeling data

We use the SPIN [20] model checker for our OSV VMM. SPIN uses PROMELA as its verification language. In order to model the hardware platform, memory mapping, and other data structures, a number of basic data types and macros are defined.

The system is modeled as a record data type containing physical memory, a CPU mode corresponding to either VMM or guest mode, an ip pointer to the instruction address, and an integer specifying the guest-OS id. All capitalized words are pre-defined integer constants.

typedef sys
{
    bit mode;
    unsigned ip : 6;
    byte osid;  // address id, used to identify the guest id; 0 for VMM
};

The physical memory is modeled as an array of bits, one bit for each page. We model virtual memory at the granularity of pages with the page table entry (PTE). A PTE includes a read/write bit, an execute bit, and the physical address of the entry. The nested page tables (NPT) used by OSV to manage the guest-OS memory are modeled as an array of PTE entries.

bit mem[MEM_INDEX];

Some macros are pre-defined to specify the system’s state, as follows:

#define r (sys.mode==GUEST_MODE)
#define p (mem[VMM_CODE]==1)
#define q (mem[VMM_DATA]==1)
#define w (sys.mode==VMM_MODE)

Here “r” denotes that the system is in GUEST_MODE and “w” that it is in VMM_MODE. The VMM code being accessed is denoted by “p”, and the VMM data being accessed by “q”.

4.3 Memory system

A traditional VMM provides complex interfaces to the guest OS. A malicious guest-OS may compromise the VMM through these interfaces, so that it can then execute malicious code with VMM privileges. The interfaces between OSV and the guest OS consist of only four simple hyper-calls, get_did, mem_map, send_ipi, and guest_run, and one exception handler, npt_fault. In this section we give a formal model of the hyper-calls and the exception handler, except for send_ipi, which is discussed in the subsequent section.
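For illustration, such a small hyper-call surface could be dispatched from a single table on the OSV side. Only the hyper-call names below come from the paper; the prototypes, numbering, and stub bodies are assumptions made for the sketch.

#include <stdint.h>

typedef int64_t (*hypercall_fn)(uint64_t arg1, uint64_t arg2);

/* Stub handlers standing in for the real implementations. */
static int64_t hc_get_did(uint64_t a1, uint64_t a2)   { (void)a1; (void)a2; return 0; } /* caller's domain id   */
static int64_t hc_mem_map(uint64_t a1, uint64_t a2)   { (void)a1; (void)a2; return 0; } /* add pages to the NPT */
static int64_t hc_send_ipi(uint64_t a1, uint64_t a2)  { (void)a1; (void)a2; return 0; } /* checked inter-OS IPI */
static int64_t hc_guest_run(uint64_t a1, uint64_t a2) { (void)a1; (void)a2; return 0; } /* start a prepared OS  */

enum { HC_GET_DID, HC_MEM_MAP, HC_SEND_IPI, HC_GUEST_RUN, HC_MAX };

static const hypercall_fn hc_table[HC_MAX] = {
    [HC_GET_DID]   = hc_get_did,
    [HC_MEM_MAP]   = hc_mem_map,
    [HC_SEND_IPI]  = hc_send_ipi,
    [HC_GUEST_RUN] = hc_guest_run,
};

/* Entry point reached when a guest issues a hyper-call: anything outside
 * the four known numbers is rejected, keeping the attackable surface small. */
int64_t hypercall_dispatch(uint64_t nr, uint64_t a1, uint64_t a2)
{
    if (nr >= HC_MAX)
        return -1;
    return hc_table[nr](a1, a2);
}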

In the VMM mode, the CPU should not execute any code in the guest-OS memory region. In addition, the VMM cannot be modified by the guest-OS. We specify this invariant as

#define code_invt []((!(r && (p||q))) && (!(w && p)))

There are two atomic operations in the OSV VMM, vmm_run and vmm_exit. vmm_run is called when a transition to VMM mode occurs; on a transition to guest mode, vmm_exit is called. The PROMELA code is listed as follows:

inline vmm_run()
{
    computer_sys.mode = VMM_MODE;
    computer_sys.ip = VMM_CODE;
    computer_sys.osid = 0;
    npt[VMM_CODE] = X|R;
    npt[VMM_DATA] = RW|NO_X;
    npt[OS_CODE] = RW|NO_X;
    npt[OS_DATA] = RW|NO_X;
}

inline vmm_exit()
{
    computer_sys.mode = GUEST_MODE;
    npt[VMM_CODE] = NO_RW|NO_X;
    npt[VMM_DATA] = NO_RW|NO_X;
    npt[OS_CODE] = R|X;
    npt[OS_DATA] = RW|NO_X;
}

The model begins with a call to the init function, which makes a call to system_run. Then the model launches all guest-OSes. In order to simplify the model, we only define two guest-OSes.

The total number of states in the model is 25 483. The verification results show that no error state occurred, which indicates that code_invt is satisfied in all situations. This means that a malicious OS cannot attack OSV through the interfaces.

4.4 Interrupt system

In OSV, the IPI is used for signals and notification of guest-OSes. The guest-OSes can also communicate with each other via the IPI. A misused IPI can cause other guest-OSes to become corrupted or to be kept busy responding to IPIs, which results in denial of service (DoS). We enforce that an IPI can only be sent from and to authorized guest-OSes. In this way, we can prevent malicious guests from attacking other guests through the IPI interfaces.

guests through IPI interfaces.

In order to model the IPI, additional data types are pre-defined. Sending an IPI on current x86 platforms is based on the advanced programmable interrupt controller (APIC) in the processor cores. We model the APIC as an array of registers. The register at index 0x300 is used to send the IPI, and the destination of the IPI is specified in register 0x310. The data types are as follows:

typedef general_regs{
    unsigned reg[APIC_REG];
    unsigned index;
};

#define ltl_r ((regs.index==REG_300))
#define ltl_p des_id
#define ltl_q auth_des_id

We specify that a guest-OS must send the IPI only to the processor cores authorized by the OSV VMM. When a guest-OS writes to the 0x300 register to send the IPI, the actual destination must be equal to the id authorized by the OSV VMM. This can be defined as an invariant:

ltl_r { <>[] (ltl_p==ltl_q) }

The total number of states in this model is 26 735. There is no error state. The results show that the IPI system is reliable.

4.5 Reliability discussion

A VMM may be compromised through its interfaces or through bugs in its code base. A large VMM code base may have many bugs and presents a large attackable surface. Software engineers estimate that the density of bugs in production-quality source code is about one to ten bugs per 1 000 lines of code [4].

OSV removes the virtualization layer of traditional VMMs. In OSV, device multiplexing is implemented using existing distributed protocols, and OSV avoids intercepting some privileged instructions. The interfaces between the guest OS and OSV are simple, with only four hyper-calls. All of these approaches reduce the number of lines of code in OSV. The code base implementing OSV is about 8 000 lines of code, while Xen has about 200 k lines. This reduces the attackable surface of OSV.

4.6 Performance evaluation

We measure the overhead of virtualization through a set of operating system benchmarks. The performance is compared with Xen and a native Linux kernel. The experiments are performed on two servers. One is a Dell T605 with two quad-core Opteron 2350 processors at 2.0 GHz, 16 GB RAM, a Broadcom NetXtreme 5722 NIC, and a 146 GB 3.5-inch 15 k RPM SAS hard drive. The other is a Sun x4600 M2 with eight quad-core Opteron 8478 processors at 2.8 GHz, 256 GB RAM, and 4×256 GB HDDs arranged in RAID 0. The Dell machine has two NUMA nodes; each node has a quad-core processor and 8 GB RAM. The Sun x4600 M2 server has eight NUMA nodes; each node has a quad-core processor and 32 GB RAM. Linux kernel 2.6.31 is used, compiled for the x86_64 architecture, and NFS version 4.1 is used. The Opteron processors in these machines support SVM and NPT, which are essential for OSV. To measure performance we use lmbench [21]. Linux running on OSV and Xen is allocated a NUMA node, including its processor cores and memory. For both servers, each NUMA node has four processor cores.


Figure 2 shows the local communication latency. The tests evaluate the latency of two processes exchanging data through three different methods: shared memory, a pipe, and the AF_UNIX protocol. Xen HVM-based guest OSes achieve a similar performance to OSV kernels and perform better than Linux and the PVM guests. This is mainly caused by the NUMA architecture of the multi-core server: the time cost for local access is smaller than for remote access. The resources used by the OSV kernel and Xen HVM are bound to a NUMA node so that CPU cores only access local memory. Thus, the performance of OSV and that of Xen HVM guests is better than native Linux. Xen HVM has a similar latency to Linux and OSV because these operations for Xen HVM are not intercepted by Xen. The latency of Xen PVM guests is about three to seven times that of OSV and Linux, because Xen PVM is limited by the intervention of Xen when accessing system resources.

Fig. 2 Local communication latencies

System call latency is critical to an application’s performance. This benchmark tests some common system call latencies: the null system call, I/O system calls, file open/close operations, process fork, and exec. The benchmark results are listed in Fig. 3. Raw Linux has the lowest latency for system calls, while the OSV kernel and the HVM guest on Xen have similar results. The privileged domain of the OSV kernel, called domain 0, has similar performance to raw Linux. The normal domain of the OSV kernel, called domain 1, has very high latency in the open/close test. This is mainly caused by access to NFS: when opening and closing a file, domain 1 needs two more network connections to finish the job, which introduces high latency. PVM guests on Xen have high latency in all tests. This is caused by the intervention of Xen when the PVM guest performs some privileged operations.

Fig. 3 Processor and system call latency

Figure 4 shows the SPEC_int benchmark results. SPEC_int evaluates the overall performance of a system. The perlbench test measures the execution performance of the Perl scripting language. The bzip2 and h264ref tests evaluate the compression speed of the system. The gcc and xalancbmk tests measure the speed of code generation and XML processing. These five tests are bound by system resources. Mcf, gobmk, hmmer, sjeng, and astar are computing intensive; they are used in artificial intelligence and path searching. Omnetpp is a discrete event simulation test: it models a large Ethernet campus network and is computing intensive. The results are normalized into speedup ratios and show the relative performance increase or decrease of Xen/OSV with respect to native Linux. The grey bar represents Linux on Xen, and the black bar Linux on OSV. For most of the benchmarks, OSV and Xen exhibit some performance overhead. For the perlbench, sjeng, and gobmk benchmarks, Xen and OSV show some performance improvement. There are two reasons: A) these benchmarks have few system calls; B) both Xen and OSV are configured in a single NUMA node, while native Linux has some accesses across NUMA nodes’ memory, and the latencies for accesses across NUMA nodes are higher than for local accesses [6].

Fig. 4 The SPEC_int speedup compared to native Linux

For benchmarks with frequent system calls and memory operations, such as gcc, omnetpp, and libquantum, the performance of OSV is much better than that of Xen. The virtualization of system calls in Xen induces significant latencies; for example, the syscall instruction is intercepted in Xen, which contributes to the system call latency.

5 Limitations and future work

The VMM we propose in this paper is mainly focused on reducing the attackable surface and the trusted computing base. OSV removes the virtualization layer of traditional VMMs. Compared with Xen and VMware, OSV is not capable of server consolidation: on a single server, OSV runs multiple OSes concurrently with static resource allocation. The interactions between the guest OS and OSV are also removed in OSV. In this way, OSV achieves a performance similar to native Linux. Thus, OSV is suitable for real-time workloads that have fixed resource demands. Dynamic resource reallocation and scheduling are not supported in OSV. With the help of Xen, OSV could provide the OS with more virtual cores and memory, as well as live migration and dynamic scheduling. In order to make OSV more useful in cloud computing, we are porting Xen to run on top of OSV using nested virtualization technology.

The security problems of a VMM are mainly caused by the complex interfaces between the OS and the VMM [11,22–24]. In this paper, we have verified the interfaces of OSV. This ensures that the OS cannot attack OSV through these interfaces.

The reliability of OSV itself is important, and a system fully verified from design to implementation can be kept bug-free [16]. The internal states of the VMM are not yet verified. As future work, we will provide a full formal verification of OSV to keep the system bug-free and reliable. Furthermore, we will add support for recovery from VMM corruption, which will make the system more robust.

Currently OSV is based on NFS, and NFS on OSV runs over TCP/IP on the VNIC. The NFS latency in OSV is caused by the VNIC. An inter-OS communication socket based on OSV has been implemented [25]. As future work, we will implement NFS on top of this socket to improve NFS performance.

6 Conclusion

The reliability of the VMM is a key factor for cloud computing. In this paper we propose a reliable and tiny VMM model, the governor, and present OSV, an implementation of the governor model. OSV has about 8 000 lines of code and a reduced attackable surface. The formal verification of OSV shows that the communication interface of OSV is well defined and safe, which keeps the system reliable. The evaluation results indicate that OSV has performance comparable to native Linux, and a performance improvement of 4%–13% compared to Xen.

Acknowledgements This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 60933003 and 61272460).

References

1. Barham P, Dragovic B, Fraser K, Hand S, Harris T, Ho A, Neugebauer R, Pratt I, Warfield A. Xen and the art of virtualization. In: Proceedings of the 19th ACM Symposium on Operating Systems Principles. 2003, 164–177

2. Understanding memory resource management in VMware ESX Server. VMware white paper. www.vmware.com/files/pdf/perf-vsphere-memory_management.pdf

3. Klein G, Elphinstone K, Heiser G, Andronick J, Cock D, Derrin P, Elkaduwe D, Engelhardt K, Kolanski R, Norrish M, Sewell T, Tuch H, Winwood S. seL4: formal verification of an OS kernel. In: Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles. 2009, 207–220

4. Holzmann G J. The logic of bugs. In: Proceedings of Foundations of Software Engineering. 2002

5. Gens F. IT cloud services user survey, part 2: top benefits & challenges. http://blogs.idc.com/ie/?p=210

6. Boyd-Wickizer S, Chen H, Chen R, Mao Y, Kaashoek F, Morris R, Pesterev A, Stein L, Wu M, Dai Y. Corey: an operating system for many cores. In: Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation. 2008, 43–57

7. Engler D, Kaashoek M. Exokernel: an operating system architecture for application-level resource management. ACM SIGOPS Operating Systems Review, 1995, 29(5): 251–266

8. Baumann A, Barham P, Dagand P, Harris T, Isaacs R, Peter S, Roscoe T, Schupbach A, Singhania A. The multikernel: a new OS architecture for scalable multicore systems. In: Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles. 2009, 29–44

9. Seshadri A, Luk M, Qu N, Perrig A. SecVisor: a tiny hypervisor to provide lifetime kernel code integrity for commodity OSes. ACM SIGOPS Operating Systems Review, 2007, 41(6): 335–350

10. McCune J M, Li Y, Qu N, Zhou Z, Datta A, Gligor V, Perrig A. TrustVisor: efficient TCB reduction and attestation. In: Proceedings of the IEEE Symposium on Security and Privacy. 2010, 143–158

11. Keller E, Szefer J, Rexford J, Lee R B. NoHype: virtualized cloud infrastructure without the virtualization. ACM SIGARCH Computer Architecture News, 2010, 38(3): 350–361

12. Shinagawa T, Eiraku H, Tanimoto K, Omote K, Hasegawa S, Horie T, Hirano M, Kourai K, Oyama Y, Kawai E. BitVisor: a thin hypervisor for enforcing I/O device security. In: Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments. 2009, 121–130

13. Zhang F, Chen J, Chen H, Zang B. CloudVisor: retrofitting protection of virtual machines in multi-tenant cloud with nested virtualization. In: Proceedings of the 23rd ACM Symposium on Operating Systems Principles. 2011, 203–216

14. Steinberg U, Kauer B. NOVA: a microhypervisor-based secure virtualization architecture. In: Proceedings of the 5th European Conference on Computer Systems. 2010, 209–222

15. Klein G, Elphinstone K, Heiser G, Andronick J, Cock D, Derrin P, Elkaduwe D, Engelhardt K, Kolanski R, Norrish M. seL4: formal verification of an OS kernel. In: Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles. 2009, 207–220

16. Franklin J, Seshadri A, Qu N, Chaki S, Datta A. Attacking, repairing, and verifying SecVisor: a retrospective on the security of a hypervisor. Technical Report CMU-CyLab-08-008. 2008

17. Wang Z, Jiang X. HyperSafe: a lightweight approach to provide lifetime hypervisor control-flow integrity. In: Proceedings of the IEEE Symposium on Security and Privacy. 2010, 380–395

18. Ravi V, Becchi M, Agrawal G, Chakradhar S. Supporting GPU sharing in cloud environments with a transparent runtime consolidation framework. In: Proceedings of the International Symposium on High-Performance Parallel and Distributed Computing. 2011

19. AMD. AMD64 architecture programmer's manual volume 2: system programming. 2007

20. Holzmann G J. The model checker SPIN. IEEE Transactions on Software Engineering, 1997, 23(5): 279–295

21. McVoy L, Staelin C. lmbench: portable tools for performance analysis. In: Proceedings of the 1996 USENIX Annual Technical Conference. 1996, 23

22. Kortchinsky K. Hacking 3D (and breaking out of VMware). In: Proceedings of the Black Hat conference. 2009

23. Wojtczuk R, Rutkowska J. Xen 0wning trilogy. In: Proceedings of the Black Hat conference. 2008

24. Secunia. Xen multiple vulnerability report. http://secunia.com/advisories/44502/

25. Ren J, Qi Y, Dai Y, Xuan Y. Inter-domain communication mechanism design and implementation for high performance. In: Proceedings of the 4th International Symposium on Parallel Architectures, Algorithms and Programming (PAAP). 2011, 272–276

Yuehua Dai received his BS in computer software and theory from Xi’an Jiaotong University in 2004. He is currently a PhD candidate in computer science at Xi’an Jiaotong University. His research interests include operating systems, VMM, cloud computing, and system security.

Yi Shi received her PhD in computer software and theory from Xi’an Jiaotong University in 2008. She is a lecturer in the School of Electronic and Information Engineering, Xi’an Jiaotong University. Her research interests include operating systems, network security, cloud computing, and VMM.

Yong Qi received his PhD in computer software and theory from Xi’an Jiaotong University in 2001. He is currently a professor in the School of Electronic and Information Engineering, Xi’an Jiaotong University and the director of the Institute of Computer Software and Theory. His research interests include operating systems, distributed systems, pervasive computing, software aging, and VMM. He has published more than 80 papers in international conferences and journals, including ACM SenSys, IEEE PerCom, ICNP, ICDCS, ICPP, IEEE TMC, and IEEE TPDS.

Jianbao Ren received his BS in computer software and theory from Xi’an Jiaotong University in 2009. He is currently a PhD candidate in computer science at Xi’an Jiaotong University. His research interests include operating systems, VMM, cloud computing, and system security.

Peijian Wang received his BS in computer software and theory from Xi’an Jiaotong University in 2004. He is currently a PhD candidate in computer science at the same university. His research interests include power management, cloud computing, and Internet data centers.