A high performance inter-domain communication approach for virtual machines



The Journal of Systems and Software 86 (2013) 367–376
http://dx.doi.org/10.1016/j.jss.2012.08.054


Yuebin Bai a,c,∗, Yao Ma a, Cheng Luo b, Duo Lv a, Yuanfeng Peng a

a State Key Laboratory of Software Development Environment, Beihang University, Beijing 100191, China
b Department of Computer Science, The University of Tokyo, Tokyo, Japan
c Science and Technology on Information Transmission and Dissemination in Communication Networks Laboratory, Shijiazhuang 050081, China

∗ Corresponding author. Tel.: +86 10 8233 9020. E-mail addresses: [email protected] (Y. Bai), [email protected] (Y. Ma).

Article history: Received 28 September 2011; Received in revised form 15 August 2012; Accepted 23 August 2012; Available online 1 September 2012.

Keywords: Virtualization; Inter-VM communication; Overhead; Communication path

Abstract

In the virtualization field, research has mainly focused on strengthening the isolation barrier between virtual machines (VMs) that are co-resident within a single physical machine. At the same time, there are many kinds of distributed communication-intensive applications, such as web services, transaction processing, graphics rendering and high performance grid applications, which need to communicate with other virtual machines on the same platform. Unfortunately, current inter-VM communication methods cannot adequately satisfy the requirements of such applications. In this paper, we present the design and implementation of a high performance inter-VM communication method called IVCOM, based on the Xen virtual machine environment. In para-virtualization, IVCOM achieves high performance by bypassing some protocol stacks and the privileged domain, shunning page flipping, and providing a direct, high-performance communication path between VMs residing in the same physical machine. In full-virtualization, IVCOM applies a direct communication channel between domain 0 and the Hardware Virtualization based VM (HV2M) and can greatly reduce the VM entry/exit operations, which improves HV2M performance. In the evaluation of para-virtualization, consisting of several benchmarks, we observe that IVCOM can reduce the inter-VM round trip latency by 70% and increase throughput by up to 3 times, which proves the efficiency of IVCOM in a para-virtualized environment. In the full-virtualized one, IVCOM can reduce VMX transition operations by 90% in the communication between domain 0 and the HV2M.

© 2012 Elsevier Inc. All rights reserved.

1. Introduction

Virtual machine (VM) technology was introduced in the 1960s and reached prominence in the early 1970s. It can create virtual machines which provide function and performance isolation across applications and services that share a common hardware platform. At the same time, VMs can improve system-wide utilization efficiency and lower the overall operation cost of the system. With the advent of low-cost minicomputers and personal computers, the need for virtualization declined (Figueiredo et al., 2005). With growing interest in improving the utilization of computing resources through server consolidation, VM technology is regaining the spotlight and is widely used in many fields. Now, there are many virtual machine monitors (VMMs) such as VMware (Sugerman et al., 2001), Virtual PC, UML, and Xen (Barham et al., 2003). Among them, Xen develops a technique known as para-virtualization (Whitaker et al., 2002), which offers virtualization with low overhead and has attracted much attention from both the academic VM community and the enterprise market.


However, the para-virtualized approach has an intrinsic shortcoming: the guest OS kernel has to be modified to recuperate the processor's virtualization holes. To implement full-virtualization (Adams and Agesen, 2006) on the x86 platform, some processor manufacturers propose their own hardware assisted technologies to support full virtualization, such as Intel's VT technology and AMD's Pacifica technology.

In spite of the recent advances in VM technology, virtual network performance remains a major challenge (Menon et al., 2006). Research by Menon et al. (2005) shows that a Linux guest domain has far lower network performance than native Linux in scenarios of inter-VM communication. The communication performance between two processes in their own VMs on the same physical machine is even worse than expected, which is mainly due to virtualization's central characteristic of isolation. For example, a distributed HPC application may have two processes running in different VMs that need to communicate using messages over MPI libraries. Another example is network transactions: in order to satisfy a client transaction request, a web service running in one VM may need to communicate with a database server running in another VM. Even routine inter-VM communication, such as file transfers, may need to cross the isolation barrier frequently.
In the examples above, we would like to use a direct, high performance communication mechanism which can minimize the communication latency and maximize the throughput.

In this paper, building on research on virtualization technology and multi-core technology, we combine the two and propose a high performance inter-VM communication method. The rest of this paper is organized as follows. The motivation of this work, together with a brief view of the Xen network background, is described in Section 2. Section 3 presents the related work. Section 4 presents the design and implementation of the inter-VM communication method (IVCOM) in para-virtualization, and Section 5 describes its extension to hardware virtualization based VMs. Section 6 discusses the overhead and the detailed performance evaluation of IVCOM in para-virtualization. Section 7 shows the experiments in full-virtualization. Section 8 draws the conclusion.

2. Motivation

While enforcing isolation is an important requirement from the viewpoint of security of individual software components, it can also result in significant communication overhead, because different software components need to communicate with each other across the isolation barrier to achieve application objectives in specific scenarios. We take Xen as an example and analyze it to find out the reason for the communication performance loss.

Xen is an open source hypervisor running between the hardware and the operating systems (OS). It virtualizes all resources over the hardware and provides the virtualized resources to the OSes running on Xen. Each OS is called a guest domain or domain U, and the only privileged OS, which hosts the application level management software, is called domain 0. Xen supports two virtualization ways: full-virtualization and para-virtualization. With para-virtualization it can provide virtual machine performance close to a native one.

In para-virtualization, Xen exports virtualized network devices to each domain U, replacing the actual network drivers that interact with the real network card within domain 0. Domain 0 communicates with domain U by means of the split network driver architecture shown in Fig. 1.

Fig. 1. Xen architecture with hardware-assisted virtual machine support.

Domain 0 hosts the backend of the split network driver, called netback, and the domain U hosts the frontend, called netfront. They interact by using a high-level network device abstraction rather than low-level, network hardware specific mechanisms. The split drivers communicate with each other through two producer–consumer ring buffers. The ring buffers are a standard lockless shared memory data structure built on the grant table and event channels, which are two primitives in the Xen architecture.

The grant table can be used to share pages between domain U and domain 0. The frontend of the split driver in domain U can notify the Xen hypervisor that a memory page can be shared with domain 0. Domain U then passes a grant table reference through the event channel to domain 0, which copies data to or from the memory page of domain U. Once the page access is completed, domain U removes the grant reference. Page sharing is useful for synchronous I/O operations such as sending packets through a network device. Meanwhile, domain 0 may not know the destination domain for an incoming packet until the entire packet has been received. In this situation, domain 0 first DMAs the packet into its own memory page. Then, domain 0 can choose to copy the entire packet into the domain U's memory. If the packet is large, domain 0 will instead notify the Xen hypervisor that the page can be transferred to the target domain. The domain U then initiates a transfer of the received page and returns a free page back to the hypervisor for the next one.
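To make the grant-table/event-channel interplay concrete, the fragment below is a minimal sketch of how a frontend could export a single page to domain 0 and then kick the backend, written against the Linux pvops Xen helpers (gnttab_grant_foreign_access, notify_remote_via_evtchn). It is illustrative only: the ring protocol is omitted, backend_domid and tx_port stand in for values that would come from XenStore during device setup, and the address-to-frame conversion helper varies between kernel versions.

    /* Sketch: export one page to the backend domain and signal it.
     * Assumes the Linux pvops Xen helpers; error paths are trimmed. */
    #include <linux/errno.h>
    #include <linux/gfp.h>
    #include <xen/grant_table.h>
    #include <xen/events.h>
    #include <xen/page.h>

    static int share_tx_page(domid_t backend_domid, evtchn_port_t tx_port)
    {
        unsigned long page = get_zeroed_page(GFP_KERNEL);
        grant_ref_t ref;

        if (!page)
            return -ENOMEM;

        /* Grant the backend read-only access to this frame; the returned
         * grant reference is what would be placed in a shared ring request. */
        ref = gnttab_grant_foreign_access(backend_domid,
                                          virt_to_gfn((void *)page), 1);

        /* ... fill the page with packet data and publish the request ... */

        /* Kick the backend through the event channel bound at setup time. */
        notify_remote_via_evtchn(tx_port);
        return ref;
    }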

In full-virtualization, Intel VT technology defines two modes for virtual machines: root mode and non-root mode. In root mode, a virtual machine has the whole privilege and full control of the processor(s) and other platform hardware. In non-root mode, a virtual machine can only operate with limited privilege. Corresponding to the two modes, VT provides a new form of processor operation called virtual machine extension (VMX) operation. There are two kinds of VMX operation: VMX root operation and VMX non-root operation. In general, a VMM will run in VMX root operation and guest software will run in VMX non-root operation. The transition between VMX root operation and VMX non-root operation is called a VMX transition. There are two kinds of VMX transitions: the transition into VMX non-root operation is called VM entry, and the transition from VMX non-root operation to VMX root operation is called VM exit.

In Xen, domain 0 runs in root mode and a Hardware Virtualization based VM (HV2M) runs in non-root mode. When applications in the HV2M need to access hardware resources such as IO devices, a VMX transition occurs. First of all, the current running scene is saved into a virtual machine control structure (VMCS), and the root mode scene is loaded from it. In this way, the HV2M domain is scheduled out and domain 0 is scheduled in. The real IO request is then handled by the device model. After that, domain 0 is scheduled out and the processor switches back to the HV2M. After analysing the process of a VMX transition, we find that the actual IO handling cost takes only a small part of the total overhead of one switch. As all privileged accesses such as IO accesses or interrupts are handled in this way, the performance of the HV2M degrades by about 10–30%.

By detailing the network communication process between VMs in para-virtualization, we can see that there is no direct communication channel between guest VMs: all communications need to be forwarded by domain 0, which decreases both the performance of the communication between guest VMs and the performance of domain 0.

In full-virtualization, the communication between VMs running in different modes is not smooth either. Ideally, we would like the communication between VMs on the same physical machine to be simple, direct, and high performance. Therefore, we propose a new communication method that can set up direct communication channels between VMs.

3. Related work

There have been several research efforts to improve inter-VM communication performance in virtualization environments. For example, XWay (Kim et al., 2008), XenSockets (Zhang et al., 2007) and IVC (Huang et al., 2007) have developed tools that are more efficient than the traditional communication path that needs to go via domain 0. XWay provides transparent inter-VM communication for TCP-oriented applications by intercepting TCP socket calls beneath the socket layer. It requires extensive modifications to the implementation of the network protocol stack in the core OS, since Linux does not seem to provide transparent netfilter-type hooks to intercept messages above the TCP layer. XenSockets is a one-way communication pipe between two VMs, based on shared memory. It defines a new kind of socket, with associated connection establishment and read–write system calls that provide an interface to the underlying inter-VM shared memory communication mechanism. In order to use these calls, user applications and libraries need to be modified. XenSockets is suitable for applications that are high throughput distributed stream systems, in which the latency requirement is relaxed and which can perform batching at the receiver side. IVC is a user level communication library intended for message passing HPC applications. It can provide shared memory communication across VMs that reside within the same physical machine. It also provides a socket-style user API, through which an IVC aware application or library can be written. IVC is beneficial for HPC applications that can be modified to explicitly use the IVC API.

In other application areas, XenFS improves file system performance through inter-VM cache sharing. HyperSpector (Kourai and Chiba, 2005) permits secure intrusion detection through inter-VM communication. Prose (Hensbergen and Goss, 2006) utilizes shared buffers for low-latency IPC in a hybrid microkernel-VM environment. Proper (Muir et al., 2005) introduces techniques that allow multiple PlanetLab services to cooperate with each other.

4. IVCOM of PV

4.1. IVCOM overview

Our IVCOM borrows some ideas from these systems; after all, any inter-VM communication method has to depend on the common underlying mechanisms Xen provides, such as the grant table and event channels. Different from the methods mentioned above, IVCOM not only utilizes page sharing between two VMs, but also makes it possible to shorten the path a packet takes through the network protocol stack. Besides the advantages it shares with the others, IVCOM has its own distinctive characteristic: with netfilter applied, an inter-VM packet can partly skip the IP layer and skip the link layer entirely, without any modification to the OS core. It is also application-transparent and easy to use.

In IVCOM, there is a discovery module within domain 0, which is responsible for collecting VM information and managing resources such as the event channel table and shared memory. In each domain U, there is a manage module used for the management of communication channels, a communication module used for the communication between VMs, and an IVCOM switch that decides whether IVCOM is used or not in a specific scenario. The details are illustrated in Fig. 2.

Fig. 2. IVCOM overview.

As we have discussed in Section 2, the traditional communication way requires the involvement of domain 0 and results in lower performance. With IVCOM, we can establish high performance communication channels between VMs with little maintenance done by domain 0, which alleviates the stress on domain 0. The discovery module within domain 0 maintains an event channel table which records the necessary information for inter-VM communication. When a new VM is created, first of all, it sends a registration message to the discovery module. This message comprises the new VM's IP and domain ID, which is used to identify each VM within the host OS. After receiving this registration message, the discovery module sets up a record with several properties for the new VM in the event channel table. Then the manage module in the new VM starts the channel bootstrap process with each existing virtual machine by accessing the event channel table. This process completes the allocation of an event channel and the establishment of the shared memory. After this, the new VM can exchange data with any VM residing within the same physical computer and achieve high performance inter-VM communication through the circular queue data structure in the shared memory space.

4.2. Event channel table

In IVCOM, we have to use event channels to deliver notifications between VMs. In the Xen architecture, events are the standard mechanism for delivering notifications from the hypervisor to guests, or between guests. They are similar to UNIX signals and can be used for inter-VM communication. Here we use event channels as part of the communication channels between VMs, and set up an event channel between each pair of VMs. As each VM has to connect with all other VMs residing within the host OS, there are many event channels in each VM. This overhead is acceptable even if every pair of VMs sets up an event channel.




Fig. 3. Event channels table.

A VM needs to know which channel connects to the communication target VM. Therefore, we set up a global event channels table in XenStore to record the necessary information about each VM, as illustrated in Fig. 3. XenStore is a hierarchical namespace which is shared between VMs. It can be used to store information about the VMs during their execution and as a mechanism for creating and controlling VM devices. There are several properties in the event channel table, as follows:

• IP: the VM's network address.
• VM ID: the VM's domain ID.
• Port: the event channel port number.
• Status: the VM's status, one of running (r), paused (p), shutdown (s).

VMs can access "IP" in the event channels table to learn of the existence of other VMs that reside in the same physical computer. "VM ID" is used to identify each VM, and "Port" is used to identify each event channel in a VM. The terms "event channel" and "port" are used almost interchangeably in Xen. Each event channel contains two ports: one is in the VM that allocates the event channel and the other is used by the remote VM to bind. From the perspective of either VM, the local port is the channel; it has no other means of identifying it. So a VM can access the event channels table to know which port connects to the target VM. "Status" shows the current status of each VM. Only when a VM is in the running status can the corresponding event channel be used.
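The paper lists the four record properties but not a concrete layout; the struct below is a hypothetical illustration of one entry and of the per-VM copy of the table, with field sizes chosen for the sketch rather than taken from IVCOM.

    /* Hypothetical layout of one record in the event channels table. */
    #include <stdint.h>

    enum vm_status { VM_RUNNING = 'r', VM_PAUSED = 'p', VM_SHUTDOWN = 's' };

    struct evtchn_table_entry {
        uint32_t ip;       /* IP: the VM's network address (IPv4) */
        uint16_t vm_id;    /* VM ID: the VM's domain ID */
        uint16_t port;     /* Port: local event channel port for this peer */
        uint8_t  status;   /* Status: one of enum vm_status */
    };

    /* The per-VM copy pushed and refreshed by the discovery module would
     * then simply be an array searched by IP or domain ID. */
    struct evtchn_table {
        uint32_t nr_entries;
        struct evtchn_table_entry entry[64];   /* capacity is illustrative */
    };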

To guarantee the validity of the information in the event channels table, the discovery module periodically collects VM information to update the table. We set the discovery module to update every 5 min, which can be adjusted to keep the records valid. Because the event channels table is located within the hypervisor, VMs need to communicate with the hypervisor to access it, which leads to too much overhead and makes the table a performance bottleneck. To solve this problem and gain higher performance access to the event channels table, we copy the table into each VM. When a VM is created, domain 0 sends it a copy of the table and updates the copy periodically to keep the data valid. In this way, when a VM has a communication requirement, it can look up its local event channels table to establish the communication path, which improves performance greatly.

Fig. 4. Shared memory mapping.

4.3. Shared memory mapping

Xen’s grant table provides a generic mechanism for memorysharing between VMs, which allows shared memory communica-tion between unprivileged VMs. The data transmission of inter-VMis possible using data copy method through the shared memoryspace.

IVCOM uses a circular queue data structure to store data in the shared memory space, as illustrated in Fig. 4. The circular queue is a producer–consumer circular buffer that avoids the need for explicit synchronization between the producer and the consumer endpoints. The circular queue resides in the shared memory between the participating VMs. The front and back indices are atomically incremented by the consumer and the producer, respectively, as they remove data packets from or insert them into the circular queue. When multiple producer threads might concurrently access the producer end of the queue, we can use a producer spinlock to guarantee mutually exclusive access, which still does not require any cross-domain synchronization. Multiple consumers concurrently accessing the consumer end of the queue can be handled in the same way.
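A minimal single-producer/single-consumer version of such a circular queue is sketched below. The slot layout, sizes and index convention (the producer advances back, the consumer advances front) are assumptions made for the sketch, not the authors' actual structure.

    /* Hypothetical SPSC circular queue placed in the shared memory region.
     * front is advanced only by the consumer and back only by the producer,
     * so no cross-domain lock is needed; per-side spinlocks would be added
     * for multiple producers or consumers, as the text notes. */
    #include <stdint.h>
    #include <string.h>

    #define IVCOM_SLOTS     32
    #define IVCOM_SLOT_SIZE 1024

    struct ivcom_ring {
        volatile uint32_t front;                  /* next slot to consume */
        volatile uint32_t back;                   /* next slot to fill    */
        uint8_t slot[IVCOM_SLOTS][IVCOM_SLOT_SIZE];
    };

    static int ring_put(struct ivcom_ring *r, const void *pkt, uint32_t len)
    {
        if (len > IVCOM_SLOT_SIZE || r->back - r->front == IVCOM_SLOTS)
            return -1;                            /* full or oversized packet */
        memcpy(r->slot[r->back % IVCOM_SLOTS], pkt, len);
        __sync_synchronize();                     /* data visible before index */
        r->back++;
        return 0;
    }

    static int ring_get(struct ivcom_ring *r, void *buf, uint32_t len)
    {
        if (r->back == r->front)
            return -1;                            /* empty */
        if (len > IVCOM_SLOT_SIZE)
            len = IVCOM_SLOT_SIZE;
        memcpy(buf, r->slot[r->front % IVCOM_SLOTS], len);
        __sync_synchronize();                     /* finish the copy first */
        r->front++;
        return 0;
    }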

When a new VM is created, its manage module sets up a pair of virtual circular queues for sending and receiving data (SQ and RQ) with each of the existing VMs within the host OS. At the same time, the existing VMs also set up a pair of virtual circular queues for data exchange with the new VM. These virtual circular queues in each VM do not have their own shared memory space. Actually, they are mapped to the physical circular queues which are set up by the discovery module in the shared memory space between the VMs, as illustrated in Fig. 4. Consequently, a queue in the shared memory space is mapped to the SQ in guest VM1 and also mapped to the RQ in VM2. In this way, putting data into the SQ of VM1 is equivalent to putting it into the RQ of VM2, which achieves the transmission between VMs with one memory copy (Huang et al., 2008). Once the event channel and the circular queues are set up, they are not destroyed until one of the VMs that the communication channel connects is destroyed.

4.4. Communication channel establishment

When a new VM is created, its management module sends a register message to the discovery module in domain 0 and receives a copy of the event channel table.


After that, it starts to establish a communication channel, consisting of one event channel and two circular queues, with each existing VM. The establishing process is similar to a "client–server" connection setup, as illustrated in Fig. 5.

Fig. 5. Channel establishing process.

As the number of VMs in one physical machine grows, the cost of creating all communication channels between every pair of VMs becomes non-negligible and needs to be controlled. Considering that some IVCOM communication channels may be unnecessary, we make IVCOM configurable: only the configured communication channels will be established. Thereby, performance is improved and event channel resources are saved.

During the channel establishing process, we designate the guest VM with the smaller guest ID as the "server" and the other VM as the "client". First of all, the manage module of the server side sends a connect request message to the hypervisor, and the hypervisor sends the domain ID of the server side to the manage module of the client side. The client side accepts the request and returns its domain ID to the server side. Then the management module of the server side sets up a send circular queue and a receive circular queue, and allocates an event channel. After this, it sends a message containing the information of the circular queues and the event channel. Receiving this message, the management module of the client side sets up its receive circular queue and send circular queue, and maps them to the circular queues of the server side with the mapping method described in the shared memory design section. The client side also needs to bind to the event channel allocated by the server side, and sends an ack message to the server side. At this point, the channel establishing process is complete. To protect against the loss of either message, the server side will time out if the expected ack does not arrive, and will resend the create channel message 3 times before giving up.
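The wire format of these bootstrap messages is not given in the paper; the definitions below are a hypothetical encoding of the exchange in Fig. 5 (connect request, accept, channel parameters, final ack) plus the 3-retry rule, intended only to make the sequence easier to follow.

    /* Hypothetical message set for the channel-establishment handshake. */
    #include <stdint.h>

    enum ivcom_msg_type {
        IVCOM_CONNECT_REQ = 1,  /* server -> hypervisor -> client: server domid */
        IVCOM_CONNECT_ACK,      /* client -> server: client domid               */
        IVCOM_CHANNEL_INFO,     /* server -> client: queue grants + evtchn port */
        IVCOM_CHANNEL_ACK       /* client -> server: mapping and bind complete  */
    };

    struct ivcom_channel_info {
        uint16_t evtchn_port;   /* event channel allocated by the server side    */
        uint32_t sq_gref;       /* grant reference of the server's send queue    */
        uint32_t rq_gref;       /* grant reference of the server's receive queue */
    };

    struct ivcom_msg {
        uint8_t  type;          /* enum ivcom_msg_type */
        uint16_t src_domid;
        struct ivcom_channel_info info;  /* only valid for IVCOM_CHANNEL_INFO */
    };

    #define IVCOM_CREATE_RETRIES 3  /* resend CHANNEL_INFO, then give up */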

4.5. Data sending and receiving

When an application in the sender VM has a communication requirement, the IVCOM switch layer looks up the local event channel table to find out whether the communication target is within the same physical computer. If the target does not reside within the host OS, or the status of the target is not running, the IVCOM switch lets the front driver handle the communication requirement. Otherwise, IVCOM takes over and handles the requirement.

IVCOM captures the packets with netfilter, and then the data sending starts. First of all, the communication module accesses the event channel table to get the event channel port that connects to the target VM and that VM's status.


Then the VM chooses a virtual CPU to bind to the port. As only one core can bind to the port at a time, the virtual CPUs in the same VM need to negotiate to achieve dynamic binding. The asynchronous communication between virtual CPUs in the same VM is implemented on top of the advanced programmable interrupt controller (APIC) (Intel, 2008), which is developed by Intel and used for communication on multi-core platforms. The APIC is also adopted by Xen and used for the communication between virtual CPUs. We define two functions: one, called release_evtch, releases the event channel, and the other, called bind_evtch, binds a virtual CPU to the event channel. A core that wants to use the event channel can therefore send an inter-processor interrupt, provided by the APIC, to the core that currently holds the event channel to invoke release_evtch, which makes the event channel available, and can then bind itself by using bind_evtch. In this way, control of the event channel can be transferred among the virtual CPUs in a VM.

Once the virtual CPU gains the event channel, direct data exchange can take place between the VMs. The sender VM copies its data packet into the SQ, and then it signals the target VM through the event channel. The target VM intercepts the signal and copies the data packet from the RQ, then frees up the space for future data and returns an ack signal through the event channel. After receiving packets through IVCOM, the receiver submits them to the upper layer via netfilter. The introduction of the netfilter mechanism aims to shorten the protocol stack path that inter-VM packets go through.
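Putting the pieces of this section together, the sketch below shows one possible shape of the send path. It is not the authors' code: lookup_evtchn_entry, bind_evtch, peer_sq and ring_put are placeholders for the IVCOM internals named in the text (reusing the hypothetical structures sketched above), and only notify_remote_via_evtchn corresponds to a real Xen/Linux helper.

    /* Sketch of the IVCOM send path; the helpers below are placeholders. */
    struct evtchn_table_entry *lookup_evtchn_entry(uint32_t dst_ip); /* local table  */
    int  bind_evtch(uint16_t port);      /* bind this vCPU, IPI the current holder  */
    struct ivcom_ring *peer_sq(struct evtchn_table_entry *e); /* mapped send queue  */

    static int ivcom_send(uint32_t dst_ip, const void *pkt, uint32_t len)
    {
        struct evtchn_table_entry *e = lookup_evtchn_entry(dst_ip);

        /* Fall back to the split driver if the peer is remote or not running. */
        if (!e || e->status != VM_RUNNING)
            return -1;

        /* Take over the event channel on this vCPU; if another vCPU holds it,
         * bind_evtch asks it to release via an inter-processor interrupt.     */
        if (bind_evtch(e->port) < 0)
            return -1;

        /* One memory copy into the shared send queue ...                      */
        if (ring_put(peer_sq(e), pkt, len) < 0)
            return -1;

        /* ... and one event-channel notification to wake up the receiver.     */
        notify_remote_via_evtchn(e->port);
        return 0;
    }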

5. IVCOM in HV2M

5.1. PCI module in Linux

Our IVCOM cannot be used directly in the HV2M case, as it lacks the underlying support from Xen that exists in para-virtualization. With an unmodified guest OS kernel, an HV2M cannot use the hypercalls provided by Xen to send notifications between guest domains, such as the event channel and grant table operations. However, there exists PV on HVM in the Xen community, which is a mixture of para-virtualization and full-virtualization. The primary goal of PV on HVM is to boost the performance of fully virtualized HVM guests through the use of specially optimized paravirtual device drivers. It introduces a PCI module named platform-pci in Linux.

Inserting this PCI module into the domU Linux kernel enables us to access part of Xen's hypercalls. Its foundational mechanism is that we can access some hypercalls just as in a para-virtualized domU, by mapping one of the guest's memory pages to the hypercall page of Xen. Although the entire hypercall page of Xen can be mapped by the domU through the PCI module, only a part of the hypercalls, and only part of the functions called by a hypercall, can be completed successfully. The domain-type check in each hypercall blocks any illegal hypercall from such a domU.

Fortunately, the subset of hypercalls available to an HV2M is sufficient to support the implementation of IVCOM in HV2M. Through our research, the event_channel_op and memory_op hypercalls are supported for both x86 and x86_64 dom0 kernels, while part of grant_table_op can only be achieved with an x86_64 dom0 kernel. The situation in an HV2M running Windows is the same as for Linux.
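For readers unfamiliar with PV on HVM, the fragment below condenses the standard Xen HVM hypercall-page discovery that platform-pci style code builds on: the guest finds the Xen CPUID leaves, reads an MSR index from leaf base+2, and writes the guest-physical address of a reserved page to that MSR so Xen fills it with hypercall stubs. This follows the Xen HVM ABI as the editor understands it and may differ in detail from the module used in the paper.

    /* Condensed sketch of Xen HVM hypercall-page setup (kernel context). */
    #include <linux/types.h>
    #include <linux/errno.h>
    #include <asm/processor.h>   /* cpuid()  */
    #include <asm/msr.h>         /* wrmsrl() */
    #include <asm/page.h>        /* __pa(), PAGE_SIZE */

    extern char hypercall_page[PAGE_SIZE];   /* page reserved for the stubs */

    static int map_xen_hypercall_page(void)
    {
        u32 eax, ebx, ecx, edx, base = 0x40000000;

        cpuid(base, &eax, &ebx, &ecx, &edx);
        if (ebx != 0x566e6558)               /* "XenV" of "XenVMMXenVMM" */
            return -ENODEV;                  /* not running on Xen */

        /* Leaf base+2: eax = number of hypercall pages, ebx = MSR index. */
        cpuid(base + 2, &eax, &ebx, &ecx, &edx);

        /* Writing the page's guest-physical address to that MSR makes Xen
         * populate it with one trampoline per hypercall number.           */
        wrmsrl(ebx, __pa(hypercall_page));
        return 0;
    }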

5.2. Overview of IVCOM in HV2M

As we have discussed in Section 2, if an application in an HV2M tries to access privileged resources such as IO, the HV2M executes a VM exit and the domain is scheduled out by the hypervisor so that the request can be handled. After handling the request, the processor executes a VM entry and switches back to the HV2M. The overhead of frequent transitions between root mode and non-root mode is considerable.


By analyzing the reason for the VMX transitions, we know that an HV2M has some tasks done by domain 0 and the two need to communicate with each other. We were therefore eager to find out whether IVCOM is suitable for HV2M, and the results prove the hypothesis. Although IVCOM cannot eliminate the transitions entirely, since using hypercalls also causes VMX transitions, it can greatly reduce the total number of VMX transitions.

After inserting the PCI module introduced above into the Linux kernel, as illustrated in Fig. 6, we can use IVCOM in an HV2M in the same way as in a para-virtualized domain. In an HV2M, IVCOM can establish a direct communication channel between the HV2M and domain 0 to achieve high performance communication. When applications in the HV2M have to access privileged resources, they can use IVCOM to exchange information with domain 0 and let domain 0 handle the request, instead of relying on the VMX transitions that degrade HV2M performance so heavily. The approach is similar for a Windows or a Linux guest, in both cases based on a virtual PCI device.

Fig. 6. IVCOM in HV2M.

In a nutshell, thanks to the support from the PCI module, IVCOM of PV can be implemented entirely in HV2M, including shared memory mapping, event channel establishment, and data sending and receiving.
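As an illustration of why the transition count drops (again reusing the hypothetical ring sketch from Section 4.3, not the authors' code): with an emulated device every privileged I/O access exits to root mode, whereas with IVCOM the copies stay inside the guest and only the event-channel notification leaves it.

    /* Illustrative only: send n packets from the HV2M through the shared ring,
     * paying one notification (and hence one transition) per batch instead of
     * one or more exits per packet on an emulated device.                      */
    struct pkt { const void *data; uint32_t len; };

    static void send_batch_via_ivcom(struct ivcom_ring *sq, uint16_t port,
                                     const struct pkt *pkts, int n)
    {
        int i;

        for (i = 0; i < n; i++) {
            while (ring_put(sq, pkts[i].data, pkts[i].len) < 0)
                notify_remote_via_evtchn(port);  /* ring full: kick and retry */
            /* the copy itself causes no VM exit */
        }
        notify_remote_via_evtchn(port);          /* one kick for the whole batch */
    }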

6. Experiments in para-virtualization

6.1. Experiments configuration

In this section, we show the performance evaluation of IVCOM in a para-virtualization environment. We perform the experiments on a test machine with an Intel Q9300 2.5 GHz 4-core processor, 2 MB cache and 4 GB main memory. The hypervisor is Xen 3.2.0 and the guest OS kernel is para-virtualized Linux 2.6.18.8. Two guest VMs for inter-VM communication are configured on the test machine, each one with 512 MB memory allocated. We compare the following three scenes:

• IVCOM: guest to guest communication through the IVCOM inter-VM communication mechanism.
• Split driver: guest to guest communication through the split driver.
• Host: network communication between two processes within the host OS. This experiment works as a standard of comparison for the other scenes.

To carry out the experiments, we have used several benchmarks: netperf, lmbench (McVoy and Staelin, 1996) and ping. Netperf is a benchmark that can be used to measure the performance of many different types of networking. It provides tests for both unidirectional throughput and end-to-end latency. In our environment, the results measured by netperf include TCP and UDP via BSD sockets for IPv4 and IPv6, DLPI, Unix domain sockets and SCTP for both IPv4 and IPv6. Lmbench is a set of utilities to test the performance of a Unix system, producing detailed results as well as providing tools to process them. It includes a series of micro benchmarks that measure some basic operating system and hardware metrics.

6.2. Experiment overview

The throughput results measured across the three different communication scenes are compared in Table 1. We can observe that IVCOM improves the throughput over the split driver by more than a factor of two and reaches a factor of 0.78–0.89 of the Host throughput, which is considered the maximum achievable throughput.

Table 1
Average throughput.

                     IVCOM    Split driver    Host
Netperf TCP (Mbps)   5539     2405            6197
Netperf UDP (Mbps)   5723     2533            6838
Lmbench TCP (Mbps)   5896     2438            7467
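For reference, the IVCOM-to-Host ratios computed from the Table 1 values are roughly

    5539 / 6197 ≈ 0.89 (netperf TCP),  5723 / 6838 ≈ 0.84 (netperf UDP),  5896 / 7467 ≈ 0.79 (lmbench TCP),

consistent with the 0.78–0.89 range quoted above.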

Table 2 compares the measured latency across the three communication scenes using five request–response workloads generated by the benchmarks. Compared to the split driver, IVCOM achieves 3 times lower latency with netperf TCP RR, 4 times lower latency with netperf UDP RR, 2.3 times lower latency with lmbench TCP, 2.7 times lower latency with lmbench UDP, and 2.2 times lower ping latency. Compared to the Host latency, IVCOM is 1.18 times higher with netperf TCP RR, 1.17 times higher with netperf UDP RR, 1.21 times higher with lmbench TCP, 1.19 times higher with lmbench UDP, and 1.67 times higher with ping. This illustrates that the latency of IVCOM is much lower than that of the split driver and is very close to the latency of the Host.

Table 2
Average latency.

                       IVCOM    Split driver    Host
Netperf TCP RR (μs)    17.89    47.78           15.1
Netperf UDP RR (μs)    14.35    61.27           12.3
Lmbench TCP (μs)       17.97    41.9            14.81
Lmbench UDP (μs)       14.7     40.4            12.4
Ping RTT (μs)          25       55              15





6.3. The impact of message size on performance

We have measured the throughput in the three scenes by using netperf's UDP STREAM test with increasing sending message sizes; the results are shown in Fig. 7.

Fig. 7. Impact of message size on throughput.

Throughput increases in all three communication scenes along with the increase of the message size. This is because a larger number of system calls are used to send the same number of bytes with a smaller message size, which results in more crossings between user and kernel mode. When the message size is larger than 512 bytes, IVCOM achieves higher throughput than the split driver. The split driver throughput increases slowly and reaches its peak when the message size is 4K bytes. As we have discussed in Section 2, the split driver uses a shared memory page to exchange data between VMs; hence, the maximum data size it can exchange at one time is one page, which is 4K bytes. In IVCOM, we set the circular queue size to 32K bytes; therefore, the throughput of IVCOM increases linearly and reaches its top when the message size is 32K bytes. As there is no shared memory space limitation in the Host scene, the Host throughput increases almost linearly along with the increase of the message size.

We set the circular queue size to 4K, 8K, 16K and 32K bytes, and then we test IVCOM with different message sizes. The results are shown in Fig. 8. When the circular queue is 4K bytes, the throughput increases slowly and almost reaches its peak at a message size of 4K bytes. When the circular queue is 8K bytes, the throughput increases faster than with 4K, and reaches its peak at 8K bytes. Similarly, 16K increases faster than 8K and touches its top at 16K bytes. With 32K bytes, we get similar results. From this experiment we know that increasing the circular queue size has a positive impact on the achievable throughput. In the other experiments, we set the circular queue size to 32K bytes.

Fig. 8. Impact of circular queue size on throughput.

We also measure the latency of IVCOM and the split driver by using the netperf UDP RR and TCP RR tests, as illustrated in Fig. 9. When the message size is smaller than 4K bytes, the latency of the split driver keeps steady. Once the message size is larger than 4K bytes, the latency increases rapidly. That is because the split driver uses one page (4K bytes) to exchange data each time; if the message size is larger than one page, it needs two pages or more to send the data, which leads to the rapid latency increase. Compared to the split driver, IVCOM has a much smaller latency. The latency keeps steady when the message size is smaller than the circular queue size. When the message size is too large, the latency of IVCOM also increases a little, which is endurable compared to the split driver.

Fig. 9. Impact of message size on latency.

6.4. The impact of virtual machines on performance

Fig. 10. Impact of VM number on throughput.



Fig. 11. Impact of virtual CPU number in each domain on throughput.

After that, we continue to measure the throughput while we increase the total number of VMs on the physical machine. In these four tests, each VM is assigned one virtual CPU. The test is based on netperf's UDP STREAM test and the result is illustrated in Fig. 10. We use the Host test result, measured between two processes in the same domain, as a standard for comparison. From the figure, we can see that as the VM amount increases, the throughput decreases both for the split driver and for IVCOM, while the throughput of the Host stays the same.

From the results, we can find that the throughput of the split driver decreases linearly with the number of VMs. The throughput of IVCOM remains unchanged when the VM amount is less than or equal to 4, and decreases linearly with the number of VMs when the VM amount is more than 4.

Finally, our test focuses on the influence of the virtual CPU amount per VM. In our test, the machine on which domain 0 is running has a four-core processor. At first, we simply estimated that if the virtual CPU amount of a domain U is more than the physical CPU amount, which is four in our experiment, the performance would decrease. The result can be seen in Fig. 11. As the amount of virtual CPUs in each VM increases but does not reach the total amount of physical CPUs, which is four in our machine, the throughput of IVCOM increases greatly. When the amount of virtual CPUs is equal to or more than the total physical CPU amount, the throughput of IVCOM almost remains the same. So increasing the throughput of IVCOM by increasing the virtual CPU amount is effective only if the virtual CPU amount is less than the physical CPU amount, and our prediction at the beginning was not quite correct. During the entire test, the throughput of the split driver and the Host (same as in the prior experiment) keeps steady regardless of the amount of virtual CPUs and its relationship with the amount of physical CPUs.

7. Experiments in full-virtualization

7.1. Experiments configuration

In this section, we show the performance evaluation of IVCOM in a full-virtualization environment. We perform the experiments on a test machine with an Intel Q9300 2.5 GHz four-core processor, 2 MB cache and 4 GB main memory. We also use Xen 3.2.0 as the hypervisor and full-virtualized Linux 2.6.18.8 as the guest OS kernel. The guest VMs on the test machine are configured with 512 MB memory. We compare the following two scenes:

• HV2M IVCOM: HV2M to domain 0 network communication through IVCOM.
• HV2M Tuning: HV2M to domain 0 network communication through traditional VMX transitions.

To carry out the experiments, we wrote a benchmark sending 500 MB of data from the HV2M to domain 0. We use Xentrace to calculate the VMX transition times in the two scenes.

Fig. 12. VMX transition times with HV2M Tuning.

Xentrace is a useful performance profiler, tracing all events occurring in the Xen VMM. For an HV2M domain, whenever the unmodified guest OS accesses a privileged resource, the processor triggers a VM exit event and passes control to the Xen hypervisor. Different VM exit events need different handling costs in the hypervisor. The success of performance tuning work can therefore be measured with the cost data collected by Xentrace, with the target of minimizing the total cost of all VM exits. Here, Xentrace is a perfect tool to analyze VM exit behavior: it is a built-in mechanism that Xen can use to trace important events happening within the hypervisor.

In our study, we focus on the total number of VMX transitions that can be avoided by using IVCOM instead of the traditional communication ways in HV2M.

7.2. Experiment results and analysis

First of all, we use the traditional way to send 500 MB of data from the HV2M to domain 0. In each run, we adjust the sending data size to test the impact of size on the number of VM entry/exit operations.

From Fig. 12, we know that the VM entry/exit times increase along with the decrease of the message size. In other words, the VM entry/exit times increase along with the increase of the number of message packages, because the processor needs to switch between the HV2M and domain 0 whenever there is a message package that needs to be sent. Therefore, it is better for the HV2M to send data in large messages to reduce the total number of transitions. But in most situations in an HV2M the message size is small; for example, the message size in socket communication is about 1400 bytes. IVCOM can solve this problem because it can use shared memory to exchange data between the HV2M and domain 0, so there is almost no message size limitation as long as there is enough space.

Fig. 13. Total VMX transition times with HV2M IVCOM.


Fig. 14. Latency in traditional VMX transitions.

We repeat the experiment with IVCOM and the results are shown in Fig. 13. As discussed above, the message size is about 1400 bytes in the traditional way, and the corresponding number of transitions is about 4.95E+6. We take this number as a reference for the results in Fig. 13. When the shared memory is set to one page (as we can make the message size as large as the whole shared memory), the number of transitions is 1.69E+6, which is only about one third of the number of transitions in the traditional way. When the shared memory is set to two pages, the number of transitions is 8.85E+5, which is only about one fifth of the traditional VMX transition number. As the shared memory becomes larger, the number of VMX transitions decreases rapidly. This is mainly because more data can be sent during each VMX transition. In this way, fewer VMX transitions occur when IVCOM is used instead of the traditional way to exchange the same amount of data between the HV2M and domain 0. Therefore, IVCOM can improve performance by reducing the number of VMX transitions and increasing the proportion of effective operations.

Through the experiments, we also find that the latency of IVCOM is higher than the latency of traditional VMX transitions, and that the size of the shared memory in IVCOM also affects the latency. The results are shown in Figs. 14 and 15.

From Fig. 14, we know that the latency of traditional transitions is between 200 and 300 μs, while the latency of IVCOM is between 1300 and 1700 μs when we set the shared memory to 16 memory pages (each page is 4096 bytes). To gain a further understanding of the effect of the shared memory size on the latency, we repeat the experiment and adjust the shared memory size; the results are shown in Fig. 16.


Fig. 15. Latency in IVCOM.

Fig. 16. Latency in IVCOM with different shared memory sizes.

When the shared memory size is small, the latency of IVCOM is small: when the shared memory size is one page (4096 bytes), the latency is only about 400 μs, which is still endurable for most applications. As the shared memory size increases, the latency also increases. The latency can reach 1500 μs when the shared memory size is 16 memory pages, which is insufferable for many applications, especially latency sensitive ones.

8. Conclusion

With the resuscitation of virtualization technology and the development of multi-core technology, using virtualization to enforce isolation and security among multiple cooperating components of complex distributed applications draws more and more attention. This makes it exigent for virtualization technology to enable high performance communication among VMs. In this paper, we present the design and the implementation of IVCOM, a high performance inter-VM communication method. Evaluation using a number of benchmarks demonstrates a significant increase in communication throughput and a reduction in inter-VM round trip latency. For future work, we are presently investigating whether IVCOM can achieve better performance when the message size is small. We also want to study further the impact of multi-core and virtual multi-core on the performance of IVCOM, which may help us to improve its performance on multi-core platforms.

Acknowledgements

This work is supported by the National Science Foundation of China under Grant No. 61073076, the Ph.D. Programs Foundation of the Ministry of Education of China under Grant No. 20121102110018, the Science and Technology on Information Transmission and Dissemination in Communication Networks Laboratory under Grant No. ITD-U12001, and the Beihang University Innovation & Practice Fund for Graduates. The authors would like to thank the anonymous referees for their valuable comments, which have helped to improve the manuscript.

References

Adams, K., Agesen, O., 2006. A comparison of software and hardware techniques for x86 virtualization. In: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, CA, USA, 21–25 October.

Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., Neugebauer, R., Pratt, I., Warfield, A., 2003. Xen and the art of virtualization. In: 19th ACM Symposium on Operating Systems Principles.

Figueiredo, R., et al., 2005. Resource virtualization renaissance. IEEE Computer 38 (May (5)), 28–31.

Hensbergen, E.V., Goss, K., 2006. Prose I/O. In: First International Conference on Plan 9, Madrid, Spain.



Huang, W., Koop, M.J., Panda, D.K., 2008. Efficient one-copy MPI shared memory communication in virtual machines. In: IEEE Cluster.

Huang, W., Koop, M., Gao, Q., Panda, D.K., 2007. Virtual machine aware communication libraries for high performance computing. In: SuperComputing (SC'07), Reno, NV, November.

Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 3A: System Programming Guide, Part 1, Chapter 9: Advanced Programmable Interrupt Controller (APIC), Order Number: 253668-029US, November 2008.

Kim, K., Kim, C., Jung, S.-I., Shin, H., Kim, J.-S., 2008. Inter-domain socket communications supporting high performance and full binary compatibility on Xen. In: Virtual Execution Environments (VEE'08).

Kourai, K., Chiba, S., 2005. HyperSpector: virtual distributed monitoring environments for secure intrusion detection. In: Virtual Execution Environments (VEE'05).

McVoy, L., Staelin, C., 1996. Lmbench: portable tools for performance analysis. In: Proc. of the USENIX Annual Technical Symposium.

Menon, A., Santos, J.R., Turner, Y., Janakiraman, G.J., Zwaenepoel, W., 2005. Diagnosing performance overheads in the Xen virtual machine environment. In: Virtual Execution Environments (VEE'05).

Menon, A., Cox, A.L., Zwaenepoel, W., 2006. Optimizing network virtualization in Xen. In: USENIX Annual Technical Conference, Boston, MA.

Muir, S., Peterson, L., Fiuczynski, M., Cappos, J., Hartman, J., 2005. Proper: privileged operations in a virtualised system environment. In: USENIX Annual Technical Conference, Anaheim, CA.

Sugerman, J., Venkitachalam, G., Lim, B., 2001. Virtualizing I/O devices on VMware Workstation's hosted virtual machine monitor. In: USENIX Annual Technical Conference.

Whitaker, A., Shaw, M., Gribble, S., 2002. Denali: lightweight virtual machines for distributed and networked applications. In: The USENIX Annual Technical Conference, Monterey, CA.

Zhang, X., McIntosh, S., Rohatgi, P., Griffin, J.L., 2007. XenSocket: a high-throughput interdomain transport for virtual machines. In: Middleware.

Yuebin Bai received his PhD degree in computer science from Xi'an Jiaotong University, Xi'an, China, in 2001. From 2001 to 2003, he was engaged in postdoctoral research as a visiting professor in the College of Science and Technology of Nihon University, Tokyo, Japan. In 2003, he joined the faculty of Beihang University, where he is currently a full professor in the School of Computer Science and Engineering. He is also a visiting research fellow in the Science and Technology on Information Transmission and Dissemination in Communication Networks Laboratory. He has been a principal investigator of several research projects including the National Natural Science Foundation of China (NSFC) and the National High Technology Research and Development (863) Program of China. He has published about 60 research papers in key international journals and conferences in the areas of system virtualization, service-oriented mobile ad hoc networks, wireless sensor networks, and pervasive computing. He is the inventor of 8 Chinese invention patents and one Japanese invention patent. His current research interests include system virtualization, wireless networks, real time and distributed systems. He is a senior member of the China Computer Federation (CCF), a member of the ACM and IEEE Computer Society, and also a member of the IEICE.


Yao Ma received his BS degree in computer science from Beihang University, Beijing, China, in 2010. He is pursuing an MS degree in computer science at Beihang University. His research interests include computer architecture, system virtualization and embedded systems.

Cheng Luo received his BS degree and MS degree, both in computer science, from Beihang University, Beijing, China, in 2007 and 2010 respectively. He is currently a PhD candidate in the Graduate School of Information Science and Technology, the University of Tokyo, Tokyo, Japan. He has published several research papers in international conferences as well as several invention patents in the areas of system virtualization and multi-core systems. His current research interests include low power technologies for high performance computers and system virtualization.

Duo Lv received his BS degree in computer science from Shanghai Jiaotong University, Shanghai, China, in 2009. From 2011 to 2012, he worked on system virtualization as a research fellow in Beihang University, Beijing, China. He is currently a PhD student at Arizona State University, USA. His research interests include system virtualization, distributed and embedded systems.

Yuanfeng Peng received his bachelor's degree in computer science from Beihang University, Beijing, China, in 2012. He is currently a research assistant in Beihang University. His research interests include computer architecture, parallel computing, system virtualization, and reconfigurable computing.