
vBalance: Using Interrupt Load Balance to Improve I/O Performance for SMP Virtual Machines

Luwei Cheng, Cho-Li Wang
Department of Computer Science

The University of Hong Kong
Pokfulam Road, Hong Kong, P.R. China

{lwcheng, clwang}@cs.hku.hk

ABSTRACT

A Symmetric MultiProcessing (SMP) virtual machine (VM) enables users to take advantage of a multiprocessor infrastructure in supporting scalable job throughput and request responsiveness. It is known that hypervisor scheduling activities can heavily degrade a VM's I/O performance, as the scheduling latencies of the virtual CPUs (vCPUs) eventually translate into processing delays for the VM's I/O events. A UniProcessor (UP) VM, whose interrupts are all bound to its only vCPU, relies completely on the hypervisor's help to shorten I/O processing delays, making the hypervisor increasingly complicated. Regarding SMP-VMs, most research ignores the fact that the problem can be greatly mitigated at the level of the guest OS, instead of imposing all scheduling pressure on the hypervisor.

In this paper, we present vBalance, a cross-layer software solution that substantially improves the I/O performance of SMP-VMs. Under the principle of keeping the hypervisor scheduler simple and efficient, vBalance requires only very limited help in the hypervisor layer. In the guest OS, vBalance dynamically and adaptively migrates interrupts from a preempted vCPU to a running one, and hence avoids interrupt processing delays. The prototype of vBalance is implemented in the Xen 4.1.2 hypervisor, with Linux 3.2.2 as the guest. The evaluation results of both micro-level and application-level benchmarks demonstrate the effectiveness and light weight of our solution.

Categories and Subject Descriptors

C.4 [Performance of Systems]: Design Studies; D.4.4 [Operating Systems]: Communications Management—Input/Output

General Terms

Design, Experimentation, Measurement, Performance

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SOCC'12, October 14-17, 2012, San Jose, CA USA
Copyright 2012 ACM 978-1-4503-1761-0/12/10 ...$15.00.

Keywords

Virtualization, Xen, SMP, Cloud Computing

1. INTRODUCTION

Virtualization technology allows running multiple VMs on one physical host by multiplexing the underlying physical resources. Modern cloud data centers are increasingly adopting virtualization software such as VMware [41], Xen [15], KVM [26] and Hyper-V [9], for the purposes of server consolidation, flexible resource management, better fault tolerance, etc. Each VM is given the illusion of owning dedicated computing resources. A VM can easily be configured with different settings, such as the amount of CPU cycles, network bandwidth, memory size and disk size. A virtual SMP infrastructure can be created simply by assigning the VM more than one vCPU. Compared with UP-VMs, which have only one vCPU, an SMP-VM can run multi-threaded or multi-process applications, moving tasks among the available vCPUs to balance the workload and thus utilize the processing power more fully. It is particularly attractive for enterprise-class applications such as databases, mail servers, content delivery networks, etc.

Virtualization can cause performance problems which do not exist in a non-virtualized environment. It is already known that the hypervisor scheduler can significantly affect a VM's I/O performance, as the vCPUs' scheduling latencies translate directly into processing delays for the VM's I/O requests. Evaluation results [16, 42] have revealed highly unpredictable network behavior on Amazon's EC2 [1] cloud platform. The negative effect manifests as degraded I/O throughput, as well as longer and less stable I/O latency for applications. Most research focuses on improving the I/O performance of UP-VMs [20, 21, 22, 25, 27, 32, 34, 45], whereas little has been done for SMP-VMs, ignoring the fact that the I/O problem in SMP-VMs is substantially different from that in UP-VMs.

For a UP-VM, since all interrupts must be processed by its only vCPU, once external events arrive the only way to avoid a performance drop is to force the hypervisor to schedule that vCPU as soon as possible. As a result, the pressure of responding to I/O is imposed completely on the hypervisor scheduler, leading to increasing complexity and context switch overhead. SMP-VMs allow more flexible interrupt assignment to vCPUs and are less dependent on the hypervisor's help to process I/O requests.

From the perspective of the hypervisor, since an SMP-VM has multiple vCPUs, it is likely that when one vCPU is descheduled another vCPU is still running; therefore, if the guest OS can adaptively migrate the interrupt workload from the preempted vCPU to a running one, it is unnecessary to bother the hypervisor scheduler, saving a lot of context switch overhead. From the perspective of the SMP guest OS, even though the whole VM gets CPU cycles, if the hypervisor schedules other vCPUs instead of the one responsible for the interrupts, I/O processing delays can still happen. Inappropriate interrupt mapping inside the guest OS can also cause performance problems: 1) if the interrupts are statically mapped to a specific vCPU all the time, that vCPU can easily be overloaded when the I/O workload is heavy; 2) since the hypervisor scheduler treats each vCPU fairly by allocating it an equal amount of CPU cycles, unbalanced interrupt load among vCPUs results in unequal use of them, leading to suboptimal resource utilization.

In this paper, we present vBalance, a cross-layer software solution that can substantially improve the I/O performance of SMP-VMs. vBalance bridges the knowledge gap between the guest OS and the hypervisor scheduler: the hypervisor dynamically informs the SMP-VM of the scheduling status of each vCPU, and the guest OS always tries to assign its interrupts to running vCPUs. Unlike traditional approaches that count entirely on altering the hypervisor scheduler, vBalance maximally exploits guest-level capability to accelerate I/O, and needs only very limited help from the hypervisor. As the hypervisor is expected to be as lean as possible to keep its efficiency and scalability, we believe the trend is to paravirtualize the guest OS more and more to adapt to the hypervisor.

The contributions of this paper can be summarized as follows: (1) our design, vBalance, demonstrates that for SMP-VMs the guest OS has great potential to help improve I/O performance, leaving the hypervisor scheduler unbothered most of the time; (2) we implement our vBalance prototype in the Xen 4.1.2 hypervisor and the Linux 3.2.2 guest OS, involving less than 450 lines of code; (3) we thoroughly evaluate the effectiveness of vBalance using both micro-benchmarks and cloud-style application benchmarks. The results show that vBalance brings significant improvement to both I/O throughput and I/O responsiveness, at the cost of very limited extra context switch overhead in the hypervisor.

The rest of the paper is organized as follows. Section 2 presents a detailed analysis of the problem and discusses possible approaches. Section 3 presents the design principles and vBalance's architecture. The implementation is introduced in Section 4. The solution is evaluated in Section 5. Section 6 describes related work. Section 7 discusses future work. We conclude our research in Section 8.

2. PROBLEM ANALYSIS

In this section, we first illustrate the problem by comparing physical SMP with virtual SMP. We then discuss possible approaches to address the problem.

2.1 IO-APIC vs. Event Channel

In the physical world, an SMP system features an IO-APIC (I/O Advanced Programmable Interrupt Controller) chip, to which all processors are connected via an ICC (Interrupt Controller Communication) bus, as shown in Figure 1 (a). An SMP-aware OS enables a process/thread to run on any CPU, whether it is part of the kernel or part of a user application. The IO-APIC contains a redirection table, which routes specific interrupts to a specific core or a set of cores, by writing an interrupt vector to the Local-APICs. The OS receives the interrupts from the Local-APICs and does not communicate with the IO-APIC until it sends an EOI (End Of Interrupt) notification. The IO-APIC is capable of delivering interrupts to any of the cores and can even perform load balancing among them. By default it delivers all interrupts to core 0. Multiple interrupts will keep one of the cores overloaded while the others remain relatively free. The OS scheduler has no idea about this state of affairs, and assumes that the interrupt-handling cores are as busy as any other core. It is the task of interrupt balancing software (such as Linux's irqbalance [7]) to distribute this workload more evenly across the cores: to determine which interrupts should go to which core, and then fill in this table for the IO-APIC chipset to use. If care is not taken to redistribute the interrupts properly, overall system performance can decrease, because some processors become overloaded while the remaining processors are not optimally utilized.

In the virtualization world, para-virtualization has been adopted (e.g., in Xen [15]) to significantly shrink the cost and complexity of I/O handling compared with full device emulation (e.g., QEMU [17]). In Xen's para-virtualized execution environment, the functionality of the IO-APIC chip for VMs is totally replaced by a software event channel mechanism, as shown in Figure 1 (b). For each VM, there is a global event channel table that records the detailed information of each channel, such as the event type, event state, notified vCPU, etc. The guest OS can correlate these events with its standard interrupt dispatch mechanisms. The events are distributed among all vCPUs, and each vCPU maintains its own event selection table (acting in the role of a Local-APIC). The guest OS informs the hypervisor of the event binding information, e.g., which vCPU is responsible for network devices and which vCPU takes care of block devices, so that the hypervisor can notify the corresponding vCPU when the relevant events arrive. After a vCPU is selected by the hypervisor to run, it checks its own event selection table to see whether there are pending events. If so, the corresponding interrupt handlers are called to process the events, by copying data from shared memory and acknowledging the backend (similar to sending an EOI notification).
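For reference, the per-channel vCPU binding is itself changed through a hypercall. A minimal sketch from the viewpoint of a Linux PV guest, using Xen's public event-channel interface (error handling omitted), might look like this:

    /* Sketch: re-route one event channel to a different vCPU from a Linux
     * PV guest. EVTCHNOP_bind_vcpu and struct evtchn_bind_vcpu are part of
     * Xen's public event-channel interface. */
    #include <xen/interface/event_channel.h>
    #include <asm/xen/hypercall.h>

    static int bind_channel_to_vcpu(evtchn_port_t port, unsigned int vcpu)
    {
        struct evtchn_bind_vcpu bind = {
            .port = port,   /* the event channel to re-route */
            .vcpu = vcpu,   /* vCPU that will be notified from now on */
        };

        /* Ask the hypervisor to deliver future events on 'port' to 'vcpu'. */
        return HYPERVISOR_event_channel_op(EVTCHNOP_bind_vcpu, &bind);
    }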

There are three main differences between the physical SMP and the virtual SMP. First, the physical cores are always "online", which means that once an interrupt is received by the Local-APIC, the CPU immediately jumps to the interrupt gate and fetches the IRQ handler from the IDT (Interrupt Descriptor Table) to run, with almost no delay; in a virtual SMP system, however, after an event is delivered to a specific vCPU, it still needs to wait for the hypervisor to schedule that vCPU before the interrupt can be processed. The scheduling delays are inevitable in nature, as multiple vCPUs share the same physical core. Second, most IO-APIC chips can be configured to route an interrupt to a set of cores (one to many), enabling more than one option for the interrupt's delivery. However, the current event channel mechanism [15] can only deliver events to a specific vCPU (one to one), which limits the hypervisor's delivery flexibility.

Page 3: vBalance: Using Interrupt Load Balance to Improve I/O ...i.cs.hku.hk/~clwang/papers/2012-socc-lwcheng.pdf · A Symmetric MultiProcessing (SMP) virtual machine (VM) ... interrupt load

[Figure 1: Comparison of I/O mechanisms between physical SMP and virtual SMP. (a) IO-APIC for an SMP physical machine; (b) event channels for an SMP virtual machine.]

Third, unlike the idle process in a traditional OS, which consumes all the unused CPU cycles, the idle process in a paravirtualized guest OS causes the vCPU to voluntarily yield its CPU cycles to the hypervisor. As such, if the guest OS cannot evenly distribute the workload among all available vCPUs to optimally utilize the CPU cycles, the resources allocated to the whole VM will be wasted. Since OS schedulers are mostly unaware of the interrupt load on each processor, extra effort must be made to balance the interrupt load.

Due to insufficient understanding of the problems mentioned above, the seemingly faithful design of current I/O virtualization techniques can result in very poor performance under heavy I/O workloads. In the physical SMP system, interrupts are migrated to prevent a specific processor from being overloaded. In the virtual SMP system, interrupts also have to be migrated to avoid the scheduling delays of the hypervisor.

2.2 Possible Approaches

We group the possible approaches into two categories, depending on the layer where the approach resides.

2.2.1 Hypervisor-level Solution

A non-intrusive approach at the hypervisor layer is to modify the hypervisor scheduler so that it immediately schedules the vCPU that receives external events, regardless of its priority, state, credits, etc.

This approach can mitigate the problem to some extent, as it eliminates the scheduling delays of the targeted vCPU. Unfortunately, it also brings several critical problems. First, without interrupt load balancing from the guest OS, all events will be delivered to only one vCPU (vCPU0 by default), or to some specific vCPUs, making them easily overloaded. Second, the interrupt-bound vCPUs will consume much more CPU cycles than the others, leading to non-optimal resource utilization; though we could modify the hypervisor scheduler to give the targeted vCPU more CPU cycles than the others (in which case the vCPUs of the same SMP-VM are not symmetric at all), determining the disproportion parameter is also a problem, as the interrupt load is highly unpredictable. Third, the context switch overhead will substantially increase, because the hypervisor scheduler needs to keep swapping vCPUs in response to incoming events, making the approach too expensive in practice.

The hypervisor scheduler treats each vCPU as symmetric, in the expectation that all vCPUs behave symmetrically and utilize the resources in a balanced manner. Without help from the guest OS, the hypervisor scheduler cannot fully address the I/O problem.

2.2.2 Guest-level Solution

At the guest layer, one approach is to monitor the interrupt load of each vCPU and periodically reassign all the interrupts to achieve load balance, like irqbalance [7].

This approach can easily make each vCPU consume approximately an equal amount of CPU cycles. However, the main problem is that there is no way to make accurate decisions about which vCPU the interrupts should go to at which moment. Since the vCPUs can be preempted by the hypervisor at any time, which is totally transparent to the guest OS, the processing will be delayed whenever the interrupts are assigned to an offline vCPU. The knowledge gap between the guest and the hypervisor undermines the effectiveness of this approach. For a virtual SMP system, the responsibility of interrupt migration is not only to achieve load balancing, but more importantly, to avoid the hypervisor scheduling delays imposed on interrupt processing. Therefore, without the scheduling status of each vCPU from the hypervisor, it is impossible for the guest OS to precisely determine interrupt migration strategies.

3. DESIGN

In this section, we describe the design goals of vBalance and present its components in detail.

3.1 Design Goals

The following are the design goals of vBalance, intended to enable greater applicability as well as ease of use:

• High performance: the design should efficiently exploit the advantage of SMP-VMs, that is, the ability to leverage more than one vCPU to process interrupts, and therefore achieve high-throughput and low-latency communication more easily.

• Light-weight: to make it scalable to many guest domains, the design must be very light-weight in the hypervisor layer. Also, for ease of development, the modifications required to the existing software must be as few as possible.


• Application-level transparent: the deployment of vBalance should not require any change to the applications, so that they can transparently enjoy the performance benefits.

3.2 Main Concerns and Principles

Based on the discussion in Section 2.2 and the design goals in Section 3.1, we prefer a solution that (1) maximally seeks help from the guest OS to avoid interrupt processing delays, (2) bothers the hypervisor as little as possible, and (3) makes all vCPUs share the interrupt load evenly. By clearly identifying the responsibilities of the hypervisor scheduler and the guest OS in handling interrupts, we are able to derive the design principles of vBalance.

For the hypervisor scheduler, since it is used very frequently to manage all the guest domains, its design and implementation must be very lean and scalable, and it should therefore be kept as simple as possible. We argue that the main responsibility of the hypervisor scheduler is to perform proportional sharing of CPU time among VMs. It should not assume too much about the guest's interrupt processing characteristics, e.g., by favoring the scheduling of one or a few specific vCPUs. For an SMP-VM, the hypervisor should guarantee that each vCPU receives roughly the same amount of CPU cycles, so that all vCPUs look "symmetric" to the guest OS. This is the "mechanism" that the hypervisor should provide to SMP guest domains.

Based on the illusion that all vCPUs are "symmetric", various "policy-level" load balancing strategies can be applied in guest domains. Since the capability of one vCPU to process interrupts is limited, the guest OS should endeavor to utilize all vCPUs to handle interrupts. Unbalanced use of vCPUs can cause low resource utilization, because the idle process of each vCPU in the guest OS will voluntarily yield its CPU cycles to the hypervisor, where they are eventually reallocated to other domains or consumed by the idle domain. Once a vCPU is descheduled, the guest OS should be capable of migrating the interrupts to a running vCPU to avoid processing delays. It is therefore a must for the hypervisor scheduler to pass the necessary information to the guest OS, at least the scheduling status of each vCPU. However, the guest OS should not make any assumption about the scheduling strategies of the underlying hypervisor, or instruct the hypervisor to schedule a specific vCPU. The hypervisor always has a higher privilege than the guest OS. For security reasons, the control flow can only go from the hypervisor to the guest OS, not in the opposite direction.

3.3 vBalance Architecture

vBalance is a cross-layer design residing in both the guest OS layer and the hypervisor layer, as shown in Figure 2. The component in the hypervisor layer is responsible for informing the guest kernel of the scheduling status of each vCPU. The interrupt remap module in the guest OS always tries to bind the interrupts to a running vCPU, while also keeping the interrupt load balanced. vBalance does not violate Xen's split-driver model [15], in which the frontend communicates with a counterpart backend driver in the driver domain via shared memory and event channels. The frontend driver in the guest domain can be netfront, blkfront, etc. In the driver domain, there may exist data multiplexing/demultiplexing activities between the backend and the hardware device driver.

[Figure 2: vBalance architecture. The driver domain hosts the backend and the hardware device driver; the guest domain (SMP-VM) hosts the frontend, the applications and the vBalance remap module; the Xen hypervisor scheduler passes sched_info to the guest's vCPUs.]

For example, the netback driver usually adopts a software network bridge [8] or virtual switch utilities [12, 18] to multiplex the network packets from/to guest domains. Since vBalance does not modify the driver domain, we do not explicitly show the multiplexer component in the figure.

3.3.1 Hypervisor Scheduler

Xen uses the credit scheduler [4] to allocate CPU time proportionally according to each VM's weight. The vCPUs on each physical core are sorted according to their priorities and remaining credit. If a vCPU has consumed more than its allocation, it gets an OVER priority; otherwise it keeps an UNDER priority. Each vCPU receives a 30ms time slice once it gets scheduled. To improve a VM's I/O performance, the credit scheduler introduces a boost mechanism. The basic idea is to temporarily give the vCPU that receives external events a BOOST priority with preemption, which is higher than that of other vCPUs in the UNDER and OVER states. The credit scheduler supports automatic load balancing to distribute vCPUs across all available physical cores.

Normal Path. In most cases, the external events for SMP-VMs can be properly delivered and timely processed by the targeted vCPU, as the guest OS guarantees that the receiver is always one of the running vCPUs. So for an SMP-VM, as long as at least one of its vCPUs is running on a physical core when external events arrive, the processing delay will not happen. We call this the normal path, as no extra reschedule operations are needed, and the current hypervisor scheduler can easily satisfy this scenario without any modification. Since an SMP-VM has more than one vCPU, the likelihood that all vCPUs are offline at the same moment is much smaller than for a UP-VM.

Fast Path. The only circumstance the hypervisor scheduler needs to consider is that, for an SMP-VM, it is possible that all vCPUs are waiting in the runqueues when external events arrive. In this rare case, the hypervisor scheduler needs to explicitly force a reschedule to give the targeted vCPU the opportunity to run. We call this the fast path. Of course, the more vCPUs an SMP-VM has, the more probably event handling follows the normal path. It also depends on the proportion of CPU time allocated to the VM and the density of co-located VMs.

The limitation of the current boost mechanism is that it only boosts a blocked vCPU, and only on the condition that the vCPU has not used up its credit. This introduces scheduling delay for vCPUs that are already waiting in the runqueue. Based on the credit scheduler, we introduce another priority, SMP_BOOST, with preemption, which is higher than all the other priorities. For an SMP-VM, only when none of its vCPUs can get a scheduling opportunity from the normal path does the targeted vCPU receive the SMP_BOOST priority and get scheduled at once.

3.3.2 Interrupt Remap Module in the Guest OS

The remap module of vBalance resides in the guest OS and is shown in detail in Figure 3. It contains three components: the IRQ Load Monitor is responsible for collecting interrupt load statistics for each vCPU; the Balance Analyzer reads the scheduling states of all vCPUs and uses the IRQ statistics to determine whether an interrupt imbalance has occurred; once an imbalance is identified, the interrupts are migrated to another online vCPU, and the IRQ Map Manager takes the remap action.

[Figure 3: vBalance remap module in the guest OS: the IRQ Load Monitor collects IRQ statistics and sched_info for the vCPUs, the Balance Analyzer decides whether a migration is needed, and the IRQ Map Manager performs the remap action.]

Time-based interrupt load balance heuristic. The method for balancing interrupt load among multiple vCPUs basically involves generating a heuristic to assess the interrupt imbalance. It can be based on one or more interrupt sources, such as network load or disk load. When the heuristic satisfies a certain criterion, an interrupt imbalance is identified and a new mapping of interrupts to vCPUs is generated. The heuristic can be generated from one or more quantities, including but not limited to the average load per interrupt on the vCPUs, or the total load of each vCPU. Since the interrupt load on each vCPU changes over time, the heuristic must be evaluated at different times, resulting in a time-based system heuristic. A baseline value is used for comparison to determine whether a sufficiently large interrupt imbalance has occurred.

Measuring an interrupt imbalance. In the guest OS, using the interrupt load data of the vCPUs, a measurement of how balanced the interrupts are across them can be determined. Suppose that an SMP-VM has n vCPUs, denoted as the set VC = {vc_0, vc_1, ..., vc_{n-1}}. For each vCPU vc_i ∈ VC (0 ≤ i ≤ n−1), let ld(vc_i)_t denote its interrupt load at time t. We use the average interrupt load of all vCPUs as the baseline, which is expressed and calculated in Equation (1):

ld_avg(vc)_t = avg( ld(vc_0)_t, ld(vc_1)_t, ..., ld(vc_{n-1})_t )    (1)

There are k online vCPUs (0 ≤ k ≤ n) at time t, denoted as OVC_t = {ovc_0, ovc_1, ..., ovc_{k-1}}, OVC_t ⊆ VC. If a rebalance operation is needed, these online vCPUs are the eligible targets to which the interrupts can be mapped. The online vCPUs are classified based on their current interrupt load. Figure 4 shows the state transition diagram of each vCPU. Generally, there are two states for each vCPU: online and offline, depending on whether the vCPU is running on a physical core or not. To manage the online vCPUs more efficiently, each online vCPU has three sub-states: HOLD, LOW and HIGH. The vCPU that is holding the interrupts is labeled HOLD; for the others, a vCPU whose current interrupt load is above the average level is labeled HIGH, and otherwise it is labeled LOW. In this way, the vCPUs can be sorted more efficiently, and the average load ld_avg(vc)_t can be updated less frequently.
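To make the classification concrete, the following is a minimal C sketch under assumed data structures; the state names follow Figure 4, but the per-vCPU struct and the function itself are illustrative only, not the paper's code:

    /* Sketch: classify the online vCPUs by interrupt load relative to the
     * average of Equation (1). Struct layout and helper are assumptions. */
    enum irq_state { IRQ_HOLD, IRQ_HIGH, IRQ_LOW };

    struct vcpu_irq_info {
        int            online;   /* taken from the sched_info bitmap */
        long           load;     /* ld(vc_i)_t */
        enum irq_state state;
    };

    static void classify_online_vcpus(struct vcpu_irq_info *vc, int n, int holder)
    {
        long sum = 0, avg;
        int i;

        for (i = 0; i < n; i++)
            sum += vc[i].load;
        avg = sum / n;                          /* ld_avg(vc)_t */

        for (i = 0; i < n; i++) {
            if (!vc[i].online)
                continue;                       /* offline vCPUs are ignored */
            if (i == holder)
                vc[i].state = IRQ_HOLD;         /* currently owns the IRQs */
            else
                vc[i].state = vc[i].load > avg ? IRQ_HIGH : IRQ_LOW;
        }
    }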

[Figure 4: vCPU state machine diagram. A vCPU is online or offline depending on whether the hypervisor has it scheduled; an online vCPU is in one of the sub-states HOLD, HIGH or LOW, and moves between them as the IRQs are acquired or released and as its IRQ load crosses the average level or the imbalance threshold.]

To remap the interrupts from one vCPU to another, two scenarios need to be considered: first, the original interrupt holder is already descheduled, in which case there must be a remap operation; second, the original interrupt holder is still running on a physical core, but a sufficiently large interrupt imbalance has occurred, thereby triggering an interrupt migration. To determine an imbalanced condition, we define an imbalance ratio R:

R_t = ld_max(ovc)_t / ld_min(ovc)_t    (2)

If R_t exceeds a predefined threshold value, the interrupts are migrated to the online vCPU that has the minimum interrupt load. Once the new mapping is generated, the hypervisor is notified through hypercalls to rebind the event channels to the new vCPU. The detailed operations are presented in Algorithm 1.
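The threshold comparison of Equation (2) reduces to a one-line integer test; a small sketch (the ratio 1.5 is the value later chosen for the prototype in Section 4.2.3):

    /* Sketch: imbalance test of Equation (2) without floating point.
     * R_t = ld_max / ld_min > 1.5  <=>  2 * ld_max > 3 * ld_min. */
    static int imbalance_exceeded(long ld_max, long ld_min)
    {
        return ld_min > 0 && 2 * ld_max > 3 * ld_min;
    }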

4. IMPLEMENTATION

We have implemented a prototype of vBalance using 64-bit Xen 4.1.2 as the hypervisor and 64-bit Linux 3.2.2 as the paravirtualized guest operating system. The design of vBalance is generic and hence applicable to many other hypervisors (e.g., KVM [26], VMware ESX [41] and Hyper-V [9]). Our implementation aims to minimize the footprint of vBalance by reusing existing Xen and Linux code as much as possible. vBalance requires only a few modifications to the hypervisor layer, involving about 150 lines of code. The implementation in the Linux guest is highly modular, containing less than 300 lines of code.


Algorithm 1: vBalance algorithm

Let prev_vc denote the previous vCPU that holds the IRQs;
Let next_vc denote the next vCPU that will hold the IRQs;

/* precompute the average IRQ load of all vCPUs */
ld_avg(vc)_t ← avg(ld(vc_0)_t, ld(vc_1)_t, ..., ld(vc_{n-1})_t);

Get the online vCPU set OVC_t;
Sort the online vCPUs by their interrupt loads ld(ovc)_t;

/* get the max/min IRQ load of the online vCPUs */
ld_max(ovc)_t ← max(ld(ovc_0)_t, ld(ovc_1)_t, ..., ld(ovc_{k-1})_t);
ld_min(ovc)_t ← min(ld(ovc_0)_t, ld(ovc_1)_t, ..., ld(ovc_{k-1})_t);

for each online vCPU ovc_i do
    if ld(ovc_i)_t > ld_avg(vc)_t then
        ovc_i.state ← HIGH;
    else
        ovc_i.state ← LOW;
    end
end

if prev_vc is not online then
    /* must map the IRQs to an online vCPU */
    next_vc ← {ovc | ld(ovc)_t = ld_min(ovc)_t};
else
    if prev_vc.state is HIGH then
        if ld_max(ovc)_t > ld_min(ovc)_t * threshold then
            /* the imbalance threshold is reached */
            next_vc ← {ovc | ld(ovc)_t = ld_min(ovc)_t};
        else
            next_vc ← prev_vc;    /* no need to remap */
        end
    else
        next_vc ← prev_vc;        /* no need to remap */
    end
end

if next_vc != prev_vc then
    rebind the IRQs to next_vc;   /* final operation */
end

4.1 Xen Hypervisor

The first modification is to the shared_info data structure. The shared_info page is accessed throughout the runtime by both Xen and the guest OS. It is used to pass information relating to the vCPUs and VM states between the guest OS and the hypervisor, such as memory page tables, event status, wallclock time, etc. It is therefore natural to put the vCPU scheduling status information in the shared_info data structure and pass it to the guest OS at runtime. The sched_info field simply contains an unsigned long integer, in which each bit represents the scheduling status of one vCPU (0 – offline, 1 – online).

/* scheduling status of SMP vCPUs */
typedef struct {
    unsigned long vcpu_mask;
} sched_info;

The value of sched_info is updated when the function schedule() is called in the hypervisor scheduler. If the vCPU does not belong to the idle domain, the corresponding bit of vcpu_mask is updated before the context switch.
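A minimal sketch of such an update, written as a fragment meant to live next to schedule() in the scheduler source, is shown below; the helper name and its exact hook point are our assumptions, and shared_info access is simplified compared with Xen's compat wrappers:

    /* Sketch only: keep vcpu_mask in sync with the scheduler's decision,
     * just before the context switch in schedule(). */
    static void update_sched_info(struct vcpu *prev, struct vcpu *next)
    {
        if (!is_idle_vcpu(prev))          /* prev is being descheduled */
            clear_bit(prev->vcpu_id,
                      &prev->domain->shared_info->sched_info.vcpu_mask);

        if (!is_idle_vcpu(next))          /* next is coming online */
            set_bit(next->vcpu_id,
                    &next->domain->shared_info->sched_info.vcpu_mask);
    }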

The second modification is to add fast-path scheduling support for SMP guest domains. This involves a small change to Xen's credit scheduler [4], adding another wake-up branch for SMP vCPUs. We do not make any modification to Xen's split driver model or event channel mechanism. Specifically, in the evtchn_send function, when the value of vcpu_mask is found to be zero, which means that there are no online vCPUs, the targeted vCPU enters the fast path to wake up. Eventually, it is given the SMP_BOOST priority with preemption, and behaves as follows.

/* fast path for SMP vCPUs to wake up */
static void csched_smp_vcpu_wake(struct vcpu *vc)
{
    struct csched_vcpu * const svc = CSCHED_VCPU(vc);
    const unsigned int cpu = vc->processor;

    if (__vcpu_on_runq(svc)) {
        __runq_remove(svc);
    }

    svc->pri = CSCHED_PRI_TS_SMP_BOOST;
    __runq_insert(cpu, svc);

    /* tickle the cpu to schedule svc using a softirq */
    __runq_tickle(cpu, svc);
}
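For completeness, the corresponding check on the event-delivery side can be pictured roughly as follows; this is only a sketch with an assumed helper name and simplified shared_info access, and it omits the normal pending-bit bookkeeping performed by evtchn_send:

    /* Sketch only: decide between the normal path and the fast path when an
     * event is sent to an SMP guest. */
    static void maybe_fast_path_wake(struct domain *d, struct vcpu *target)
    {
        /* Normal path: at least one vCPU of the SMP-VM is online, so the
         * guest-side remap module can already route the event to it. */
        if (d->shared_info->sched_info.vcpu_mask != 0)
            return;

        /* Fast path: no vCPU is online; force the targeted vCPU to run. */
        csched_smp_vcpu_wake(target);
    }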

4.2 Guest Operating System

For the implementation in the guest OS, there are two technical challenges: (1) should vBalance be implemented as a user-level daemon or as part of the kernel code? (2) how should the interrupt load of each vCPU be determined?

4.2.1 User-level or Kernel-level

One possible approach is to implement vBalance as a user-level LSB (Linux Standard Base) daemon. The scheduling status of the vCPUs, vcpu_mask, can be accessed by a user-level program through a pseudo device residing in the /dev or /proc file system. In this way, the footprint at the kernel level is minimized. The most appealing benefit of this approach is high flexibility. Programmers can easily change the load balancing policy, function parameters, etc. The effort needed to develop and debug the program is also much less than that of kernel programming. For example, programmers can read interrupt binding information directly from /proc/interrupts and change it by simply writing to /proc/irq/irq#/smp_affinity.
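For illustration, such a user-level daemon would change a binding roughly as in the following sketch (the IRQ number 25 and the target vCPU 2 are hypothetical):

    /* Sketch of the user-level alternative: move a (hypothetical) IRQ 25 to
     * vCPU 2 by writing a CPU bitmask to its smp_affinity file. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/irq/25/smp_affinity", "w");

        if (f == NULL)
            return 1;
        fprintf(f, "%x\n", 1u << 2);   /* bitmask with only vCPU 2 set */
        fclose(f);
        return 0;
    }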

A very important problem with this approach is how to determine the rebalance interval. Linux irqbalance [7] uses a 10-second rebalance interval, which may not be appropriate in virtual machines, because the program needs to know which vCPUs are online and which are not in a real-time manner. Of course, the user-level program can periodically read the pseudo device to get the scheduling status of each vCPU. However, since the hypervisor can update vcpu_mask at millisecond granularity, and this is totally transparent to user space, the program would have to poll the pseudo device very frequently to sense the change, which would consume a large amount of computing resources. Besides, if the vCPU that the program is running on has just been preempted by the hypervisor, the rebalance function itself would be stalled.

Therefore, we argue that the implementation should reside in kernel space. For a paravirtualized guest kernel, once a vCPU is scheduled, an upcall is executed to check the pending events; in a Linux guest on Xen, this is xen_evtchn_do_upcall. Though a vCPU cannot know when it will be preempted, it knows exactly when it is scheduled again, and thus obtains the updated scheduling status at the right time.


4.2.2 Determine the Interrupt Load of vCPUs

There are two types of interrupt load statistics for each processor: (1) the number of interrupts received, and (2) the time spent processing all received interrupts. In Linux, the results can be obtained by reading /proc/interrupts and /proc/stat directly (to enable interrupt time accounting, the CONFIG_IRQ_TIME_ACCOUNTING option must be enabled when compiling the Linux 3.2.2 kernel; other Unix-like operating systems have similar IRQ statistics facilities). In most cases, only two interrupt sources actually need to be considered: the network interrupt and the disk interrupt. For the other interrupts, such as the virtual timer and the virtual console, the CPU cycles they consume are nearly negligible. At the micro level, the CPU time that two interrupts consume may differ, even for the same interrupt type, because the data each interrupt carries can be of very different sizes. But from a long-term view, if two vCPUs have processed roughly the same number of interrupts, the CPU cycles they consume should be approximately the same. So both types of interrupt load statistics can be used as the measurement method. For simplicity, we choose the second type, which counts the CPU cycles the processor has spent handling all received interrupts.
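The vBalance prototype gathers these counters inside the kernel; purely for illustration, the same per-vCPU irq plus softirq time can be read from user space as in the following sketch (field positions follow the standard /proc/stat layout):

    /* Sketch: read the irq + softirq time (in jiffies) of one vCPU from
     * /proc/stat. Layout of a cpuN line: user nice system idle iowait irq
     * softirq ... */
    #include <stdio.h>
    #include <string.h>

    long irq_load(int cpu)
    {
        char line[256], tag[16];
        long user, nice, sys, idle, iowait, irq, softirq;
        FILE *f = fopen("/proc/stat", "r");

        if (f == NULL)
            return -1;
        snprintf(tag, sizeof(tag), "cpu%d ", cpu);
        while (fgets(line, sizeof(line), f)) {
            if (strncmp(line, tag, strlen(tag)) == 0 &&
                sscanf(line, "%*s %ld %ld %ld %ld %ld %ld %ld",
                       &user, &nice, &sys, &idle, &iowait, &irq, &softirq) == 7) {
                fclose(f);
                return irq + softirq;    /* interrupt-processing time */
            }
        }
        fclose(f);
        return -1;
    }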

4.2.3 vBalance Components

vBalance contains several components in the Linux guest kernel, with a highly modular implementation. We introduce them in detail one by one.

vBalance_init() is used to obtain the interrupt numbers from the specific interrupt handlers, as they are dynamically allocated in the guest kernel. For the network interrupt, the IRQ handler is xennet_interrupt; for the disk interrupt, the IRQ handler is blkif_interrupt. As a prototype implementation, we do not consider a VM configured with multiple virtual NICs or multiple virtual disks, which is likely to be our future research direction.

vBalance_monitor() reads sched_info from the shared memory page, and collects interrupt statistics for each vCPU from time to time. In Linux kernel 3.2.2, ten types of statistics are supported for each processor, including the time normal processes spend executing in user mode (user), the time processes spend executing in kernel mode (system), etc. We use only the two that are directly related to interrupt processing: the time spent servicing hardware interrupts (irq) and the time spent servicing softirqs (softirq).

vBalance_analyze() determines whether there is a need to remap the interrupts to another vCPU, based on the information from vBalance_monitor(). There are two conditions: (1) if the interrupt-bound vCPU is already offline, there must be a remap operation, and we simply select the vCPU that currently has the lowest interrupt load; (2) if the interrupt-bound vCPU is still online, the remap operation is performed only when a severe imbalance among vCPUs is detected, as stated in Section 3.3.2. The imbalance ratio R_t is compared with a predefined threshold value. Currently we define the threshold to be 1.5, which means that the maximum interrupt load of one vCPU will not exceed 1.5 times the minimum interrupt load.

vBalance_remap() is responsible for carrying out the final interrupt remap operation. There are two existing functions, irq_set_affinity() and rebind_irq_to_cpu(), that can be used directly. The first is quite high-level, for general-purpose use, while the second is relatively low-level and designed particularly for the paravirtualized guest. For efficiency, we use rebind_irq_to_cpu to inform the hypervisor of the change of event-notified vCPU. There is a corner case that must be handled: since vBalance is executed with interrupts disabled, if an external event arrives before interrupts are enabled again, the corresponding interrupt handler will not be invoked. So after the function finishes, we need to check the event status again to see whether the handler needs to be called explicitly.
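A sketch of this final step is shown below; rebind_irq_to_cpu() is the existing (file-local) function mentioned above, while the pending-event check uses an assumed helper name:

    /* Sketch of vBalance_remap(): rebind the IRQ's event channel to the
     * newly chosen vCPU, then handle the corner case described above. */
    #include <linux/irq.h>        /* generic_handle_irq() */

    static void vbalance_remap(unsigned int irq, unsigned int next_vcpu)
    {
        rebind_irq_to_cpu(irq, next_vcpu);   /* EVTCHNOP_bind_vcpu underneath */

        /* vBalance runs with interrupts disabled, so an event that arrived
         * while remapping would otherwise be missed: re-check and, if
         * needed, invoke the handler explicitly. */
        if (evtchn_pending_on(irq))          /* assumed helper */
            generic_handle_irq(irq);
    }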

5. EVALUATION

In this section, we present the evaluation of our Xen-based vBalance prototype in detail. Both micro-benchmarks and application-level benchmarks are used to evaluate the effectiveness of vBalance.

5.1 System Configuration

Our experiments are performed on several Dell PowerEdge M1000e blade servers, connected with a Gigabit Ethernet switch. Each server is equipped with two quad-core 2.53GHz Intel Xeon 5540 CPUs, 16GB of physical memory, two GbE network cards and two 250GB SATA hard disks (RAID-1 configured). We use a 64-bit Xen hypervisor (version 4.1.2) and a 64-bit Linux kernel (version 3.2.2) as the guest OS.

We run a set of experiments on the following three systems for the purpose of comparison:

• Xen-native: unmodified Xen 4.1.2 hypervisor with vanilla Linux 3.2.2 guest OS, as the baseline.

• Irqbalance: unmodified Xen 4.1.2 hypervisor with vanilla Linux 3.2.2 guest OS, running a traditional irqbalance [7] daemon inside.

• vBalance: with vBalance deployed in both the Xen 4.1.2 hypervisor and the Linux 3.2.2 guest OS.

5.2 Micro-level Benchmarks

This section presents the evaluation results of network performance experiments using micro-level benchmarks. The goal of the tests is to answer one question: even if the SMP-VM gets CPU resources, can the CPU cycles be properly used to serve I/O requests? The SMP-VM under test was configured with 4 vCPUs and 2GB of physical memory. We created an extreme case for stress tests by pinning all four vCPUs to one physical core, to see how serious the problem is and how much improvement vBalance can achieve. The setup guarantees that: (1) the SMP-VM can always get CPU cycles, since it owns a dedicated core; (2) the vCPUs are stacked in one runqueue, so that every vCPU suffers queueing delays. To trigger vCPU scheduling in the hypervisor, we ran a four-threaded CPU-burn script in the guest to make all vCPUs runnable. The driver domain ran on separate cores so that it would not affect the guest domains.

5.2.1 Network Latency

To gain a microscopic view of network latency behavior, we used ping with a 0.1 second interval (10 ping operations per second) to measure the round-trip time from one physical server to the SMP-VM. Each test lasted for 60 seconds.


[Figure 5: Ping RTT evaluation results. (a) Xen native; (b) with irqbalance; (c) with vBalance. Each plot shows ping RTT (ms) over a 60-second test.]

[Figure 6: Netperf throughput evaluation results. (a) TCP bandwidth (TCP_STREAM test) and (b) UDP bandwidth (UDP_STREAM test): average throughput (Mbps) versus message size (KB). (c) TCP transactions (TCP_RR test) and (d) UDP transactions (UDP_RR test): average throughput (trans./s) versus request size / response size (bytes). Each plot compares xen-native, with-irqbalance and with-vBalance.]

Figure 5 shows the RTT results. It can be seen that with default Xen and with Linux irqbalance, the RTT varies widely, with peaks of 90ms. This can be explained by the fact that Xen's credit scheduler uses a 30ms time slice, so the maximum scheduling latency of each vCPU of a 4-vCPU VM is 3 × 30 ms. vBalance effectively reduces the network RTTs to less than 0.5ms, with no spikes observed. We can draw the conclusion that with Xen's credit scheduler: (1) even if the whole SMP-VM always gets CPU resources, large network latencies can still happen if the vCPU responsible for the network interrupt cannot get CPU cycles when I/O events arrive; (2) Linux irqbalance running in the VM does not show any benefit in reducing network latency.

5.2.2 Network Throughput

We used netperf [10] to evaluate both TCP and UDP network throughput. The SMP-VM hosted a netserver process, and we ran a netperf process as the client on another physical machine. We wrote a script to automate the testing process. Each test lasted for 60 seconds and was repeated three times to calculate the average value.

Figure 6 (a) and (b) present the bandwidth testing results, using varied message sizes. The TCP_STREAM evaluation results show that vBalance achieves more than twice the throughput of default Xen, while irqbalance shows no apparent performance improvement. The results of UDP_STREAM are even better than those of TCP_STREAM, with the maximum performance reaching 2.8 times that of default Xen. This is because TCP has to generate bidirectional traffic (SYN and ACK packets) to maintain reliable communication, whereas UDP traffic is unidirectional and can therefore transmit more data. Figure 6 (c) and (d) show the transaction results for varied request/response size combinations. The combinations are designed to reflect the fact that network request packets are usually small whereas the replied data from servers can be quite large. The transaction rate decreases as the request and response packet sizes increase. Our solution significantly outperforms both default Xen and irqbalance in all test cases. Compared with default Xen, in the TCP_RR tests the maximum improvement of vBalance is 86%, reached at the "512/2K" test case; in the UDP_RR tests, vBalance achieves up to 102% improvement, also at the "512/2K" test case.


5.3 Application-level Benchmarks

In this section, we use two application-level benchmarks to measure the performance improvement of vBalance. The SMP-VM under test was configured with 4 vCPUs. This time, we allocated four physical cores to host the SMP-VM. To create a consolidated scenario, we ran eight CPU-intensive UP-VMs on the same four physical cores as the background. So on average, there were two co-located vCPUs per physical core competing for CPU cycles with the SMP-VM's vCPUs. All vCPUs were subject to the load balancing activity of Xen's credit scheduler. As before, the driver domain ran on separate cores so that it would not affect the guest domains.

5.3.1 Apache Olio

[Figure 7: The experimental setup of the Apache Olio benchmark. The SMP-VM under test runs Apache Olio and Memcached on the Xen hypervisor alongside background VMs; the Faban driver (with a Tomcat Geocoder), the MySQL server and the NFS server run on separate physical machines.]

Apache Olio [2, 37] is a Web 2.0 toolkit that helps evaluate the suitability and performance of web technologies. It features a social-event calendar, where users can create and tag events, search, send comments, etc. Apache Olio includes a Markov-based distributed workload generator and data collection tools. We use the Apache Olio (version 2.0) PHP5 implementation for our experiments, which contains four components: (1) an Apache web server acting as the web front-end to process requests, (2) a MySQL server that stores user profiles and event details, (3) an NFS server that stores user files and event-specific data, and (4) a Faban [6] driver that generates the workload to simulate external users' requests.

Figure 7 shows our testbed configuration. Apart from the Olio application and the Memcached server, the other components were deployed on physical machines. The SMP-VM was configured with 4GB of physical memory, with 1GB used for the Memcached service. We configured the Faban driver to run for 11 minutes, including a 30-second ramp-up, a 600-second steady state and a 30-second ramp-down. Two agents were used to control 400 concurrent users generating different types of requests, with each agent taking care of 200 users. For fairness, the MySQL server was configured to reload the database tables in every test. We evaluated the number of operations performed by Apache Olio and the average response times.

Table 1 shows the total count of the different operations performed by Apache Olio. Regarding the overall rate (ops/sec), vBalance achieves 13.0% and 13.2% performance improvement compared with default Xen and irqbalance, respectively. Regarding each single operation, the maximum improvement is 20.6% compared with default Xen (reached at the Login operation), and 16.9% compared with irqbalance (reached at the AddPerson operation). One important observation is that without vBalance, many operations are reported as failures. For example, the Login operation with default Xen reported a connection timeout error 565 times, 13.9% of all login trials. This is likely caused by overly long scheduling delays of the corresponding vCPU. With vBalance, we saw no failure case.

The average response times of each operation are presented in Table 2. The significant improvements are not surprising, in that the scheduling latencies of vCPUs in the hypervisor eventually translate into response delays for user requests. With vBalance, the response times were significantly reduced. The maximum improvements are observed at the Login operation for both default Xen and irqbalance, 90.6% and 91.0% respectively. These results correspond to the large number of connection timeout errors of the Login operation in Table 1.

Figure 8 (a) shows the CPU cycles each vCPU consumed during the Apache Olio tests. It is foreseeable that default Xen utilizes vCPU0 much more than the other vCPUs, as all interrupts are delivered to vCPU0 by default. However, irqbalance also cannot achieve a balanced load among the vCPUs. This may be explained by the fact that irqbalance makes rebalance decisions based not only on vCPU load statistics, but also on cache affinity, power consumption, etc. As the guest OS is not able to obtain this hardware-related information, the decisions that irqbalance makes are mostly inappropriate. Figure 8 (b) shows the number of context switches on the four physical cores. The results of default Xen and irqbalance are very close. This makes sense, as irqbalance wakes up only every ten seconds, so its effect on the number of context switches is nearly negligible. Since vBalance introduces a fast-path mode for the SMP-VM's vCPUs in the hypervisor scheduler, it causes more context switches. Compared with default Xen, vBalance introduced only 69% more context switches, keeping the 30ms time slice unchanged. This shows that the design of vBalance at the hypervisor level is really light-weight.

5.3.2 Dell DVDStore

[Figure 9: The experimental setup of the Dell DVDStore benchmark. The SMP-VM under test runs Dell DVDStore on the Xen hypervisor alongside background VMs; the driver machine and the MySQL server run on separate physical machines.]

Dell DVDStore [5] is an OLTP-style e-commerce application that simulates users browsing an online DVD store and purchasing DVDs. It includes a back-end database component, a web application layer and a driver program.


Table 1: Apache Olio benchmark results – throughput (count: success/failure)

Operation        1. Xen native   2. with irqbalance   with vBalance   Improv–1   Improv–2
HomePage         11177/12        11043/12             12493/0         + 11.8%    + 13.1%
Login            4076/565        4227/382             4917/0          + 20.6%    + 16.3%
TagSearch        14271/8         14297/16             16142/0         + 13.1%    + 12.9%
EventDetail      10580/6         10424/9              11791/0         + 11.4%    + 13.1%
PersonDetail     1091/1          1110/3               1241/0          + 13.7%    + 11.8%
AddPerson        358/0           349/0                408/0           + 14.0%    + 16.9%
AddEvent         841/1           865/3                903/0           + 7.4%     + 4.4%
Rate (ops/sec)   70.657          70.525               79.825          + 13.0%    + 13.2%

Table 2: Apache Olio benchmark results – average response times (seconds)

Operation        1. Xen native   2. with irqbalance   with vBalance   Reduct–1   Reduct–2
HomePage         1.387           1.346                0.212           – 84.7%    – 84.2%
Login            0.958           1.002                0.090           – 90.6%    – 91.0%
TagSearch        1.657           1.736                0.306           – 81.5%    – 82.4%
EventDetail      1.438           1.462                0.244           – 83.0%    – 83.3%
PersonDetail     1.825           1.785                0.339           – 81.4%    – 81.0%
AddPerson        2.852           3.012                0.800           – 71.9%    – 73.4%
AddEvent         3.341           3.722                0.972           – 70.9%    – 73.9%

[Figure 8: The hypervisor statistics of the Apache Olio benchmark. (a) The runtime of each vCPU (seconds) under xen-native, with-irqbalance and with-vBalance; (b) context switch counts over time on the physical cores for the three systems.]

The driver program simulates users logging in, browsing for DVDs by title, actor or category, adding selected DVDs to their shopping cart, and then purchasing those DVDs. We chose the DVDStore (version 2.1) PHP5 implementation to evaluate the effectiveness of vBalance. The setup is shown in Figure 9. We set the MySQL database size to 4GB. We started 16 threads simultaneously to saturate the application server's capacity. Each test lasted for 11 minutes, including a 1-minute warm-up and 10 minutes of stable running. The driver reports testing results every 10 seconds.

Figure 10 (a) shows the average throughput and Figure 10 (b) the average request latency. vBalance improves the average throughput from ∼210 orders/sec to ∼260 orders/sec (about a 24% improvement), and reduces the average request latency from ∼60ms to ∼46ms (about a 30% improvement). Figure 11 (a) shows the CPU cycles each vCPU consumed during the three tests. Compared with the Apache Olio results in Figure 8 (a), the SMP-VM's unbalanced vCPU load is more serious. This is because: 1) the Apache Olio tests can be configured to use a Memcached server to store recently used data, so the application does not need to read from the NFS server, which reduces network traffic; 2) Apache Olio can preload the database tables (user profiles, event information, etc.) into memory before the Faban driver sends user requests, so during the website operations the application server hardly needs to bother the MySQL server, which also saves a lot of network traffic. The situation with Dell DVDStore is quite different: every user request triggers database operations on the MySQL server, which eventually translate into network traffic. vCPU0 was therefore interrupted much more frequently than in the Apache Olio tests. Figure 11 (b) shows the context switch counts on the four physical cores hosting the experiments. Default Xen performed very similarly to irqbalance. vBalance introduced only 35% more context switch overhead than default Xen.

6. RELATED WORK

6.1 Hypervisor Scheduling

6.1.1 Scheduling for I/O

Significant effort has been devoted to addressing the VM's I/O performance problem in recent years. Task-level solutions [25, 34] map I/O-bound tasks from the guest OS directly to physical CPUs. Since the characteristics of tasks may vary over time, such solutions often need additional effort to predict these changes. Besides, the implementation involves heavy modifications from the guest OS scheduler down to the hypervisor scheduler.


Figure 10: The evaluation results of the Dell DVDStore benchmark. (a) Average throughput (orders/s) over time; (b) average request latency (ms) over time; both for xen-native, with-irqbalance, and with-vBalance.

Figure 11: The hypervisor statistics of the Dell DVDStore benchmark. (a) The runtime of each vCPU (vCPU0–vCPU3, in seconds); (b) context switch counts over time (0–600 s); both for xen-native, with-irqbalance, and with-vBalance.

Several non-intrusive approaches [22, 27, 32, 45] use real-time schedulers to shorten the vCPU scheduling latencies for event processing. First, they require complex configuration and careful parameter tuning and selection, which may not be feasible in a cloud environment with dynamic placement and migration of VMs; our solution introduces no extra parameters or special configuration and thus brings no additional management burden. Second, to meet the deadlines of some VMs, the time slice they use is very small (less than 1ms in most cases), which excessively increases the context switch overhead; our solution does not change the time slice (30ms) used in Xen's credit scheduler. Research [30] proposes to run a polling thread to monitor event status and schedule the vCPU as soon as events arrive; the drawback is that an additional physical core is needed to run the polling thread, resulting in low resource utilization. Other work [23, 46] proposes to dynamically reduce the time slice length used in the hypervisor scheduler, so as to decrease the waiting time of each vCPU in the runqueue. Clearly, these approaches sharply increase the number of context switches among VMs: with a time slice of 0.1ms as suggested in [23], the context switch overhead in theory becomes 300 times that of Xen's credit scheduler with its 30ms slice (30ms / 0.1ms = 300).

Another common problem of the above solutions is that none of them is designed for SMP-VMs. Without cooperation from the guest OS to balance the interrupt load among multiple vCPUs, the potential to improve I/O performance at the hypervisor layer is quite limited.

6.1.2 Co-scheduling for SMP Virtual Machines

Most research on SMP-VMs focuses on the "synchronization" problem inside the guest OS. Lock contention among multiple processors is a well-known issue that degrades a parallel application's performance by serializing its execution. With the OS running in a VM, however, the synchronization latency of spinlocks becomes more serious, as the vCPU holding a lock may be preempted by the hypervisor at any time. Research [40] identifies this problem as LHP (Lock-Holder Preemption): the vCPU holding a lock is preempted by the hypervisor, resulting in much longer waiting times for the other vCPUs trying to acquire the lock. They propose several techniques to address the LHP problem, including intrusively augmenting the guest OS with a delayed preemption mechanism, non-intrusively monitoring all switches between user level and kernel level to determine safe preemption points, and making the hypervisor scheduler locking-aware by introducing a preemption window. Research [38] proposes "balance scheduling", dynamically setting CPU affinity to guarantee that no vCPU siblings are in the same CPU's runqueue. Research [43] addresses this problem by detecting spinlocks with long waiting times and then determining the co-scheduling of vCPUs. Research [14] uses gray-box knowledge to infer the concurrency and synchronization of guest tasks. VMware proposes the concept of "skew" to guarantee synchronized execution rates between vCPUs [3].

All the above approaches are in fact variants of the co-scheduling algorithm proposed in [33], which schedules concurrent threads simultaneously to reduce synchronization latency.


For SMP-VMs, the vCPUs of the same VM are co-scheduled: if physical cores are available, the vCPUs are mapped one-to-one onto physical cores to run simultaneously, so that if one vCPU in the VM is running, a second vCPU is co-scheduled and they execute nearly synchronously. Though this may mitigate the negative effect of spinlocks inside the guest OS to some extent, it does little to reduce interrupt processing delays, as vCPU scheduling latencies from the hypervisor are inherently unavoidable. Besides, the CPU fragmentation it introduces can reduce CPU utilization [28]. To the best of our knowledge, we are the first to use an interrupt load balancing technique to improve I/O performance for SMP-VMs.

6.2 Linux Irqbalance

Irqbalance [7] is a Linux daemon that distributes interrupts over the processors and cores of SMP physical machines. Its design goal is to find a balance between power savings and optimal performance. By periodically analyzing the interrupt load on each CPU (every 10 seconds by default), interrupts are reassigned among the eligible CPUs. Irqbalance also considers cache-domain affinity, trying to give each interrupt a greater chance of having its handler's data already in cache. It automatically determines whether the system should work in power mode or performance mode, according to the system's workload.

Because the underlying execution environment of the OS differs significantly in a virtualized world, a solution designed around physical-hardware assumptions is not effective at all. First, irqbalance has no knowledge of the scheduling status of each vCPU, so it cannot correctly determine which vCPUs are eligible; moreover, a rebalance interval of 10 seconds is far too long, as the hypervisor schedules vCPUs at millisecond granularity. Second, cache behavior is quite difficult to predict even for the hypervisor, so it is infeasible for the guest OS to manage the cache directly. Third, power saving is far more effective when done in the hypervisor, which accesses the hardware directly.
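For reference, the mechanism irqbalance (and any guest-level balancer) relies on is the standard Linux IRQ-affinity interface under /proc. The minimal sketch below is our own illustration, not the irqbalance or vBalance implementation; the IRQ number and vCPU mask are made-up example values.

    /* Illustrative only: steer one guest IRQ to a chosen vCPU by rewriting
     * its affinity mask via /proc, the same interface irqbalance uses. */
    #include <stdio.h>

    /* Write a CPU bitmask (hex string) to /proc/irq/<irq>/smp_affinity. */
    static int set_irq_affinity(int irq, unsigned int cpu_mask)
    {
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
        f = fopen(path, "w");
        if (!f)
            return -1;                 /* IRQ not present or no permission */
        fprintf(f, "%x\n", cpu_mask);  /* e.g. 0x2 binds the IRQ to vCPU1 */
        fclose(f);
        return 0;
    }

    int main(void)
    {
        /* Example: move (hypothetical) IRQ 25 of the virtual NIC to vCPU1. */
        if (set_irq_affinity(25, 0x2) != 0)
            perror("set_irq_affinity");
        return 0;
    }

Writing the mask only re-targets future interrupts of that source; whether the chosen vCPU is actually running at that moment is precisely the scheduling information a guest-only daemon such as irqbalance lacks.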

6.3 Hardware-based Solutions for Directed I/O

Hardware-based approaches assign dedicated devices to VMs and allow direct hardware access from within the guest OS. In these approaches, performance-critical I/O operations can be carried out by interacting with the hardware directly from the guest VM. For example, Intel VT [39] (including VT-x, VT-d, VT-c, etc.) provides platform hardware support for DMA and interrupt virtualization [13]. DMA remapping translates the address in a DMA request issued by an I/O device, which uses the VM's Guest Physical Address (GPA), into the corresponding Machine Physical Address (MPA). Interrupt remapping hardware distinguishes interrupt requests from specific devices and routes them to the appropriate VMs to which the respective devices are assigned. By offloading many capabilities into hardware and simplifying the execution path, these techniques can greatly improve the I/O performance of VMs.
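To make the remapping step concrete, the toy sketch below models DMA remapping as a per-VM lookup from a device-issued guest-physical page to the machine-physical page backing it. This is a conceptual illustration only (the frame numbers are invented); real IOMMU hardware walks per-device translation tables programmed by the hypervisor.

    /* Conceptual model of DMA remapping: translate a device-issued guest
     * physical address (GPA) to the machine physical address (MPA) using a
     * per-VM page map. Table contents and sizes are made-up example values. */
    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT  12
    #define GUEST_PAGES 4   /* toy guest with four pages of memory */

    /* gpa_to_mpa_page[i] = machine frame backing guest frame i */
    static const uint64_t gpa_to_mpa_page[GUEST_PAGES] = {
        0x81234, 0x81235, 0x9abcd, 0x9abce
    };

    static int dma_remap(uint64_t gpa, uint64_t *mpa)
    {
        uint64_t gfn = gpa >> PAGE_SHIFT;
        if (gfn >= GUEST_PAGES)
            return -1;                      /* fault: DMA outside the VM */
        *mpa = (gpa_to_mpa_page[gfn] << PAGE_SHIFT) | (gpa & 0xfff);
        return 0;
    }

    int main(void)
    {
        uint64_t mpa;
        if (dma_remap(0x2040, &mpa) == 0)   /* guest page 2, offset 0x40 */
            printf("GPA 0x2040 -> MPA 0x%llx\n", (unsigned long long)mpa);
        return 0;
    }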

However, these approaches have several limitations. First, they sacrifice a key advantage of the dedicated driver-domain model: device-driver isolation in a safe execution environment, which prevents guest domains from being corrupted by buggy drivers. Instead, a virtual machine would have to include device drivers for a large variety of devices, increasing its size, complexity, and maintenance burden. Second, they require special support in both processors and I/O devices, such as self-virtualized PCI devices [11] that present multiple logical interfaces, thereby increasing hardware cost. Third, using such intelligent hardware for VMs is already a very complicated task [19, 35], and it also complicates many other functionalities such as safety protection [44], transparent live migration [24], checkpointing/restore, etc. Last and most importantly, even with hardware support like DMA remapping and interrupt remapping, the processing of an interrupt still depends on whether the targeted vCPU is scheduled in time by the hypervisor.

7. DISCUSSIONS AND FUTURE WORK

Time Slice of the Hypervisor Scheduler. Since our solution mainly relies on guest-level strategies to migrate interrupts among vCPUs, it is independent of the time slice used in the hypervisor scheduler. Therefore, using a longer time slice for SMP-VMs should cause fewer context switches. For the guest OS, since each vCPU can run longer at a time, the frequency of interrupt migration can be reduced. Research [23] divides the physical cores into several groups (e.g. normal cores and fast-tick cores) and uses dedicated fast-tick cores to schedule I/O-bound vCPUs. We are inspired to use "slow-tick" cores to host SMP-VMs, with vBalance running in the guest OS. Considering the "synchronization" problem discussed in the related work [14, 38, 40, 43], how this approach can work effectively with co-scheduling is worth investigating.

Extension of the Algorithm. The current load balancing strategy migrates interrupts to only one vCPU (many-to-one). This simplifies the design and implementation, and works well when the number of interrupt sources is small (e.g. one NIC plus one disk). However, if the SMP-VM is configured with multiple NICs or disks and applications access them simultaneously, the current solution may not balance the workload optimally. Therefore, a many-to-many interrupt mapping algorithm is desirable in the future: the historical workload of each interrupt source could be used to predict its future pattern, and based on these heuristics the interrupts would be distributed across multiple online vCPUs, as sketched below. The problem is complicated by the need to trade off fairness against efficiency.
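A minimal sketch of such a heuristic, assuming hypothetical per-source rate estimates and an online-vCPU bitmap (none of this is part of the current prototype): assign each source, heaviest first, to the online vCPU with the smallest predicted load.

    /* Greedy many-to-many sketch: assign each interrupt source to the online
     * vCPU with the smallest predicted load. All structures are illustrative. */
    #include <stdio.h>

    #define NR_VCPUS 4
    #define NR_IRQS  3

    struct irq_src { int irq; unsigned long rate; };   /* recent interrupts/sec */

    int main(void)
    {
        int online[NR_VCPUS] = { 1, 1, 0, 1 };          /* vCPU2 preempted */
        unsigned long load[NR_VCPUS] = { 0 };
        struct irq_src src[NR_IRQS] = {                 /* e.g. two NICs + one disk */
            { 25, 9000 }, { 26, 7000 }, { 27, 1200 }
        };

        /* Sources are assumed pre-sorted by descending predicted rate. */
        for (int i = 0; i < NR_IRQS; i++) {
            int best = -1;
            for (int v = 0; v < NR_VCPUS; v++)
                if (online[v] && (best < 0 || load[v] < load[best]))
                    best = v;
            if (best < 0)
                break;                      /* no vCPU online: nothing to assign */
            load[best] += src[i].rate;
            printf("IRQ %d -> vCPU%d\n", src[i].irq, best);
        }
        return 0;
    }

A real implementation would refresh the rate estimates periodically and re-run the assignment whenever the set of online vCPUs changes.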

Cache-Awareness. Cache management is known to be a complicated problem for VMs, subject to vCPU migrations among physical cores and VM migrations among physical hosts. Modern CPU hardware is increasingly NUMA-style, with each node owning a large shared LLC (Last Level Cache), e.g., Intel's Nehalem [36]. A future hypervisor scheduler should be cache-aware, letting vCPUs benefit fairly from the hardware cache. Cache partitioning [29] has been proposed to address conflicting accesses to shared caches, and Virtual Private Caches (VPCs) [31] provide hardware support for the Quality of Service (QoS) of Virtual Private Machines (VPMs). For the guest OS, the goal is to always migrate interrupts to cache-hot vCPUs instead of cache-cold ones. To make the guest OS aware of whether a vCPU is cache-hot or cache-cold, the CPU-tree information needs to be passed to the guest level; a possible selection policy is sketched below. How interrupt load balancing inside the guest OS can benefit from cache-awareness remains a research problem.
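One possible shape of that policy, assuming the hypervisor exported, for each vCPU, the LLC domain of the physical core it last ran on (an interface that does not exist today and is purely hypothetical here): prefer an online vCPU in the same LLC domain as the interrupt's previous target, and fall back to any online vCPU otherwise.

    /* Cache-aware target selection sketch. The per-vCPU core/LLC information
     * is a hypothetical input that would have to be exported by the hypervisor. */
    #include <stdio.h>

    #define NR_VCPUS 4

    struct vcpu_info {
        int online;     /* currently running on some physical core */
        int llc_domain; /* LLC domain of the core it last ran on */
    };

    /* Pick an online vCPU, preferring one in the same LLC domain as the
     * interrupt's previous target so its handler data is likely still cached. */
    static int pick_target(const struct vcpu_info *v, int prev_llc)
    {
        int fallback = -1;
        for (int i = 0; i < NR_VCPUS; i++) {
            if (!v[i].online)
                continue;
            if (v[i].llc_domain == prev_llc)
                return i;            /* cache-hot candidate */
            if (fallback < 0)
                fallback = i;        /* remember any online vCPU */
        }
        return fallback;             /* may be -1 if the whole VM is preempted */
    }

    int main(void)
    {
        struct vcpu_info v[NR_VCPUS] = {
            { 0, 0 }, { 1, 1 }, { 1, 0 }, { 1, 1 }   /* made-up example state */
        };
        printf("migrate IRQ to vCPU%d\n", pick_target(v, 0));
        return 0;
    }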


8. CONCLUSION

High-performance I/O virtualization is in constant demand in the data-intensive cloud computing era. In this paper, we propose a new I/O virtualization approach called vBalance for SMP virtual machines. vBalance is a cross-layer solution that takes advantage of guest-level help to accelerate I/O by adaptively and dynamically migrating interrupts from a preempted vCPU to a running one. Unlike traditional approaches that impose all the pressure on the hypervisor scheduler, vBalance exploits the potential of the guest operating system and is quite light-weight in the hypervisor. vBalance is purely software-based and does not require special hardware support. To demonstrate the idea, we developed a prototype using Xen 4.1.2 and Linux 3.2.2. Our stress tests with micro-level benchmarks show that vBalance significantly reduces network latency and greatly improves network throughput. The experimental results with representative cloud-style applications show that vBalance clearly outperforms the original Xen, achieving much lower response times and higher throughput. Overall, vBalance makes SMP virtual machines more appealing for applications to deploy.

9. ACKNOWLEDGMENTS

We thank the anonymous reviewers for their insightful comments. This research is supported by a Hong Kong RGC grant HKU 7167/12E and in part by a Hong Kong UGC Special Equipment Grant (SEG HKU09).

10. REFERENCES

[1] Amazon EC2: http://aws.amazon.com/ec2/.
[2] Apache Olio: http://incubator.apache.org/olio/.
[3] Co-scheduling SMP VMs in VMware ESX Server: http://communities.vmware.com/docs/doc-4960.
[4] Credit Scheduler: http://wiki.xensource.com/xenwiki/creditscheduler.
[5] Dell DVD Store: http://linux.dell.com/dvdstore/.
[6] Faban: http://java.net/projects/faban/.
[7] Irqbalance: http://www.irqbalance.org/.
[8] Linux bridge: http://www.losurs.org/docs/ldp/howto/pdf/bridge-stp-howto.pdf.
[9] Microsoft Hyper-V Server: http://www.microsoft.com/hyper-v-server/.
[10] Netperf: http://www.netperf.org/.
[11] PCI-SIG I/O virtualization specifications: http://www.pcisig.com/specifications/iov/.
[12] VMware Virtual Networking Concepts: http://www.vmware.com/files/pdf/virtual networking concepts.pdf.
[13] Intel virtualization technology for directed I/O. Architecture Specification – Revision 1.3, Intel Corporation, February 2011.

[14] Y. Bai, C. Xu, and Z. Li. Task-aware based co-scheduling for virtual machine system. In Proceedings of the 2010 ACM Symposium on Applied Computing (SAC), pages 181–188.
[15] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP), volume 37, pages 164–177, 2003.
[16] S. K. Barker and P. Shenoy. Empirical evaluation of latency-sensitive application performance in the cloud. In Proceedings of the first annual ACM SIGMM conference on Multimedia systems (MMSys), pages 35–46, 2010.
[17] F. Bellard. QEMU, a fast and portable dynamic translator. In USENIX 2005 Annual Technical Conference, pages 41–41.
[18] P. Ben, P. Justin, K. Teemu, A. Keith, C. Martin, and S. Scott. Extending networking into the virtualization layer. In HotNets-VIII, 2009.
[19] M. Ben-Yehuda, J. Xenidis, M. Ostrowski, K. Rister, A. Bruemmer, and L. V. Doorn. The price of safety: Evaluating IOMMU performance. In The 2007 Ottawa Linux Symposium (OLS), pages 9–19.
[20] L. Cheng and C.-L. Wang. Network performance isolation for latency-sensitive cloud applications. Future Generation Computer Systems, 2012.
[21] L. Cheng, C.-L. Wang, and S. Di. Defeating network jitter for virtual machines. In The 4th IEEE International Conference on Utility and Cloud Computing (UCC), pages 65–72, 2011.
[22] S. Govindan, A. R. Nath, A. Das, B. Urgaonkar, and A. Sivasubramaniam. Xen and co.: communication-aware CPU scheduling for consolidated Xen-based hosting platforms. In Proceedings of the 3rd International Conference on Virtual Execution Environments (VEE), pages 126–136, 2007.
[23] Y. Hu, X. Long, J. Zhang, J. He, and L. Xia. I/O scheduling model of virtual machine based on multi-core dynamic partitioning. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC), pages 142–154, 2010.
[24] A. Kadav and M. M. Swift. Live migration of direct-access devices. ACM SIGOPS Operating Systems Review, 43:95–104, July 2009.
[25] H. Kim, H. Lim, J. Jeong, H. Jo, and J. Lee. Task-aware virtual machine scheduling for I/O performance. In Proceedings of the 5th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE), pages 101–110, 2009.
[26] A. Kivity, Y. Kamay, D. Laor, U. Lublin, and A. Liguori. kvm: the Linux virtual machine monitor. In The 2007 Ottawa Linux Symposium (OLS), pages 225–230.
[27] M. Lee, A. S. Krishnakumar, P. Krishnan, N. Singh, and S. Yajnik. Supporting soft real-time tasks in the Xen hypervisor. In Proceedings of the 6th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE), volume 45, pages 97–108, 2010.

[28] W. Lee, M. Frank, V. Lee, K. Mackenzie, and L. Rudolph. Implications of I/O for gang scheduled workloads. In Job Scheduling Strategies for Parallel Processing, volume 1291, pages 215–237, 1997.

[29] J. Lin, Q. Lu, X. Ding, Z. Zhang, X. Zhang, and P. Sadayappan. Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems. In IEEE 14th International Symposium on High Performance Computer Architecture (HPCA), pages 367–378, 2008.
[30] J. Liu and B. Abali. Virtualization polling engine (VPE): using dedicated CPU cores to accelerate I/O virtualization. In Proceedings of the 23rd International Conference on Supercomputing (ICS), pages 225–234, 2009.
[31] K. J. Nesbit, J. Laudon, and J. E. Smith. Virtual private caches. In Proceedings of the 34th Annual International Symposium on Computer Architecture (ISCA), pages 57–68, 2007.
[32] D. Ongaro, A. L. Cox, and S. Rixner. Scheduling I/O in virtual machine monitors. In Proceedings of the 4th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE), pages 1–10, 2008.
[33] J. K. Ousterhout. Scheduling techniques for concurrent systems. In International Conference on Distributed Computing Systems (ICDCS), pages 22–30, 1982.
[34] R. Rivas, A. Arefin, and K. Nahrstedt. Janus: a cross-layer soft real-time architecture for virtualization. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC), pages 676–683, 2010.
[35] J. Shafer, D. Carr, A. Menon, S. Rixner, A. L. Cox, W. Zwaenepoel, and P. Willmann. Concurrent direct network access for virtual machine monitors. In International Symposium on High Performance Computer Architecture (HPCA), pages 306–317, 2007.
[36] R. Singhal. Inside Intel next generation Nehalem microarchitecture. In Intel Developer Forum (IDF) presentation, August 2008.
[37] W. Sobel, S. Subramanyam, A. Sucharitakul, J. Nguyen, H. Wong, A. Klepchukov, S. Patil, O. Fox, and D. Patterson. Cloudstone: Multi-platform, multi-language benchmark and measurement tools for Web 2.0. In The First Workshop on Cloud Computing and Its Applications (CCA), 2008.
[38] O. Sukwong and H. S. Kim. Is co-scheduling too expensive for SMP VMs? In Proceedings of the Sixth Conference on Computer Systems (EuroSys), pages 257–272, 2011.
[39] R. Uhlig, G. Neiger, D. Rodgers, A. Santoni, F. Martins, A. Anderson, S. Bennett, A. Kagi, F. Leung, and L. Smith. Intel virtualization technology. Computer, 38:48–56, 2005.
[40] V. Uhlig, J. LeVasseur, E. Skoglund, and U. Dannowski. Towards scalable multiprocessor virtual machines. In Proceedings of the 3rd Conference on Virtual Machine Research and Technology Symposium (VM), Volume 3, pages 4–4, 2004.

[41] C. A. Waldspurger. Memory resource management in VMware ESX Server. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI), volume 36, pages 181–194, 2002.
[42] G. Wang and T. Ng. The impact of virtualization on network performance of Amazon EC2 data center. In The 29th Annual IEEE International Conference on Computer Communications (INFOCOM), pages 1–9, 2010.
[43] C. Weng, Q. Liu, L. Yu, and M. Li. Dynamic adaptive scheduling for virtual machines. In Proceedings of the 20th International Symposium on High Performance Distributed Computing (HPDC), pages 239–250, 2011.
[44] P. Willmann, S. Rixner, and A. L. Cox. Protection strategies for direct access to virtualized I/O devices. In USENIX 2008 Annual Technical Conference, pages 15–28.
[45] S. Xi, J. Wilson, C. Lu, and C. Gill. RT-Xen: towards real-time hypervisor scheduling in Xen. In Proceedings of the 9th ACM International Conference on Embedded Software, pages 39–48, 2011.
[46] C. Xu, S. Gamage, P. N. Rao, A. Kangarlou, R. Kompella, and D. Xu. vSlicer: Latency-aware virtual machine scheduling via differentiated-frequency CPU slicing. In Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing (HPDC), pages 3–14, 2012.