
Adaptive Virtual Machine Management in the Cloud: A Performance-Counter-Driven Approach

Gildo Torres, Department of Electrical and Computer Engineering, Clarkson University, Potsdam, NY, USA

Chen Liu, Department of Electrical and Computer Engineering, Clarkson University, Potsdam, NY, USA

International Journal of Systems and Service-Oriented Engineering, 4(2), 28-43, April-June 2014

DOI: 10.4018/ijssoe.2014040103

Copyright © 2014, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.

Keywords: Cloud Computing, Hardware Performance Counters, Kernel Virtual Machine (KVM), Virtual Machine Management, Virtualization

ABSTRACT

The success of cloud computing technologies heavily depends on both the underlying hardware and system software support for virtualization. In this study, we propose to elevate the capability of the hypervisor to monitor and manage co-running virtual machines (VMs) by capturing their dynamic behavior at runtime and adaptively scheduling and migrating VMs across cores to minimize contention on system resources and hence maximize system throughput. Implemented at the hypervisor level, our proposed scheme requires no changes or adjustments to the VMs themselves or to the applications running inside them, only minimal changes to the host OS, and no changes to existing hardware structures. These facts reduce the complexity of our approach and improve its portability at the same time. The main intuition behind our approach is that because the host OS schedules entire virtual machines, it loses sight of the processes and threads that are running within the VMs; it only sees the averaged resource demands from the past time slice. In our design, we sought to recreate some of this low-level information by using performance counters and simple virtual machine introspection techniques. We implemented an initial prototype on the Kernel Virtual Machine (KVM), and our experimental results show that the presented approach has great potential to improve the overall system throughput in the cloud environment.


INTRODUCTION

Nowadays cloud computing has become pervasive. Common services provided by the cloud include infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS), among others. Cloud computing has begun to transform the way enterprises deploy and manage their infrastructures. It provides the foundation for a truly agile enterprise, so that IT can deliver an infrastructure that is flexible, scalable, and most importantly, economical through efficient resource utilization (Red Hat, 2009).

One of the key enabling technologies for cloud computing is virtualization, which offers users the illusion that their remote machine is running the operating system (OS) of their choice on its own dedicated hardware. Underneath, however, the situation is quite different: OS images (virtual machines) from different users may be running on the same physical server simultaneously.

The success of cloud computing technologies heavily depends on both the underlying hardware support for virtualization, such as Intel VT and AMD-V, and the increase in the number of cores contained in modern multi-core multi-threading microprocessors (MMMP). Because a single Virtual Machine (VM) does not normally use all the hardware resources available on an MMMP, multiple VMs can run simultaneously on the same processor so that resource utilization on the cloud side can be improved. In aggregate, this means increasing the system throughput in terms of the total number of VMs supported by the cloud, and even reducing the energy cost of the cloud infrastructure by consolidating the VMs and turning off the resources (processors) not in use.

Running VMs together, however, introduces contention for shared resources in the MMMP. VMs may compete for computation resources if they are sharing the same core; even if they are running on separate cores, they may be competing for the Last Level Cache (LLC) and memory bandwidth if they are sharing the same die. If not managed carefully, this contention can cause significant performance degradation of the VMs. This defeats the original motivation for running them together, and may violate the service level agreement (SLA) between the customer and the cloud service provider.

Traditionally, load balancing on MMMPs has been under the purview of the OS scheduler. This is still the case in cloud environments that use hosted virtualization such as the Kernel Virtual Machine (KVM). In the case of bare-metal virtualization, the scheduler is implemented as part of the Virtual Machine Monitor (VMM, a.k.a. hypervisor). Regardless of where the scheduler resides, it tries to evenly balance the workload among the existing cores. Normally, these workloads are processes and threads, but in a cloud environment they also include entire virtual machines. (Note that this is the host scheduler; each virtual machine runs its own guest OS, with its own guest scheduler that manages the guest's processes and threads.) The hypervisor is unaware of potential contention for processor resources among the concurrent VMs it is managing. On top of that, the VMs (and the processes/threads within them) exhibit different behaviors at different times during their lifetimes, sometimes being computation-intensive, sometimes memory-intensive, sometimes I/O-intensive, and at other times following a mixed behavior. The fundamental challenge is the semantic gap, i.e., the hypervisor is unaware of when and which guest processes and threads are running. Facing this challenge, we propose to elevate the capability of the hypervisor to monitor and manage the co-running VMs by capturing their dynamic behavior at runtime using hardware performance counters. Once a performance profile and model have been obtained and computational phases determined, we adaptively schedule and migrate VMs across cores according to the predicted phases (as opposed to process and thread boundaries) to minimize the contention on system resources and hence maximize the system throughput.


The rest of this paper is organized as follows. A review of related published work is presented in the Related Work section. The Proposed Scheme section describes the default Virtual Machine Monitor architecture, introduces the hardware performance counters and some events of interest, and presents our proposed architecture. Next, the Experiment Setup section introduces the hardware and software setups, benchmarks, and workloads created to test the proposed scheme. The experiments conducted and their respective results are presented in the Experimental Results section. Finally, conclusions and future work are drawn in the Conclusion section.

RELATED WORK

In this section we describe previous work related to resource contention in MMMPs, OS thread scheduling that considers resource demands, and work targeting resource management in the cloud environment.

Fedorova (2006) described the design and implementation of three scheduling algorithms for chip multithreaded processors that target contention for the second-level cache. The experimental results were obtained from a simulated dual-core processor based on the UltraSPARC T1 architecture. That work also studied the effects of L2 cache contention on performance for multi-core processors.

Knauerhase, Brett, Hohlt, Li, and Hahn (2008) showed how runtime observations performed by the OS about threads' behavior can be used to ameliorate performance variability and more effectively exploit multi-core processor resources. This work discusses the idea of distributing benchmarks with a high rate of off-chip requests across different shared caches. The authors proposed to reduce cache interference by spreading the intensive applications apart and co-scheduling them with non-intensive applications. They used cache misses per cycle as the metric for measuring intensity.

Shelepov et al. (2009) described a scheduling technique for heterogeneous multi-core systems that matches threads to cores using per-thread architectural signatures. These signatures are compact summaries of threads' architectural properties collected offline. Their technique is unable to adapt to thread phase changes. It appears to be effective for a reduced set of test cases; however, it does not scale well, since the complexity of offline thread profiling becomes very high as the number of threads increases.

Zhuravlev, Blagodurov, and Fedorova (2010) and Blagodurov, Zhuravlev, and Fedorova (2010) conducted a study in which they analyzed the factors defining performance in MMMPs and quantified how much performance degradation can be attributed to contention for each shared resource in multi-core systems. They provided a comprehensive analysis of contention-mitigating techniques that only use scheduling, and developed scheduling algorithms that schedule threads such that the miss rate is evenly distributed among the caches. The authors concluded that the highest impact of contention-aware scheduling techniques is not in improving the performance of a workload as a whole but in improving quality of service or performance isolation for individual applications and in optimizing system energy consumption. They found that LLC miss rates turn out to be a very good heuristic for contention for the resources affecting performance in MMMP systems.

Blagodurov and Fedorova (2011) introduced the Clavis scheduler, a user-level application designed to test the efficiency of user-level scheduling algorithms for the NUMA multi-core systems available in Linux. It monitors workload execution through hardware performance counters, gathers all the necessary information for making a scheduling decision, passes it to the scheduling algorithm, and enforces the algorithm's decision.

Lugini, Petrucci, and Mosse (2012) proposed the PATA (Performance-Asymmetric Thread Assignment) algorithm, an online thread-to-core assignment policy for heterogeneous multi-core systems. The PATA algorithm works without prior knowledge of the target system and running workloads and makes thread-to-core assignment decisions based on the threads' IPS (instructions committed per second). This solution is proposed to be implemented on top of existing OS scheduling algorithms. Experimental results showed improvements in workload execution over Linux's scheduler and another online IPS-driven scheduler.

Petrucci, Loques, Mosse, Melhem, Gazala, and Gobriel (2012) proposed an integer linear programming model for heterogeneous multi-core systems using thread phase classification based on combined IPS and LLC misses. Data collected offline was then used to determine the best thread scheduling for power savings while meeting thread performance and memory bandwidth guarantees with real-time requirements.

Although the previous work discussed so far covers some of the fundamental aspects of our work, it is worth emphasizing that it entirely targets standard workloads managed by the OS scheduler. None of these studies is oriented towards the cloud environment.

Other works have been published in the area of VM resource management. Nathuji, Kansal, and Ghaffarkhah (2010) developed Q-Clouds, a QoS-aware control framework that tunes resource allocations to mitigate performance interference effects. Q-Clouds uses online feedback to build a multi-input multi-output (MIMO) model that captures performance interference interactions, and uses it to perform closed-loop resource management. Wang, Xu, and Zhao (2012) also proposed to establish a MIMO performance model for co-hosted VMs in order to capture their coupling behavior using a fuzzy model. Based on their model, the level of contention on the competing resources is quantified by the model parameters and later used to assist VM placement and resource allocation decisions. Although these works (Nathuji et al., 2010; Wang et al., 2012) attempted to manage resources in the cloud environment, their effects reside at the system level.

Arteaga, Zhao, Liu, Thanarungroj, and Weng (2010) proposed a cooperative VM scheduling approach that allows a software-level VM scheduler and a hardware-level thread scheduler to cooperate and optimize the allocation of MMMP resources to VMs. They present an experiment-based feasibility study which confirms the effectiveness of processor-contention-aware VM scheduling.

Weng and Liu (2010) and Weng, Liu, and Gaudiot (2013) introduced a scheme that aimed to support a dynamic, adaptive and scalable operating system scheduling policy for MMMP. They proposed architectural (hardware) strategies to construct linear models to capture workload behaviors and then schedule threads according to their resource demands. They employed regression models to ensure that the scheduling policy is capable of responding to the changing behaviors of threads during execution. Compared with the static scheduling approach, their phase-triggered scheduling policy achieved up to 29% speedup.

Our work integrates several aspects of the previous research mentioned above and applies them to the cloud environment. We follow the approach of scheduling VMs according to their different resource demands in order to mitigate the contention on shared processor resources between concurrent VMs (Weng et al., 2013; Arteaga et al., 2010). One significant aspect of our approach is that it acts online at runtime and does not require a priori information about the VMs. The information pertaining to each VM begins to be collected only after it has been launched, and so does its behavioral model. These features greatly improve over previous offline profiling-based approaches. In addition, our approach is autonomous and executes transparently to the VMs, because it does not require any changes or adjustments to the VMs themselves or the applications running inside them, nor to the host OS. It also does not require any changes to existing hardware structures. These facts allow the complexity of applying our scheme to be kept very low, while at the same time improving its portability.

PROPOSED SCHEME

Default Virtual Machine Monitor Architecture

As mentioned earlier, in a virtualized environment the hypervisor is responsible for creating and running the virtual machines. We implemented our initial prototype on the Kernel Virtual Machine (KVM) hypervisor.

KVM (http://www.linux-kvm.org/) is a full virtualization solution for Linux that can run unmodified guest images. It has been included in the mainline Linux kernel since version 2.6.20 and is implemented as a loadable kernel module that converts the Linux kernel into a bare-metal hypervisor. KVM relies on CPUs with hardware virtualization extensions, such as Intel VT-x or AMD-V, leveraging those features to virtualize the CPU.

In the KVM architecture, the VMs are mapped to regular Linux processes (i.e., QEMU processes) and are scheduled by the standard Linux scheduler. This allows KVM to benefit from all the features of the Linux kernel, such as memory management, hardware device drivers, etc. Device emulation is handled by QEMU, which provides an emulated BIOS, PCI bus, USB bus, and a standard set of devices such as IDE and SCSI disk controllers, network cards, etc. (Red Hat, 2009). Figure 1(a) shows the default system architecture, with KVM sitting between the VMs and the physical hardware, at the host OS level.

Figure 1. Overall system architecture

Hardware Performance Counters

Hardware Performance Counters (HPCs) are special hardware registers available on most modern processors. These registers can be used to count the number of occurrences of certain types of hardware events, such as instructions executed, cache misses suffered, and branches mispredicted. These events are counted without slowing down the kernel or applications because the counters use dedicated hardware that does not incur any additional overhead. Although originally implemented for debugging hardware designs during development, performance tuning, or identifying bottlenecks in program execution, nowadays they are widely used for gathering runtime information about programs and for performance analysis (Weaver & McKee, 2008; Bandyopadhyay, 2010).

The types and number of available events to track and the methodologies for using these performance counters vary widely, not only across architectures, but also across systems sharing an ISA. Depending on the model of the microprocessor, they may include 2, 4, 6, 8 or even more of these counters. Intel’s most modern processors offer more than a hundred different events that can be monitored (Intel, 2013). Therefore, it is up to the programmer to select which events to monitor and set the configuration registers appropriately.

Several profiling tools are available to access HPCs on different microprocessor families. They are relatively simple to use and usually allow monitoring a larger number of unique events than the number of physical counters available in the processor. This is possible using round-robin scheduling of the monitored events through multiplexing and estimation. However, the results are presented only after the execution of the target application completes.

Since online profiling is required by our scheme, we did not use any of the available profiling tools. We directly configure and access the HPCs via simple RDMSR/WRMSR operations (Intel, 2013). This fact also contributes to the autonomy of the proposed scheme.
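As an illustration only (the paper's counter management lives inside the modified hypervisor, and the exact MSRs and event encodings depend on the processor), the following user-space C sketch shows how a counter can be programmed and read through RDMSR/WRMSR-style accesses using the Linux msr driver (/dev/cpu/N/msr). The MSR addresses and the LLC-miss event encoding follow the Intel SDM's architectural performance events; treat them as assumptions to verify for a given CPU.

```c
/*
 * Illustrative sketch (not the authors' in-kernel code): programming one
 * general-purpose performance counter to count LLC misses via the Linux
 * msr driver (/dev/cpu/N/msr), which exposes RDMSR/WRMSR to user space.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define IA32_PERFEVTSEL0      0x186  /* event-select register for counter 0 */
#define IA32_PMC0             0x0C1  /* general-purpose counter 0           */
#define IA32_PERF_GLOBAL_CTRL 0x38F  /* global enable (arch. perfmon v2+)   */

static int open_msr(int cpu)
{
    char path[64];
    snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
    return open(path, O_RDWR);       /* needs root and the msr module */
}

static int wrmsr(int fd, uint32_t reg, uint64_t val)
{
    return pwrite(fd, &val, sizeof(val), reg) == sizeof(val) ? 0 : -1;
}

static int rdmsr(int fd, uint32_t reg, uint64_t *val)
{
    return pread(fd, val, sizeof(*val), reg) == sizeof(*val) ? 0 : -1;
}

int main(void)
{
    int fd = open_msr(0);
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

    /* Architectural "LLC Misses" event: event 0x2E, umask 0x41; count in
     * user+kernel mode and enable (USR = bit 16, OS = bit 17, EN = bit 22). */
    uint64_t evtsel = 0x2E | (0x41 << 8) | (1ULL << 16) | (1ULL << 17) | (1ULL << 22);

    wrmsr(fd, IA32_PMC0, 0);                    /* clear the counter      */
    wrmsr(fd, IA32_PERFEVTSEL0, evtsel);        /* select + arm the event */
    wrmsr(fd, IA32_PERF_GLOBAL_CTRL, 0x1);      /* globally enable PMC0   */

    sleep(1);                                   /* sampling window        */

    uint64_t llc_misses = 0;
    rdmsr(fd, IA32_PMC0, &llc_misses);
    printf("LLC misses on CPU 0: %llu\n", (unsigned long long)llc_misses);

    close(fd);
    return 0;
}
```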

Proposed Architecture

As shown in Figure 1, from the host's perspective there is no difference between a regular process and a virtual machine. We propose to assist the hypervisor to adaptively monitor and model each VM's behavior separately from that of regular processes. Then, given the models, we assist the host OS in managing the co-running VMs by providing hints to the scheduler as to which VM should be scheduled on which core to minimize resource contention.

The proposed architecture is based on the following key aspects:

• Utilize Hardware Performance Counters (HPCs) that exist in most modern microprocessors to collect runtime information on each VM.

• The information collected from the HPCs is used to derive the resource demand of each VM and construct a dynamic model to capture their runtime behavior.

• With the VMs' behavioral information obtained above, and after a comprehensive evaluation of their resource demands, our scheme (the modified hypervisor) generates a VM-to-core mapping that minimizes the contention for resources among the VMs.

• This mapping is then committed by the host OS scheduler.

Figure 1(b) illustrates the overall layout of the proposed scheme. It shows the runtime monitoring module added to KVM getting information from the HPCs. Figure 2 shows a block diagram containing the basic steps performed by our runtime monitoring module added to KVM.

Hardware Events and Runtime Model

Extensive work has been done on identifying which shared resources affect performance the most in MMMPs (Weng & Liu, 2010; Weng et al., 2013; Zhuravlev et al., 2010; Cazorla et al., 2006; Cheng, Lin, Li, & Yang, 2010). At this moment, completing an extensive study in this area is not part of the objectives of this work. This paper follows the approach presented by Weng and Liu (2010) and Weng et al. (2013) and adopts Last Level Cache (LLC) misses as an essential indicator of the demand on shared resources in MMMPs. The work we present here is performed mainly at the hypervisor level, which is a different level from Weng and Liu (2010) and Weng et al. (2013). Whereas that previous work modeled the resource demands of threads into phases, our current work simply treats entire VMs as "threads".

Figure 2. General scheme

The hardware events we monitor are LLC misses, L1 misses, and committed instructions. We create a simple linear model for evaluating the feasibility of the proposed approach. The model is based on the metrics known as Misses Per Kilo Instructions (MPKI) and Misses Per Million Instructions (MPMI), for both the LLC and the L1 cache (Cheng et al., 2010). Depending on the MPKI and MPMI values, we classify the VMs into two categories: computation-intensive (for lower values) and memory-intensive (for higher values).
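For reference, since the paper does not write them out explicitly, these metrics follow the usual definitions:

\[
\mathrm{MPKI} = \frac{\text{cache misses}}{\text{committed instructions}} \times 10^{3},
\qquad
\mathrm{MPMI} = \frac{\text{cache misses}}{\text{committed instructions}} \times 10^{6}
\]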

The proposed scheme continuously monitors and profiles the running VMs. It programs the HPCs to start counting right before giving control to a VM (VM-entry) and records the counters' values upon getting control back from the running VM (VM-exit). The values collected from the hardware counters are recorded as samples, which are later used to calculate the target metric (MPKI or MPMI). This implementation allows our scheme to collect detailed information from each VM while still incurring very low overhead.
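The per-VM bookkeeping can be pictured with the following self-contained C sketch: counter deltas captured between VM-entry and VM-exit are accumulated as samples, and once a window of samples is full an LLC MPKI value is computed and the VM is classified. The structure, the window size, the 5-MPKI threshold, and the sample values are illustrative assumptions, not values taken from the paper.

```c
/* Toy sketch of per-VM sampling: deltas recorded between VM-entry and
 * VM-exit are accumulated; every SAMPLES_PER_DECISION samples the LLC
 * MPKI is computed and the VM is classified. Values are illustrative. */
#include <stdint.h>
#include <stdio.h>

#define SAMPLES_PER_DECISION 4     /* "a parameter of our scheme"        */
#define MPKI_THRESHOLD       5.0   /* illustrative classification cutoff */

struct vm_stats {
    uint64_t llc_misses;
    uint64_t instructions;
    unsigned nr_samples;
    int      memory_intensive;     /* 1 = memory-intensive, 0 = computation-intensive */
};

/* Called with the counter deltas observed between VM-entry and VM-exit. */
static void record_sample(struct vm_stats *vm, uint64_t llc, uint64_t insn)
{
    vm->llc_misses   += llc;
    vm->instructions += insn;

    if (++vm->nr_samples < SAMPLES_PER_DECISION)
        return;

    if (vm->instructions > 0) {
        double mpki = 1000.0 * (double)vm->llc_misses / (double)vm->instructions;
        vm->memory_intensive = (mpki > MPKI_THRESHOLD);
        printf("LLC MPKI = %.2f -> %s\n", mpki,
               vm->memory_intensive ? "memory-intensive" : "computation-intensive");
    }
    vm->llc_misses = vm->instructions = 0;   /* start a new window */
    vm->nr_samples = 0;
}

int main(void)
{
    struct vm_stats vm = {0};
    /* Pretend four VM-exits delivered these made-up counter deltas. */
    record_sample(&vm, 120000,  8000000);
    record_sample(&vm,  90000,  7500000);
    record_sample(&vm, 150000,  9000000);
    record_sample(&vm, 110000, 10000000);
    return 0;
}
```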

After a certain number of samples (this value is a parameter of our scheme), the MPKI/MPMI value for each VM is calculated and the VMs are classified as computation-intensive or memory-intensive. Next, our scheme determines the VM-to-core mapping following the mix-scheduling criteria presented by Weng et al. (2013). Such a mapping pairs the VMs in such a way that the difference in LLC misses is maximized among VMs mapped to the same core. In other words, our method groups together VMs with different processor usage behaviors (categories) in order to reduce the contention on shared processor resources among concurrent VMs, separating VMs with similar LLC miss levels.
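A minimal sketch of this mix-scheduling pairing for a two-core platform could look as follows; the vm_profile structure, the example MPKI values, and the even-number-of-VMs assumption are ours, not the authors'. The idea is to sort the VMs by LLC MPKI and place the i-th least intensive VM on the same core as the i-th most intensive one.

```c
/* Illustrative sketch of mix-scheduling: pair VMs so that high-MPKI and
 * low-MPKI VMs share a core, maximizing the MPKI difference per core.
 * Assumes an even number of VMs with IDs 0..n-1. */
#include <stdio.h>
#include <stdlib.h>

struct vm_profile {
    int    vm_id;
    double llc_mpki;   /* LLC misses per kilo-instruction over the last window */
};

static int by_mpki(const void *a, const void *b)
{
    double d = ((const struct vm_profile *)a)->llc_mpki -
               ((const struct vm_profile *)b)->llc_mpki;
    return (d > 0) - (d < 0);
}

/* Assign n VMs to n/2 cores: the i-th least intensive VM is paired with
 * the i-th most intensive VM on core i. */
static void mix_schedule(struct vm_profile *vms, int n, int *core_of_vm)
{
    qsort(vms, n, sizeof(vms[0]), by_mpki);
    for (int i = 0; i < n / 2; i++) {
        core_of_vm[vms[i].vm_id]         = i;   /* low-MPKI VM  */
        core_of_vm[vms[n - 1 - i].vm_id] = i;   /* high-MPKI VM */
    }
}

int main(void)
{
    struct vm_profile vms[4] = {
        {0, 18.7}, {1, 0.9}, {2, 15.2}, {3, 1.4}   /* example MPKI values */
    };
    int core_of_vm[4];

    mix_schedule(vms, 4, core_of_vm);
    for (int v = 0; v < 4; v++)
        printf("VM %d -> core %d\n", v, core_of_vm[v]);
    return 0;
}
```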

EXPERIMENT SETUP

This section introduces the hardware and software setups, the benchmarks used in this work, and the distinct workloads created to test the proposed scheme.

Hardware Environment

The hardware platform we used for this work is a physical machine (laptop) with an Intel Core 2 Duo CPU, Model T9400 running at 2.53GHz. This processor has two cores with private 64KB L1 (32KB Data + 32KB Instruction) caches each and 6MB L2 shared cache (LLC). It does not support hyper-threading (HT) technology. Therefore, only two threads can be executed simultaneously.

Even though this is not the ideal hardware platform, it is what was available to us at the time this research was started and the experiments presented here were conducted. An immediate extension of this research is to test our scheme using more advanced MMMP processors housing a larger number of cores as well as equipped with multi-threading technology.

Software Environment

Our software setup consists of the following:

• Host OS: Ubuntu 13.04 (64-bit), kernel 3.8.0-22-generic

• Hypervisor: Kernel-based Virtual Machine (KVM), kernel version 3.8.0-22-generic (same as the host)

• Guest OS: Xubuntu 12.04 (32-bit), kernel 3.2.0-37-generic

Benchmarks

In our study we used the SPEC CPU2006 benchmark suite (Henning, 2006) to construct different types of workloads for the VMs. We performed offline profiling of different benchmarks in order to get their LLC behavior for generating balanced and unbalanced workloads. Based on the profiling information we obtained, we selected the benchmarks milc and sphinx3. Both of them are floating-point benchmarks, representing high and low LLC MPKI respectively.

For our experiments, we created four different workloads that represent different combinations of high and low LLC MPKI. Each workload consists of 4 VMs, with each VM running one benchmark inside. Table 1 lists the construction of the workloads and what they represent in terms of LLC misses; "high" means the level of LLC misses is high (memory-intensive) and "low" means the level of LLC misses is low (computation-intensive).

Table 1. Virtual machines workloads

EXPERIMENTAL RESULTS

In this section, we describe the experiments conducted and present the results obtained. These experiments aim to provide an initial assessment of the feasibility of the proposed scheme for adaptive VM management in the cloud.

A key objective for this work is to test and validate the fact that neither the OS scheduler nor the hypervisor is aware of the potential contention for processor resources among concurrent VMs. Therefore, we tested all four workloads with three different schemes and measured their performance. First, we ran the experiments with the default OS scheduler. Second, we tested our proposed approach (VM-mix-scheduling), following the scheduling criteria explained previously, which is to schedule VMs with different resource demands onto the same core. Third, we tested the opposite of the proposed scheduling criteria (VM-mono-scheduling), which is to allocate VMs of the same LLC miss category onto the same core.

Figure 3. System throughput (VMs)

Since each benchmark has a different running time, in order to mitigate this disparity when creating the workloads, and based on empirical trials, we ran each benchmark several times (in a loop whose iteration count is different for each benchmark) in an effort to keep the total running time the same across the different VMs.

In our experiments, we identified each VM as memory-intensive or computation-intensive based on its LLC misses. Then, after the hypervisor (KVM) generates the VM-to-core mapping, it only manipulates the affinity of each VM on the host machine, which lets the OS scheduler perform the migration for each VM that requires it. It is worth mentioning that in our approach we "prioritize" the VMs over the rest of the host processes, because we assume that in a server hosting multiple VMs (as well as in the cloud), such VMs have the highest priority in terms of throughput. Basically, the affinity decisions made by the proposed scheme are based solely on the VMs, while the host OS scheduler's decisions are based on the VMs and all other processes running on the host.
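For illustration, the affinity manipulation on the host can be sketched with the standard Linux sched_setaffinity() call; in the prototype this is driven from the modified hypervisor, whereas the stand-alone program below simply takes a QEMU thread ID and a target core as placeholder command-line arguments.

```c
/* Illustrative sketch: pin one VM thread (e.g., a QEMU vCPU thread) to a
 * given core by changing its CPU affinity; the host scheduler then performs
 * any needed migration. To move an entire VM, the same call would be applied
 * to every thread listed under /proc/<pid>/task/. The TID and core number
 * below are placeholders supplied on the command line. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

static int pin_thread_to_core(pid_t tid, int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(tid, sizeof(set), &set);
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <qemu-tid> <core>\n", argv[0]);
        return 1;
    }
    pid_t tid  = (pid_t)atoi(argv[1]);
    int   core = atoi(argv[2]);

    if (pin_thread_to_core(tid, core) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("VM thread %d pinned to core %d\n", (int)tid, core);
    return 0;
}
```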

Figure 4. Throughput improvement of VM-mix-scheduling normalized to OS default


For this set of experiments, the hypervisor makes VM-to-core mapping decisions approximately every 30 seconds for simplicity's sake (this time directly depends on the number of samples collected between updates to each VM's MPKI/MPMI value). This is an important factor to keep in mind because it may affect the total overhead incurred by the scheme. If the overhead is too big, it may have a counterproductive effect on the system throughput. Furthermore, in practice the time intervals between mapping decisions should be dynamic, based on the resource demand models, although the overhead of updating the models does impose a minimum interval. Finding the impact of different time intervals on the effectiveness of our proposed approach is the subject of our next research step.

Figure 3 shows the throughput results of the three evaluated schemes for all four workloads. It can be noticed that the VM-mix-scheduling scheme performs slightly better than the OS default for the first three workloads, and better than VM-mono-scheduling in all cases. For the fourth workload, the OS default scheduler outperforms the other two schemes. Upon further analysis, we believe that this is expected, because with the proposed scheme the low-LLC-miss VM will be swapped with a high-LLC-miss VM at every single decision point. For instance, assume that the current mapping of VMs to cores is {L1, H2} and {H3, H4}, where the subscripts denote the VM number. At the next decision point, VM 2 will have a lower MPKI than VMs 3 and 4, since it was paired with VM 1. According to the model, VM 2 will be swapped with either VM 3 or VM 4 in order to ensure that the average MPKI is minimized. This in turn forces a complete cache flush on both cores, which will increase the MPKIs for all VMs. Once again, better models and adaptive decision intervals are left as future work.

Figure 5. Average VM speedup of VM-mix-scheduling over OS default

Figure 4 shows the throughput improvement obtained with the VM-mix-scheduling scheme, normalized to that of the OS default. Figure 5 shows the average speedup per VM of the VM-mix-scheduling scheme over the OS default. The latter metric is calculated by averaging the performance benefits obtained for each VM in each individual workload.
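Assuming the per-VM benefit is the ratio of a VM's completion time under the OS default to its completion time under VM-mix-scheduling (the paper does not give the exact expression), the average VM speedup for a workload of N VMs would be:

\[
\text{Average VM speedup} = \frac{1}{N}\sum_{i=1}^{N}\frac{T_i^{\,\text{OS default}}}{T_i^{\,\text{mix}}}
\]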

The throughput improvement achieved by the VM-mix-scheduling scheme compared to the OS default is modest in all cases. As we can see, our scheme works best when all VMs are memory-intensive, with a throughput improvement of 3.6% and an average VM speedup of 4.6% over the OS default scheme. The performance data from Workload 4 is somewhat discouraging; however, making the model aware of this type of scenario may avoid the behavior described above. Overall, we believe that these results serve as an indicator that, indeed, a hypervisor that is aware of the resource contention existing among co-running VMs can improve the overall system throughput in the cloud environment.

On the hardware platform used for this work, VMs running on separate cores do not contend for computational resources, since each has a complete physical core to itself. They mainly compete for off-core resources like the LLC, the memory controller, etc. We think that this fact partially limits the real benefits that could be obtained from applying our proposed scheme. As mentioned earlier, an immediate extension of this research is to expand our experiments to more advanced MMMP processors containing a larger number of cores that share both on-chip and off-chip resources (like hyper-threaded processors).

CONCLUSION

Over the last few years there has been an exponential growth in the cloud computing market. The success of cloud computing technologies heavily depends on both the underlying hardware and system software support for virtualization. For maintaining such rates of growth, it is vital to efficiently utilize the available hardware resources by understanding the contention on shared processor resources among co-running VMs, to mitigate performance degradation and system throughput penalties.

In this article, we presented an adaptive scheduling scheme for virtual machine management in the cloud following a performance-counter-driven approach. Our proposed approach acts online at runtime and does not require a priori information about the VMs, improving over offline profiling-based approaches. The proposed scheme is also transparent: it executes behind the scenes and does not require intervention from the VMs, the users of the VMs, or the core operating system kernel. It creates and maintains a dynamic model to capture each VM's runtime behavior and manages VM-to-core mappings to minimize contention. Our experimental results showed that the proposed scheme achieved a modest improvement in overall system (VM) throughput as a result of mitigating resource contention among VMs.

Future stages of this work include improving the hardware platform to one with multi-threading technology, as well as conducting more comprehensive hardware event monitoring. At this point, our current implementation is coarse-grained in the way it categorizes each VM. We plan to direct our efforts towards enhancing the current design with dynamic phase detection and prediction capability. At a more advanced stage, we would like to extend the implementation of our scheme to cover non-uniform memory access (NUMA) and heterogeneous architectures as well.


ACKNOWLEDGMENT

This work is supported under the 2013 Visiting Faculty Research Program by the Air Force Research Laboratory. We would like to thank our mentor Steve for his guidance and our team members Lok, Sergay and Matt for their constructive discussions that made this work possible. We also want to thank the reviewers for their comments and feedback. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Air Force Research Laboratory.


REFERENCES

Arteaga, D., Zhao, M., Liu, C., Thanarungroj, P., & Weng, L. (2010). Cooperative virtual machine scheduling on multi-core multi-threading systems - A feasibility study. In Proceedings of the Workshop on Micro Architectural Support for Virtualization, Data Center Computing, and Cloud.

Bandyopadhyay, S. (2010). A study on performance monitoring counters in x86-architecture. Indian Statistical Institute. Retrieved December 5, 2013, from http://www.cise.ufl.edu/~sb3/files/pmc.pdf

Blagodurov, S., & Fedorova, A. (2011). User-level scheduling on NUMA multicore systems under Linux. In Proc. of Linux Symposium.

Blagodurov, S., Zhuravlev, S., & Fedorova, A. (2010). Contention-aware scheduling on multicore systems. ACM Trans. Comput. Syst., 28(4), 8:1-8:45. doi:10.1145/1880018.1880019

Cazorla, F., Knijnenburg, P. M. W., Sakellariou, R., Fernandez, E., Ramirez, A., & Valero, M. (2006). Predictable performance in SMT processors: Synergy between the OS and SMTs. IEEE Transactions on Computers, 55(7), 785–799. doi:10.1109/TC.2006.108

Cheng, H.-Y., Lin, C.-H., Li, J., & Yang, C.-L. (2010). Memory latency reduction via thread throttling. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) (pp. 53-64). doi:10.1109/MICRO.2010.39

Fedorova, A. (2006). Operating system scheduling for chip multithreaded processors. Unpublished doctoral dissertation, Harvard University. Retrieved December 5, 2013, from http://www.cs.sfu.ca/~fedorova/thesis.pdf

Henning, J. L. (2006). SPEC CPU2006 benchmark descriptions. SIGARCH Comput. Archit. News, 34(4), 1–17. doi:10.1145/1186736.1186737

Intel. (2013). Intel 64 and IA-32 architectures software developer’s manual. Retrieved December 5, 2013, from http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf

Knauerhase, R., Brett, P., Hohlt, B., Li, T., & Hahn, S. (2008). Using OS observations to improve performance in multicore systems. IEEE Micro, 28(3), 54–66. doi:10.1109/MM.2008.48

Lugini, L., Petrucci, V., & Mosse, D. (2012). Online thread assignment for heterogeneous multicore systems. In Proceedings of the 2012 41st International Conference on Parallel Processing Workshops (pp. 538-544). doi:10.1109/ICPPW.2012.73

Nathuji, R., Kansal, A., & Ghaffarkhah, A. (2010). Q-clouds: Managing performance interference effects for QoS-aware clouds. In Proceedings of the 5th European Conference on Computer Systems (pp. 237-250). ACM. ISBN: 978-1-60558-577-2. doi:10.1145/1755913.1755938

Petrucci, V., Loques, O., Mosse, D., Melhem, R., Gazala, N., & Gobriel, S. (2012). Thread assignment optimization with real-time performance and memory bandwidth guarantees for energy-efficient heterogeneous multi-core systems. In Proceedings of the 2012 IEEE 18th Real-Time and Embedded Technology and Applications Symposium (RTAS) (pp. 263-272). doi:10.1109/RTAS.2012.13

Red Hat, I. (2009). KVM – kernel based virtual machine. Red Hat, Inc. Retrieved December 5, 2013, from http://www.redhat.com/rhecm/rest-rhecm/jcr/repository/collaboration/jcr:system/jcr:versionStorage/5e7884ed7f00000102c317385572f1b1/1/jcr:frozenNode/rh:pdfFile.pdf

Red Hat, I. (2013). Kernel based virtual machine (KVM). Retrieved from http://www.linux-kvm.org/

Shelepov, D., Saez Alcaide, J. C., Jeffery, S., Fedorova, A., Perez, N., & Huang, Z. F. et al. (2009). HASS: A scheduler for heterogeneous multicore systems. SIGOPS Oper. Syst. Rev., 43(2), 66–75. doi:10.1145/1531793.1531804

Wang, L., Xu, J., & Zhao, M. (2012). Modeling VM performance interference with fuzzy MIMO model. In Proceedings of the 7th International Workshop on Feedback Computing (FeedbackComputing, co-held with ICAC2012).

Weaver, V., & McKee, S. (2008). Can hardware performance counters be trusted? In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC 2008) (pp. 141-150). doi:10.1109/IISWC.2008.4636099


Weng, L., & Liu, C. (2010). On better performance from scheduling threads according to resource demands in MMMP. In Proceedings of the 2010 39th International Conference on Parallel Processing Workshops (ICPPW) (pp. 339-345). doi:10.1109/ICPPW.2010.53

Weng, L., Liu, C., & Gaudiot, J.-L. (2013). Scheduling optimization in multicore multithreaded microprocessors through dynamic modeling. In Proceedings of the ACM International Conference on Computing Frontiers (pp. 5:1-5:10). ACM. ISBN: 978-1-4503-2053-5. doi:10.1145/2482767.2482774

Zhuravlev, S., Blagodurov, S., & Fedorova, A. (2010). Addressing shared resource contention in multicore processors via scheduling. In Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems (pp. 129-142). ACM. ISBN: 978-1-60558-839-1. doi:10.1145/1736020.1736036


Gildo Torres received his B.S. degree in Electrical Engineering in 2006 from the Higher Polytechnic Institute of Havana, Cuba. He received his M.S. degree in Computer Engineering in May 2012 from Florida International University. Currently, Gildo is a second-year Ph.D. student in the Department of Electrical and Computer Engineering at Clarkson University. He conducts research in the area of computer architecture as a member of the Computer Architecture and Microprocessor Engineering Lab (CAMEL) at Clarkson University. He is a student member of IEEE and ACM.

Chen Liu received the B.E. degree in Electronics and Information Engineering from the University of Science and Technology of China in 2000, the M.S. degree in Electrical Engineering from the University of California, Riverside in 2002, and the Ph.D. degree in Electrical and Computer Engineering from the University of California, Irvine in 2008, respectively. Currently he is an assistant professor in the Department of Electrical and Computer Engineering at Clarkson University, New York, USA. His research interests are in the area of multi-core multi-threading architecture, hardware acceleration for scientific computing, and the interaction between system software and micro-architecture. He is a member of IEEE, ACM, and ASEE.