
Page 1:

Cooperative VM Migration for a virtualized HPC Cluster with VMM-bypass I/O devices

IEEE eScience 2012, Oct. 11, 2012, Chicago

Ryousei Takano, Hidemoto Nakada, Takahiro Hirofuchi, Yoshio Tanaka, and Tomohiro Kudoh

Information Technology Research Institute,
National Institute of Advanced Industrial Science and Technology (AIST), Japan

Page 2:

Background

•  HPC cloud is a promising e-Science platform.
   –  HPC users are beginning to take an interest in cloud computing, e.g., Amazon EC2 Cluster Compute Instances.
•  Virtualization is a key technology.
   –  Pro: it makes migration of computing elements easy.
      •  VM migration is useful for achieving fault tolerance, server consolidation, etc.
   –  Con: it introduces a large overhead, spoiling I/O performance.
•  VMM-bypass I/O technologies, e.g., PCI passthrough and SR-IOV, can significantly mitigate the overhead.

However, VMM-bypass I/O makes it impossible to migrate a VM.

Page 3:

Contribution

•  Goal:
   –  To realize VM migration and checkpoint/restart on a virtualized cluster with VMM-bypass I/O devices, e.g., VM migration on an InfiniBand cluster.
•  Contributions:
   –  We propose cooperative VM migration based on the Symbiotic Virtualization (SymVirt) mechanism.
   –  We demonstrate reactive/proactive fault-tolerant (FT) systems.
   –  We show that postcopy migration helps to reduce the service downtime in the proactive FT system.

Page 4:

Agenda

•  Background and Motivation
•  SymVirt: Symbiotic Virtualization Mechanism
•  Experiment
•  Related Work
•  Conclusion and Future Work

Page 5:

Motivating Observation

•  Performance evaluation of HPC cloud:
   –  (Para-)virtualized I/O incurs a large overhead.
   –  PCI passthrough significantly mitigates the overhead.

[Figure: The overhead of I/O virtualization on the NAS Parallel Benchmarks 3.3.1, class C, 64 processes. Execution time in seconds for BT, CG, EP, FT, and LU under four configurations: BMM (IB), BMM (10GbE), KVM (IB), and KVM (virtio). BMM: Bare Metal Machine. In the KVM (virtio) setup, VM traffic passes through a guest driver and the VMM's physical driver to a 10GbE NIC; in the KVM (IB) setup, the guest's physical driver accesses the IB QDR HCA directly, bypassing the VMM.]

Page 6:

[Figure: Three I/O configurations. Para-virtualized device (virtio_net): each VM's guest driver connects through a vSwitch and the VMM's physical driver to the NIC. PCI passthrough: a single VM's guest OS runs the physical driver and owns the NIC directly. SR-IOV: each VM's physical driver attaches to a virtual function of the NIC, which shares the device through an embedded switch (VEB).]

VMM-Bypass I/O comparison:

                    Para-virt device   PCI passthrough   SR-IOV
  Performance             poor               good          good
  Device sharing          yes                no            yes
  VM migration            yes                no            no

We address this issue: VM migration with VMM-bypass I/O!

Page 7:

Problem

VMM-bypass I/O technologies make VM migration and checkpoint/restart impossible:

1.  The VMM does not know when VMM-bypass I/O devices can safely be detached.
    •  To migrate without losing in-flight data, packet transmission to/from the VM must be stopped before detaching the device.
    •  From the VMM, it is hard to know the communication status of an application inside the VM, especially when VMM-bypass I/O devices are used.
2.  The VMM cannot migrate the state of VMM-bypass I/O devices from the source to the destination.
    •  With InfiniBand: Local IDs (LIDs), Queue Pair Numbers (QPNs), etc.

Page 8:

Goal

[Figure: Two InfiniBand clusters (Cluster 1 and Cluster 2) connected by Ethernet, hosting VM1-VM4. A VM is migrated across clusters by detaching its InfiniBand device, migrating over Ethernet, and re-attaching a device at the destination.]

We need a mechanism that combines VM migration with PCI device hot-plugging.

Page 9:

SymVirt: Symbiotic Virtualization

Page 10:

Cooperative VM Migration

[Figure: Existing VM migration (black-box approach): migration, global coordination, and device setup are handled entirely by the VMM beneath an unmodified application and guest OS; pro: portability. Cooperative VM migration (gray-box approach): the application and guest OS cooperate with the VMM on migration, global coordination, and device setup, which allows VMM-bypass I/O to the NIC; pro: performance.]

Page 11:

SymVirt: Symbiotic Virtualization

•  We focus on MPI programs.
•  We design and implement a symbiotic virtualization (SymVirt) mechanism.
   –  It is a cross-layer mechanism between a VMM and an MPI runtime system.

[Figure: SymVirt mediates the cooperation on migration, global coordination, and device setup between the application/guest OS and the VMM, above VMM-bypass I/O to the NIC.]

Page 12:

SymVirt: Overview

[Figure: On each node (#0 ... #n), the guest OS runs the MPI application, the MPI library, and a SymVirt coordinator, while a SymVirt agent runs in the VMM; a SymVirt controller runs alongside the cloud scheduler. The flow: 1) the cloud scheduler triggers events; 2) the coordinators perform global coordination; 3) each coordinator issues a SymVirt wait; 4) the controller and agents do something, e.g., migration/checkpointing; 5) the agents issue a SymVirt signal. The controller-side realization of this flow appears in the code on Page 16.]

Page 13:

SymVirt wait and signal calls

•  SymVirt provides a simple guest OS-to-VMM communication mechanism.
•  When the SymVirt coordinator issues a SymVirt wait call, the guest OS blocks until a SymVirt signal call is issued (a minimal guest-side sketch follows the figure below).
•  In the meantime, the SymVirt agent controls the VM via a VMM monitor interface.

[Figure: Timeline of a cooperative migration. The application and SymVirt coordinator run in guest OS mode; at each SymVirt wait, control passes to the SymVirt controller/agent in VMM mode, which performs the confirm, detach, migration, re-attach, and link-up steps before returning control with a SymVirt signal.]
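To make the handshake concrete, here is a minimal guest-side sketch. The symvirt_guest module and its wait() call are hypothetical stand-ins for the coordinator's hypercall path; in the real system the coordinator lives inside the MPI runtime (see Pages 14-15).

    # Minimal guest-side sketch of the SymVirt handshake.
    # `symvirt_guest` and wait() are hypothetical placeholders.
    import symvirt_guest

    def migrate_safely():
        # By this point the MPI coordination protocol has drained all
        # in-flight InfiniBand traffic, so the device is quiescent.
        symvirt_guest.wait()   # trap into the VMM; the guest blocks here
                               # while the agent detaches the HCA,
                               # migrates the VM, and re-attaches a device
        # Execution resumes after the SymVirt signal; the MPI layer then
        # re-establishes its connections over the re-attached device.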

Page 14:

SymVirt: Implementation

•  We implemented SymVirt on top of QEMU/KVM and the Open MPI system.
•  User applications and the MPI runtime system work without any modifications.
•  QEMU/KVM is slightly modified to support the SymVirt wait and signal calls.
   –  A SymVirt wait call is implemented using the Intel VT-x VMCALL instruction.
   –  A SymVirt signal call is implemented as a new QEMU/KVM monitor command (a host-side sketch follows below).

[Figure: The SymVirt coordinator issues SymVirt wait; the SymVirt agent issues SymVirt signal.]
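On the host side, the signal is simply a monitor command sent to the patched QEMU. The sketch below drives it over a QMP UNIX socket; the command name 'symvirt-signal' is a hypothetical stand-in, since the new monitor command is not named here.

    # Host-side sketch: delivering the SymVirt signal via QEMU's QMP
    # monitor socket. 'symvirt-signal' is a hypothetical name for the
    # monitor command added by the SymVirt patch.
    import json
    import socket

    def qmp_command(sock_path, cmd, args=None):
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.connect(sock_path)
        s.recv(4096)  # consume the QMP greeting
        s.sendall(json.dumps({"execute": "qmp_capabilities"}).encode())
        s.recv(4096)
        msg = {"execute": cmd}
        if args is not None:
            msg["arguments"] = args
        s.sendall(json.dumps(msg).encode())
        reply = json.loads(s.recv(4096).decode())
        s.close()
        return reply

    # Unblock a guest parked in a SymVirt wait (VMCALL):
    qmp_command("/var/run/qemu-vm1.sock", "symvirt-signal")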

Page 15:

SymVirt: Implementation (cont'd)

•  The SymVirt coordinator relies heavily on the Open MPI checkpoint/restart (C/R) framework (a conceptual sketch follows below).
   –  SymVirt's global coordination is the same as the coordination protocol for MPI programs.
   –  SymVirt executes VM-level migration or C/R instead of process-level C/R using the BLCR system.
   –  SymVirt does not need to handle the LIDs and QPNs that change after a migration, because Open MPI's BTL modules are re-constructed and connections are re-established at the continue or restart phases.

BTL: Point-to-Point Byte Transfer Layer
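Conceptually, the coordinator slots a SymVirt wait into the point where the C/R framework would otherwise invoke a BLCR checkpoint. A rough sketch, in which every function name is a hypothetical placeholder for the corresponding Open MPI C/R phase:

    # Conceptual sketch of the coordinator hook; all names below are
    # hypothetical placeholders for Open MPI C/R framework phases.
    def on_checkpoint_request():
        quiesce_mpi_traffic()      # coordination drains in-flight messages
        symvirt_wait()             # VM-level migration/checkpoint happens
                                   # here, replacing the BLCR process-level
                                   # checkpoint
        rebuild_btl_connections()  # continue/restart phase: openib BTL
                                   # modules are rebuilt, picking up the
                                   # new LIDs and QPNs automatically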

Page 16:

SymVirt: Implementation (cont’d)�

•  The SymVirt controller and agent are written in Python. An example controller script:

    import symvirt

    agent_list = [migrate_from]
    ctl = symvirt.Controller(agent_list)

    # device detach
    ctl.wait_all()
    kwargs = {'tag': 'vf0'}
    ctl.device_detach(**kwargs)
    ctl.signal()

    # vm migration
    ctl.wait_all()
    kwargs = {'postcopy': True,
              'uri': 'tcp:%s:%d' % (migrate_to[0], migrate_port)}
    ctl.migrate(**kwargs)
    ctl.remove_agent(migrate_from)

    # device attach
    ctl.append_agent(migrate_to)
    ctl.wait_all()
    kwargs = {'pci_id': '04:00.0', 'tag': 'vf0'}
    ctl.device_attach(**kwargs)
    ctl.signal()

    ctl.close()
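Note the pattern: each VMM-side step (detach, migrate, attach) is entered only after wait_all() confirms the guests are parked in a SymVirt wait, and signal() releases them only after the corresponding device operation completes. This is what keeps the guest's MPI state consistent across the detach/migrate/attach sequence.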

Page 17:

SymPFT: Proactive FT system

•  A VM-level fault-tolerant (FT) system is a use case of SymVirt.

[Figure: A cloud scheduler allocates VMs backed by global storage (VM images) when a user requests a virtualized cluster consisting of 4 nodes (16 CPUs).]

Page 18:

SymPFT: Proactive FT system

•  A VM-level fault-tolerant (FT) system is a use case of SymVirt.
•  A VM is migrated from an "unhealthy" node to a "healthy" node before the node crashes (see the sketch after the figure below).

[Figure: Upon a failure prediction, the cloud scheduler re-allocates resources and migrates the VM off the failing node; VM images reside on global storage.]
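A sketch of the trigger path, reusing the symvirt.Controller API from the script on Page 16; the failure predictor and node names are hypothetical placeholders:

    # Sketch of the SymPFT trigger path (API as on Page 16).
    import symvirt

    def proactive_migrate(unhealthy, healthy, port=4444):
        ctl = symvirt.Controller([unhealthy])
        ctl.wait_all()
        ctl.device_detach(tag='vf0')   # hot-remove the passthrough HCA
        ctl.signal()
        ctl.wait_all()
        ctl.migrate(postcopy=True,
                    uri='tcp:%s:%d' % (healthy, port))
        ctl.remove_agent(unhealthy)
        ctl.append_agent(healthy)
        ctl.wait_all()
        ctl.device_attach(pci_id='04:00.0', tag='vf0')
        ctl.signal()
        ctl.close()

    # e.g., invoked when a hypothetical health monitor predicts a failure:
    # if predict_failure(node): proactive_migrate(node, spare_node)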

Page 19:

Experiment

Page 20:

Experiment

•  The overhead of SymPFT
   –  We used 8 VMs on an InfiniBand cluster.
   –  We migrated a VM once during a benchmark execution.
•  Two benchmark programs written in MPI
   –  memtest: a simple memory-intensive benchmark
   –  NAS Parallel Benchmarks (NPB) version 3.3.1
•  Overhead reduction using postcopy migration

Page 21:

Experimental setting

We used a 16-node InfiniBand cluster, which is part of the AIST Green Cloud.

Blade server: Dell PowerEdge M610
  CPU:        Intel quad-core Xeon E5540 / 2.53 GHz x2
  Chipset:    Intel 5520
  Memory:     48 GB DDR3
  InfiniBand: Mellanox ConnectX (MT26428)

Blade switch:
  InfiniBand: Mellanox M3601Q (QDR, 16 ports)

Host machine environment:
  OS:           Debian 7.0
  Linux kernel: 3.2.18
  QEMU/KVM:     1.1-rc3
  MPI:          Open MPI 1.6
  OFED:         1.5.4.1
  Compiler:     gcc/gfortran 4.4.6

VM environment:
  VCPU:   8
  Memory: 20 GB

Only one VM runs on each host, and an IB HCA is assigned to the VM using PCI passthrough.

Page 22:

Result: memtest

•  The migration time depends on the memory footprint.
   –  The migration throughput is less than 3 Gbps.
•  Both the hotplug and link-up times are approximately constant.
   –  The link-up time is not a negligible overhead (cf. Ethernet).

(This result is not included in the proceedings paper.)

[Figure: memtest execution time in seconds versus memory footprint (2/4/8/16 GB), stacked into migration, hotplug, and link-up. The migration time grows with the footprint, from 35.9 s (2 GB) to 53.7 s (16 GB), while the link-up time stays at roughly 28.5 s and the hotplug time ranges from 14.6 s down to 11.3 s.]

Page 23:

Result: NAS Parallel Benchmarks

[Figure: Execution time in seconds for NPB BT, CG, FT, and LU under baseline, precopy, and postcopy, broken down into application, precopy migration, hotplug, postcopy migration, and link-up. Annotated overheads over baseline: +105 s (BT), +97 s (CG), +299 s (FT), +103 s (LU).]

Transferred memory size during VM migration [MB]:
  BT: 4417   CG: 3394   FT: 15678   LU: 2348

•  There is no overhead during normal operation.
•  The overhead is proportional to the memory footprint.

Page 24:

Integration with postcopy migration

•  In contrast to precopy migration, postcopy migration transfers memory pages on demand, after execution has been switched to the destination node.
•  Postcopy migration can hide the overhead of the hot-add and link-up times by overlapping them with the migration (see the model after the figure below).
•  We used Yabusame, our postcopy migration implementation for QEMU/KVM.

[Figure: SymPFT (precopy) performs a) hot-del, b) migration, c) hot-add, and d) link-up strictly in sequence; SymPFT (postcopy) overlaps c) and d) with the page transfer, mitigating the overhead.]
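A toy model of the effect, with illustrative values rather than the measured ones:

    # Toy model of why postcopy hides the hot-add and link-up costs.
    # All numbers are illustrative placeholders, not measurements.
    page_transfer = 45.0   # seconds to move the VM's memory
    hotplug       = 13.0   # device hot-add time
    linkup        = 28.0   # InfiniBand link-up time

    # Precopy: the VM resumes only after every step has finished.
    precopy_overhead = page_transfer + hotplug + linkup         # 86.0 s

    # Postcopy: execution switches to the destination first, so the
    # hot-add and link-up proceed while pages are fetched on demand.
    postcopy_overhead = max(page_transfer, hotplug + linkup)    # 45.0 s

    print(precopy_overhead, postcopy_overhead)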

Page 25:

Result: Effect of postcopy migration

[Figure: Same breakdown as Page 23 (NPB BT, CG, FT, and LU under baseline, precopy, and postcopy). Annotated reductions versus precopy: -15 % (BT), -14 % (CG), -53 % (FT), and -13 s (LU).]

Transferred memory size during VM migration [MB]:
  BT: 4417   CG: 3394   FT: 15678   LU: 2348

Postcopy migration can hide the overhead of hotplug and link-up by overlapping them with the migration.

Page 26:

Related Work

•  Some VM-level reactive and proactive FT systems have been proposed for HPC systems.
   –  E.g., VNsnap: distributed snapshots of VMs.
      •  The coordination is performed by snooping the traffic of a software switch outside the VMs.
   –  They do not support VMM-bypass I/O devices.
•  Mercury: a self-virtualization technique.
   –  An OS can turn virtualization on and off on demand.
   –  It lacks a coordination mechanism among distributed VMMs.
•  SymCall: an upcall mechanism from a VMM to the guest OS, using a nested VM exit.
   –  SymVirt, by contrast, is a simple hypercall mechanism from a guest OS to the VMM, designed to work in cooperation with a cloud scheduler.

Page 27:

Conclusion and Future Work

Page 28:

Conclusion

•  We have proposed a cooperative VM migration mechanism that enables us to migrate VMs with VMM-bypass I/O devices, using a simple guest OS-to-VMM communication mechanism called SymVirt.
•  Using the proposed mechanism, we demonstrated a proactive FT system on a virtualized InfiniBand cluster.
•  We also confirmed that postcopy migration helps to reduce the downtime in the proactive FT system.
•  SymVirt can be useful not only for fault tolerance but also for load balancing and server consolidation.

Page 29:

Future Work

•  Interconnect-transparent migration, called "Ninja migration"
   –  We have submitted another conference paper on this.
•  Overhead mitigation in SymVirt
   –  The very long link-up time problem
   –  Better integration with postcopy migration
•  A generic communication layer supporting cooperative VM migration
   –  One that is independent of the MPI runtime system.

Page 30:

Thanks for your attention!

This work was partly supported by JSPS KAKENHI 24700040 and ARGO GRAPHICS, Inc.