
Page 1: Improving Virtual Machine Scheduling in NUMA Multicore Systems

Improving Virtual Machine Scheduling in NUMA Multicore Systems

Jia Rao, Xiaobo Zhou (University of Colorado, Colorado Springs)

Kun Wang, Cheng-Zhong Xu (Wayne State University)

http://cs.uccs.edu/~jrao/

Pages 2-4: Improving Virtual Machine Scheduling in NUMA Multicore Systems

Multicore Systems

Fundamental platform for datacenters and HPC
- power efficiency & parallelism

Performance degradation and unpredictability
- contention on shared resources: last-level cache, memory controller, ...

NUMA architecture further complicates scheduling
- low NUMA factor; remote access is not the only, or even the main, concern

Virtualization
- application- or OS-level optimizations are ineffective due to an inaccurate virtual topology

Objective: improve performance and reduce variability

Approach: add NUMA and contention awareness to virtual machine scheduling

Pages 5-7: Improving Virtual Machine Scheduling in NUMA Multicore Systems

Motivation

Enumerate different thread-to-core assignments (a pinning sketch follows this slide)
- 4 threads on a two-socket Intel Westmere NUMA machine

Calculate the worst-to-best degradation

[Figure: % degradation relative to the optimal assignment for the parallel workloads lu, sp, ft, ua, bt, cg, mg, ep and the multiprogrammed workloads sphinx3, mcf, soplex, milc]

Scheduling plays an important role for NUMA-sensitive workloads.

Some workloads are NUMA-insensitive.
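A minimal sketch of how one thread-to-core assignment out of the enumeration above can be fixed, using pthread_setaffinity_np() on Linux; this is an illustration rather than the authors' harness, and the assignment in cores[] is an assumed example.

/* Pin each benchmark thread to a chosen core; cores[] is one candidate
 * assignment to be swept during the enumeration. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

#define NTHREADS 4

static void *worker(void *arg)
{
    /* ... run one thread of the benchmark ... */
    return arg;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    int cores[NTHREADS] = {0, 1, 4, 5};   /* illustrative assignment */

    for (int i = 0; i < NTHREADS; i++) {
        pthread_create(&tid[i], NULL, worker, NULL);

        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cores[i], &set);          /* pin thread i to cores[i] */
        pthread_setaffinity_np(tid[i], sizeof(set), &set);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}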

Pages 8-11: Improving Virtual Machine Scheduling in NUMA Multicore Systems

The NUMA Architecture

Two-socket Intel Nehalem NUMA machine: Processor-0 attached to Memory node-0, Processor-1 attached to Memory node-1, connected by a cross-socket interconnect.

Clustering threads co-locates data and threads, but suffers shared resource contention.

Distributing threads incurs inter-socket communication overhead and remote memory access.

Performance depends on the complex interplay.

Scheduling is hard due to conflicting policies.

Pages 12-13: Improving Virtual Machine Scheduling in NUMA Multicore Systems

Micro-benchmark

Each thread (Thread-1 ... Thread-N) chases a private randomly-linked list of 64B nodes (private data, sized by the working set size) and atomically updates a shared data region (sized by the sharing size) with __sync_add_and_fetch.

Configurable parameters: working set size, number of threads, sharing size (a sketch of the benchmark follows this slide)
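A minimal sketch of the micro-benchmark described above; this is an illustration, not the authors' code, and the constants NTHREADS, WSS, SHARING_SIZE, ITERS, and the per-thread split of the working set are assumed values.

#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHELINE    64
#define NTHREADS     4                     /* number of threads        */
#define WSS          (32UL << 20)          /* working set size (bytes) */
#define SHARING_SIZE (128UL << 10)         /* shared data size (bytes) */
#define ITERS        (1UL << 24)

struct node { struct node *next; char pad[CACHELINE - sizeof(void *)]; };

static uint64_t shared_data[SHARING_SIZE / sizeof(uint64_t)];

/* Build a private list whose 64B nodes are visited in random order. */
static struct node *make_list(size_t nnodes)
{
    struct node *nodes = calloc(nnodes, sizeof(*nodes));
    size_t *order = malloc(nnodes * sizeof(*order));
    for (size_t i = 0; i < nnodes; i++) order[i] = i;
    for (size_t i = nnodes - 1; i > 0; i--) {          /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i + 1 < nnodes; i++)
        nodes[order[i]].next = &nodes[order[i + 1]];
    nodes[order[nnodes - 1]].next = &nodes[order[0]];  /* close the cycle */
    struct node *head = &nodes[order[0]];
    free(order);
    return head;
}

static void *worker(void *arg)
{
    (void)arg;
    /* per-thread share of the working set (an assumed split) */
    struct node *p = make_list(WSS / NTHREADS / CACHELINE);
    for (uint64_t i = 0; i < ITERS; i++) {
        p = p->next;                                   /* private pointer chase  */
        size_t slot = (size_t)(i * (CACHELINE / sizeof(uint64_t)))
                      % (SHARING_SIZE / sizeof(uint64_t));
        __sync_add_and_fetch(&shared_data[slot], 1);   /* touch shared cacheline */
    }
    return p;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++) pthread_join(tid[i], NULL);
    return 0;
}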

Page 14: Improving Virtual Machine Scheduling in NUMA Multicore Systems

Experiment Testbed: Intel Xeon E5620

Number of cores: 4 cores per socket, 2 sockets
Clock frequency: 2.40 GHz
L1 cache: 32KB I-cache, 32KB D-cache
L2 cache: 256KB, unified
L3 cache: 12MB, unified, inclusive, shared by 4 cores
IMC: 32GB/s bandwidth; 2 memory nodes, 8GB each
QPI: 5.86 GT/s, 2 links

Pages 15-19: Improving Virtual Machine Scheduling in NUMA Multicore Systems

Single Factor on Performance

[Figure: normalized runtime for (a) data locality: Local vs. Remote placement over working set sizes 4M-128M; (b) LLC contention: Sharing LLC vs. Separate LLC over working set sizes 4M-128M; (c) sharing overhead: Within-socket vs. Across-socket placement over 1-12 threads]

Configurations: (a) 1 thread, sharing disabled; (b) 4 threads, sharing disabled, data co-located with threads; (c) private data disabled, 1 cacheline shared. (A libnuma sketch of the local/remote placement follows this slide.)

Data locality:
1. When the WSS is smaller than the LLC, remote access does not hurt performance.
2. Once the WSS exceeds the LLC capacity, the larger the WSS, the harder the remote-access penalty hits performance.

LLC contention:
1. Scheduling does not affect performance when the WSS fits in the LLC.
2. As the WSS increases, LLC contention has a diminishing effect on performance.

Sharing overhead:
1. Inter-socket sharing overhead increases initially but decreases as more threads are used.

Memory footprint, thread-level parallelism, and the inter-thread sharing pattern determine how much each factor affects performance.
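A hedged sketch of how the Local vs. Remote placements in panel (a) could be forced with libnuma; the node numbering is assumed and this is not the authors' harness. Link with -lnuma.

#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this machine\n");
        return 1;
    }

    /* "local": data on the node we run on; "remote": data on the other node. */
    int mem_node = (argc > 1 && strcmp(argv[1], "remote") == 0) ? 1 : 0;
    size_t wss = 32UL << 20;                      /* 32MB working set */

    numa_run_on_node(0);                          /* execute on socket 0       */
    char *buf = numa_alloc_onnode(wss, mem_node); /* place the data explicitly */
    if (!buf) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }

    /* Stand-in for the pointer chase of the micro-benchmark. */
    for (size_t i = 0; i < wss; i += 64)
        buf[i]++;

    numa_free(buf, wss);
    return 0;
}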

Pages 20-21: Improving Virtual Machine Scheduling in NUMA Multicore Systems

Multiple Factors on Performance

[Figure: normalized runtime of Intra-S vs. Inter-S for (a) data locality & LLC contention over working set sizes 4M-128M; (b) LLC contention & sharing overhead over sharing sizes 4K-128K; (c) locality & LLC contention & sharing overhead over working set sizes 4M-128M]

Intra-S: clustering threads on node-0
Inter-S: distributing threads across sockets

1. The dominant factor determines performance.
2. The dominant factor switches as the program characteristics change.

Page 22: Improving Virtual Machine Scheduling in NUMA Multicore Systems

Virtualization

Applications and the guest OS see a virtual topology.

Virtual topology ≠ physical topology
- flat topology
- inaccurate topology
  • exposed via the Static Resource Affinity Table (SRAT)
  • becomes inaccurate due to load balancing

[Figure: normalized runtime with an accurate vs. an inaccurate topology; 4 threads, 32MB WSS, 128KB sharing size; uses set_mempolicy in libnuma (a sketch follows this slide)]
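A minimal sketch of the kind of guest-level NUMA optimization the slide refers to: binding the task's memory policy to one node with set_mempolicy() from libnuma. The node number is illustrative; with a flat or inaccurate virtual topology the binding targets the wrong physical node and loses its effect. Link with -lnuma.

#include <numaif.h>      /* set_mempolicy, MPOL_BIND */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available\n");
        return 1;
    }

    /* Bind all future allocations of this task to (virtual) node 0. */
    struct bitmask *mask = numa_allocate_nodemask();
    numa_bitmask_setbit(mask, 0);
    if (set_mempolicy(MPOL_BIND, mask->maskp, mask->size) != 0) {
        perror("set_mempolicy");
        return 1;
    }

    /* Memory touched from here on should live on node 0; with an inaccurate
     * virtual topology it may actually land on a remote physical node. */
    size_t wss = 32UL << 20;
    char *buf = malloc(wss);
    for (size_t i = 0; buf && i < wss; i += 4096)
        buf[i] = 0;

    free(buf);
    numa_free_nodemask(mask);
    return 0;
}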

Pages 23-24: Improving Virtual Machine Scheduling in NUMA Multicore Systems

Related Work

Optimization via scheduling
- Contention management: [TCS'10], [SIGMETRICS'11], [ISCA'11], [ASPLOS'10]
- Thread clustering: [EuroSys'07]
- NUMA management: [ATC'11], [ASPLOS'13]

Program- and system-level optimizations
- Program transformation: [PPoPP'10], [CGO'12]
- System support: libnuma, page replication and migration

Our work:
1. requires no offline profiling; operates online
2. addresses the complex interplays
3. assumes no knowledge of the virtual topology

Pages 25-30: Improving Virtual Machine Scheduling in NUMA Multicore Systems

Uncore Penalty as a Performance Index

The memory subsystem splits into a core part and an uncore part. Shared resource contention, remote memory access, and inter-socket communication overhead all materialize in the uncore memory subsystem, so they are all reflected in uncore stall cycles.

Uncore stall cycles ≈ constant × uncore penalty

The uncore penalty is measured as the cycles in which at least one "critical" uncore request is outstanding; Little's law relates the uncore latency, the number of outstanding parallel uncore requests, and the uncore penalty (a reconstruction follows this slide).

"critical" uncore request = demand data/instruction load + RFO request
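A hedged reconstruction of the relation sketched above, using symbols of my own choosing rather than the paper's notation: Little's law ties the average number of outstanding uncore requests N to their arrival rate \lambda and average uncore latency W, and the penalty counts the cycles with a critical request in flight.

% Hedged reconstruction; N, \lambda, W, and c are illustrative symbols.
\begin{align*}
  N &= \lambda W
      && \text{Little's law: outstanding uncore requests} = \text{rate} \times \text{latency} \\
  \text{uncore penalty} &= \#\{\text{cycles with} \geq 1 \text{ critical uncore request outstanding}\} \\
  \text{uncore stall cycles} &\approx c \cdot \text{uncore penalty},
      && c \text{ a constant}
\end{align*}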

Pages 31-33: Improving Virtual Machine Scheduling in NUMA Multicore Systems

Effectiveness of Uncore Penalty Metric

[Figure: relative difference (Inter-S vs. Intra-S) of runtime, LLC miss rate, and uncore penalty for (a) data locality & LLC contention over working set sizes 4M-128M; (b) LLC contention & sharing overhead over sharing sizes 4K-128K; (c) locality & LLC contention & sharing overhead over working set sizes 4M-128M]

• LLC miss rate only agrees with runtime in a subset of the runs.

• There is a strong linear relationship between uncore penalty and runtime.

Linear correlation coefficient r (against runtime): uncore penalty r = 0.91, LLC miss rate r = 0.61.
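For reference, the r quoted above is presumably the standard Pearson correlation coefficient between a metric m (uncore penalty or LLC miss rate) and runtime t over the n configurations:

r = \frac{\sum_{i=1}^{n}(m_i - \bar{m})(t_i - \bar{t})}
         {\sqrt{\sum_{i=1}^{n}(m_i - \bar{m})^2}\;\sqrt{\sum_{i=1}^{n}(t_i - \bar{t})^2}}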

Page 34: Improving Virtual Machine Scheduling in NUMA Multicore Systems

NUMA-aware Virtual Machine Scheduling

Monitoring uncore penalty (a bookkeeping sketch follows this slide)
- calculate each vCPU's penalty from PMU readings
- update the penalty during periodic scheduler bookkeeping

Identifying NUMA scheduling candidates
- rely on the application or guest OS to identify NUMA-sensitive vCPUs

vCPU migration
- Bias Random Migration (BRM)
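A rough sketch of the per-vCPU bookkeeping described above, written as plain C rather than Xen code. read_uncore_stall_cycles() and numa_cpu_to_node() are hypothetical stand-ins for the PMU plumbing (Perfctr-Xen in the actual implementation), and the smoothing is one plausible choice, not necessarily the paper's.

#include <stdint.h>
#include <stdio.h>

#define NODE 2                      /* number of memory nodes */

struct vcpu {
    int c;                          /* current core                     */
    int bias;                       /* migration bias (preferred node)  */
    int prob;                       /* migration probability, 0-100     */
    uint64_t unc[NODE];             /* observed global penalty per node */
    uint64_t last_stall;            /* PMU reading at the previous tick */
};

/* Hypothetical stand-ins for the PMU and topology plumbing. */
static uint64_t read_uncore_stall_cycles(int core) { (void)core; return 0; }
static int numa_cpu_to_node(int core) { return core < 4 ? 0 : 1; }

/* Called at every 10ms scheduler bookkeeping tick: charge the uncore penalty
 * accrued since the last tick to the node the vCPU currently runs on. */
static void update_penalty(struct vcpu *v)
{
    uint64_t now = read_uncore_stall_cycles(v->c);
    uint64_t delta = now - v->last_stall;
    v->last_stall = now;

    int n = numa_cpu_to_node(v->c);
    v->unc[n] = (3 * v->unc[n] + delta) / 4;   /* moving average (assumed) */
}

int main(void)
{
    struct vcpu v = { .c = 2 };
    update_penalty(&v);
    printf("penalty on node 0: %llu\n", (unsigned long long)v.unc[0]);
    return 0;
}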

Pages 35-39: Improving Virtual Machine Scheduling in NUMA Multicore Systems

Bias Random Migration

Each core's PMU supplies the per-vCPU uncore penalty; a global penalty is tracked for each memory node (Memory node-0 / Memory node-1). Update(v) refreshes the penalties and the migration bias; Migrate(v) decides where the vCPU runs next.

struct vcpu {
    ...
    int c;          // current core
    int bias;       // migration bias
    int prob;       // migration probability
    int unc[NODE];  // array of global penalties, one per node
    ...
}

Procedure Update(v)
    n = NUMA_CPU_TO_NODE(v.c)
    update the global penalty and save it in v.unc[n]
    min = argmin_x v.unc[x], x = 0, ..., NODE-1
    if v.unc[n] > v.unc[min] then
        v.prob++
    else
        v.prob--
        v.bias = n
    end if
end procedure

Procedure BiasRandomPick(v)
    rand = random() mod 100
    if rand < v.prob then
        select a core in node v.bias
        reset v.prob
    else
        select the current core
    end if
end procedure

Procedure Migrate(v)
    Update(v)
    if v is a migration candidate then
        new_core = BiasRandomPick(v)
    else
        new_core = DefaultPick(v)
    end if
end procedure

Push migration targets a node with the minimum global penalty; pull migration (stealing) is not touched, for load balancing.

Page 40: Improving Virtual Machine Scheduling in NUMA Multicore Systems

Implementation

Xen 4.0.2
- patched with Perfctr-Xen to read PMU counters
- uncore penalty updated every 10ms when Xen burns credits
- lightweight random number generator using the last 2 digits of the TSC, yielding values in [0, 99] (a sketch follows this slide)

Guest OS: Linux 2.6.32
- two new hypercalls: tag and clear
- a vCPU is tagged as a candidate if its cpus_allowed is a subset of the online CPUs
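A hedged, user-level sketch of the TSC-based random draw and the biased pick it feeds; this is not the actual Xen patch, and pick_core_in_node() with its core numbering is a hypothetical stand-in for the scheduler's own routines.

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>                      /* __rdtsc() */

struct vcpu {
    int c;      /* current core                 */
    int bias;   /* node to migrate toward       */
    int prob;   /* migration probability, 0-100 */
};

/* Lightweight RNG: keep the last two decimal digits of the TSC, i.e. a
 * value in [0, 99], cheap enough for the scheduler's hot path. */
static inline int tsc_rand100(void)
{
    return (int)(__rdtsc() % 100);
}

/* Hypothetical stand-in: first core of a node on a 2-socket, 4-core machine. */
static int pick_core_in_node(int node)
{
    return node * 4;
}

/* Biased random pick, following the BiasRandomPick pseudocode on pages 35-39. */
static int bias_random_pick(struct vcpu *v)
{
    if (tsc_rand100() < v->prob) {
        v->prob = 0;                        /* reset after a biased migration */
        return pick_core_in_node(v->bias);  /* move toward the bias node      */
    }
    return v->c;                            /* otherwise stay on the current core */
}

int main(void)
{
    struct vcpu v = { .c = 1, .bias = 1, .prob = 40 };
    printf("next core: %d\n", bias_random_pick(&v));
    return 0;
}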

Page 41: Improving Virtual Machine Scheduling in NUMA Multicore Systems

Workload

Micro-benchmark
- 4 threads, 128KB sharing size, WSS varied from 4MB to 128MB

Parallel workload
- NAS Parallel Benchmarks, except is
- compiled with OpenMP, busy-waiting synchronization

Multiprogrammed workload
- SPEC CPU2006: 4 identical copies of mcf, milc, soplex, or sphinx3
- mixed workload = mcf + milc + soplex + sphinx3

Pages 42-45: Improving Virtual Machine Scheduling in NUMA Multicore Systems

Improving Performance

[Figure: normalized runtime of Xen, BRM, and Hand-optimized for the micro-benchmark (WSS 4M-128M), the parallel workloads (lu, sp, ft, ua, bt, cg, ep, mg), and the multiprogrammed workloads (soplex, mcf, milc, sphinx3, mixed)]

Hand-optimized: the best policy, determined offline.

1. BRM outperforms Xen in most experiments and improves performance by up to 31.7%.
2. BRM performs close to Hand-optimized and even outperforms it in some cases.
3. BRM adds overhead for NUMA-insensitive workloads.

Page 45: Improving Virtual Machine Scheduling in NUMA Multicore Systems

Improving Performance

0.4

0.6

0.8

1

1.2

4M 8M 16M 32M 64M 128M

Xen BRM Hand-optimized

Nor

mal

ized

runt

ime

0.4

0.6

0.8

1

1.2

lu sp ft ua bt cg ep mg0.4

0.6

0.8

1

1.2

soplex mcf milc sphinx3 mixed

Micro-benchmark (WSS) Parallel workload Multiprogrammed workload

1. BRM outperforms Xen in most experiments and improves performance by up to 31.7%

2. BRM performs closely to Hand-optimized and even outperforms it in some cases

3. BRM adds overhead to NUMA insensitive workloads

Hand-optimized: offlinedetermined best policy

Monday, February 25, 13

Page 46: Improving Virtual Machine Scheduling in NUMA Multicore Systems

Reducing Variation

02468

101214

4M 8M 16M 32M 64M 128M

Xen BRM Hand-optimized

Rel

ativ

e st

anda

rd d

evia

tion

(%)

0

4

8

12

16

20

lu sp ft ua bt cg ep mg0

2

4

6

8

10

soplex mcf milc sphinx3 mixed

Micro-benchmark (WSS) Parallel workload Multiprogrammed workload

Monday, February 25, 13

Page 47: Improving Virtual Machine Scheduling in NUMA Multicore Systems

Reducing Variation

02468

101214

4M 8M 16M 32M 64M 128M

Xen BRM Hand-optimized

Rel

ativ

e st

anda

rd d

evia

tion

(%)

0

4

8

12

16

20

lu sp ft ua bt cg ep mg0

2

4

6

8

10

soplex mcf milc sphinx3 mixed

Micro-benchmark (WSS) Parallel workload Multiprogrammed workload

1. BRM reduces runtime variation significantly,with on average no more than 2% variations

Monday, February 25, 13

Pages 48-50: Improving Virtual Machine Scheduling in NUMA Multicore Systems

Overhead

[Figure: runtime overhead (%) with migration disabled, for 2-, 4-, 8-, and 16-thread runs of ep, ft, mg, bt, cg, ua, lu, sp]

1. Less than 2% overhead for 2- and 4-thread workloads.
2. BRM incurs up to 6.4% overhead for 8 threads, but is still useful.
3. Running 16-thread workloads is problematic.

Amazon EC2's High-CPU Extra Large instance has 8 vCPUs.

Page 51: Improving Virtual Machine Scheduling in NUMA Multicore Systems

Conclusion and Future Work

Problem
- sub-optimal scheduling and unpredictable performance

Our approach
- uncore penalty as a performance index
- bias random migration for online performance optimization
- improves performance and reduces variability

Future work
- inferring NUMA-sensitive vCPUs
- improving scalability and considering simultaneous multithreading

Page 53: Improving Virtual Machine Scheduling in NUMA Multicore Systems

Backup Slides begin here...


Page 54: Improving Virtual Machine Scheduling in NUMA Multicore Systems

Why Is IPC or CPI Harmful?*

IPC or CPI does not reflect performance, not even for the straggler thread.

* Wood, MICRO'06

[Figure: cycles per instruction over time for (a) bt and (b) lu when running alone, with a busy sibling hyperthread, and with a busy core]

Page 55: Improving Virtual Machine Scheduling in NUMA Multicore Systems

Questions

Why not Cycles per Instruction (CPI)?
- CPI is not useful for multiprocessor workloads

How to improve scalability?
- Li et al., PPoPP'09: relax the consistency requirement on global updates

Which workloads is BRM not useful for?
- short-lived jobs

Does BRM affect fairness and priorities?
- No