achieving performance isolation with lightweight co-kernels

Jiannan Ouyang, Brian Kocoloski, John Lange

The Prognostic Lab @ University of Pittsburgh

Kevin Pedretti

Sandia National Laboratories

HPDC 2015

Achieving Performance Isolation with Lightweight Co-Kernels

HPC Architecture

2

�  Move computation to data �  Improved data locality �  Reduced power consumption �  Reduced network traffic

Compute Node

Operating System and Runtimes (OS/R)

Simulation Analytic /

Visualization

Supercomputer

Shared Storage Cluster

Processing Cluster

Problem: massive data movement over interconnects

Traditional In Situ Data Processing

Challenge: Predictable High Performance

3

�  Tightly coupled HPC workloads are sensitive to OS noise and overhead [Petrini SC’03, Ferreira SC’08, Hoefler SC’10]

�  Specialized kernels for predictable performance �  Tailored from Linux: CNL for Cray supercomputers

�  Lightweight kernels (LWK) developed from scratch: IBM CNK, Kitten

�  Data processing workloads favor Linux environments

�  Cross workload interference

�  Shared hardware (CPU time, cache, memory bandwidth)

�  Shared system software

How to provide both Linux and specialized kernels on the same node, while ensuring performance isolation??

Approach: Lightweight Co-Kernels

4

�  Hardware resources on one node are dynamically composed into multiple partitions or enclaves

�  Independent software stacks are deployed on each enclave � Optimized for certain applications and hardware

�  Performance isolation at both the software and hardware level

Hardware Linux LWK

Analytic / Visualization

Hardware Linux

Analytic / Visualization Simulation

Simulation

Agenda

5

�  Introduction

�  The Pisces Lightweight Co-Kernel Architecture

�  Implementation

�  Evaluation

�  Related Work

�  Conclusion

Building Blocks: Kitten and Palacios

�  the Kitten Lightweight Kernel (LWK)

�  Goal: provide predictable performance for massively parallel HPC applications

�  Simple resource management policies

�  Limited kernel I/O support + direct user-level network access

�  the Palacios Lightweight Virtual Machine Monitor (VMM)

�  Goal: predictable performance

�  Lightweight resource management policies

�  Established history of providing virtualized environments for HPC [Lange et al. VEE ’11, Kocoloski and Lange ROSS ‘12]

Kitten: https://software.sandia.gov/trac/kitten Palacios: http://www.prognosticlab.org/palacios http://www.v3vee.org/

The Pisces Lightweight Co-Kernel Architecture

7

Linux

Hardware

Isolated Virtual Machine

Applications +

Virtual MachinesPalacios VMM

Kitten Co-kernel (1)

Kitten Co-kernel(2)

Isolated Application

Pisces Pisces

http://www.prognosticlab.org/pisces/

Pisces Design Goals

�  Performance isolation at both software and hardware level

�  Dynamic creation of resizable enclaves

�  Isolated virtual environments

Design Decisions

8

�  Elimination of cross OS dependencies �  Each enclave must implement its own complete set of supported

system calls � No system call forwarding is allowed

�  Internalized management of I/O �  Each enclave must provide its own I/O device drivers and manage

its hardware resources directly

�  Userspace cross enclave communication � Cross enclave communication is not a kernel provided feature �  Explicitly setup cross enclave shared memory at runtime (XEMEM)

�  Using virtualization to provide missing OS features

Cross Kernel Communication

9

Hardware'Par))on' Hardware'Par))on'

User%Context%

Kernel%Context% Linux'

Cross%Kernel*Messages*

Control'Process'

Control'Process'

Shared*Mem**Ctrl*Channel*

Linux'Compa)ble'Workloads'

Isolated'Processes''

+'Virtual'

Machines'

Shared*Mem*Communica6on*Channels*

Ki@en'CoAKernel'

XEMEM: Efficient Shared Memory for Composed Applications on Multi-OS/R Exascale Systems [Kocoloski and Lange, HPDC ‘15]

Agenda

10

�  Introduction



�  Evaluation

�  Related Work

�  Conclusion

Challenges & Approaches

11

�  How to boot a co-kernel? �  Hot-remove resources from Linux, and load co-kernel �  Reuse Linux boot code with modified target kernel address �  Restrict the Kitten co-kernel to access assigned resources only

�  How to share hardware resources among kernels? �  Hot-remove from Linux + direct assignment and adjustment (e.g.

CPU cores, memory blocks, PCI devices) �  Managed by Linux and Pisces (e.g. IOMMU)

�  How to communicate with a co-kernel? �  Kernel level: IPI + shared memory, primarily for Pisces commands �  Application level: XEMEM [Kocoloski HPDC’15]

�  How to route device interrupts?

I/O Interrupt Routing

12

Legacy Device

IO-APIC

Management Kernel Co-Kernel

IRQ Forwarder

IRQ Handler

MSI/MSI-X Device

Management Kernel Co-Kernel

IRQ Forwarder

IRQ Handler

MSI/MSI-X Device

MSI MSI INTx

IPI

Legacy Interrupt Forwarding Direct Device Assignment (w/ MSI)

•  Legacy interrupt vectors are potentially shared among multiple devices •  Pisces provides IRQ forwarding service •  IRQ forwarding is only used during initialization for PCI devices

•  Modern PCI devices support dedicated interrupt vectors (MSI/MSI-X) •  Directly route to the corresponding enclave

Implementation

13

�  Pisces �  Linux kernel module supports unmodified Linux kernels

(2.6.3x – 3.x.y) � Co-kernel initialization and management

�  Kitten (~9000 LOC changes) � Manage assigned hardware resources � Dynamic resource assignment � Kernel level communication channel

�  Palacios (~5000 LOC changes) � Dynamic resource assignment � Command forwarding channel

Pisces: http://www.prognosticlab.org/pisces/ Kitten: https://software.sandia.gov/trac/kitten Palacios: http://www.prognosticlab.org/palacios http://www.v3vee.org/

Agenda

14

�  Introduction



�  Evaluation

�  Related Work

�  Conclusion

Evaluation

15

�  8 node Dell R450 cluster

� Two six-core Intel “Ivy-Bridge” Xeon processors

�  24GB RAM split across two NUMA domains

� QDR Infiniband

� CentOS 7, Linux kernel 3.16

�  For performance isolation experiments, the hardware is

partitioned by NUMA domains.

�  i.e. Linux on one NUMA domain, co-kernel on the other

Fast Pisces Management Operations

16

Operations Latency (ms)

Booting a co-kernel 265.98

Adding a single CPU core 33.74

Adding a 128MB memory block 82.66

Adding an Ethernet NIC 118.98

Eliminating Cross Kernel Dependencies

17

solitary workloads (us) w/ other workloads (us)

Linux 3.05 3.48

co-kernel fwd 6.12 14.00

co-kernel 0.39 0.36

Execution Time of getpid()

�  Co-kernel has the best average performance �  Co-kernel has the most consistent performance �  System call forwarding has longer latency and suffers from

cross stack performance interference

Noise Analysis

18

0

5

10

15

20

0 1 2 3 4 5La

tenc

y (u

s)Time (seconds)

(a) without competing workloads

0 1 2 3 4 5Time (seconds)

(b) with competing workloads

0

5

10

15

20

0 1 2 3 4 5

Late

ncy

(us)

Time (seconds)

(a) without competing workloads

0 1 2 3 4 5Time (seconds)

(b) with competing workloads

Linux

Kitten co-kernel

Co-Kernel: less noise + better isolation * Each point represents the latency of an OS interruption

Single Node Performance

19

0

1

CentOS Kitten/KVM co-Kernel

82

83

84

85

Com

plet

ion

Tim

e (S

econ

ds)

without bgwith bg

0

250

CentOS Kitten/KVM co-Kernel

20250

20500

20750

21000

21250

Thr

ough

put (

GU

PS

)

without bgwith bg

CoMD Performance Stream Performance

Co-Kernel: consist performance + performance isolation

8 Node Performance

20

2

4

6

8

10

12

14

16

18

20

1 2 3 4 5 6 7 8

Throughput (GFLOP/s)

Number of Nodes

co-VMMnative

KVMco-VMM bgnative bg

KVM bg

w/o bg: co-VMM achieves native Linux performance w/ bg: co-VMM outperforms native Linux

Co-VMM for HPC in the Cloud

21

0

20

40

60

80

100

44 45 46 47 48 49 50 51

CD

F (%

)

Runtime (seconds)

Co-VMM Native KVM

CDF of HPCCG Performance (running with Hadoop, 8 nodes)

co-VMM: consistent performance + performance isolation

Related Work

22

� Exascale operating systems and runtimes (OS/Rs) �  Hobbes (SNL, LBNL, LANL, ORNL, U. Pitt, various universities)

�  Argo (ANL, LLNL, PNNL, various universities)

�  FusedOS (Intel / IBM)

� mOS (Intel)

� McKernel (RIKEN AICS, University of Tokyo)

Our uniqueness: performance isolation, dynamic resource composition, lightweight virtualization

Conclusion

23

�  Design and implementation of the Pisces co-kernel architecture

�  Pisces framework

� Kitten co-kernel

�  Palacios VMM for Kitten co-kernel

�  Demonstrated that the co-kernel architecture provides

� Optimized execution environments for in situ processing

�  Performance isolation

https://software.sandia.gov/trac/kitten

http://www.prognosticlab.org/pisces/

http://www.prognosticlab.org/palacios

Thank You

Jiannan Ouyang �  Ph.D. Candidate @ University of Pittsburgh �  [email protected] �  http://people.cs.pitt.edu/~ouyang/

�  The Prognostic Lab @ U. Pittsburgh �  http://www.prognosticlab.org

achieving performance isolation with lightweight co-kernels

Software