shinjuku: reconciling low tail latency & preemptive scheduling talks/2018...cpu slo: 99% 200us@...

52
Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Kostis Kaffes, Timothy Chong, Jack Humphries, Adam Belay, David Mazieres, Christos Kozyrakis

Upload: others

Post on 26-Sep-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Shinjuku: Reconciling Low Tail Latency & Preemptive

SchedulingKostis Kaffes, Timothy Chong, Jack Humphries, Adam Belay, David Mazieres,

Christos Kozyrakis

Page 2: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Problem Overview

The goal is to develop a system that allows the scheduling of mixed latency-critical workloads on a single server while satisfying service level objectives (SLO) and utilizing resources efficiently.

2

Page 3: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Problem Overview

The goal is to develop a system that allows the scheduling of mixed latency-critical workloads on a single server while satisfying service level objectives (SLO) and utilizing resources efficiently.

App 1 (e.g. key-value store)

App 2 (e.g. search engine)CPU

SLO: 99% 200us@ 2000QPS

SLO: 99% 500us@ 500QPS 3

Page 4: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Problem Overview

The goal is to develop a system that allows the scheduling of mixed latency-critical workloads on a single server while satisfying service level objectives (SLO) and utilizing resources efficiently.

App 1a (e.g. KV put/get)

App 1b (e.g. KV range query)

CPU

SLO: 99% 50us10000QPS

SLO: 99% 500us500QPS

4

Page 5: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Why build a new system?

5

Page 6: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Linux Inefficiencies

6

Linux offers many scheduling options: CFS, RR, BVT but with HIGH OVERHEADS

Page 7: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Linux Inefficiencies

7

Linux offers many scheduling options: CFS, RR, BVT but with HIGH OVERHEADS

Bloated Network Stack

Page 8: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Linux Inefficiencies

8

Linux offers many scheduling options: CFS, RR, BVT but with HIGH OVERHEADS

Bloated Network Stack

Solution:Data-plane OS

Page 9: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Data-plane OS, e.g. IX

9

Cores

Packet Queues

RSS

Page 10: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Data-plane OS, e.g. IX

10

Cores

Packet Queues

● Minimize communication● Run-to-completion● Adaptive batching

RSS

Page 11: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

But such systems are not work-conserving

11

Cores

Packet Queues

RSS

● Minimize communication● Run-to-completion● Adaptive batching

Page 12: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Work Stealing Can Alleviate The Problem (ZygOS)

12

Cores

Packet Queues

RSS

Page 13: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Work Stealing Can Alleviate The Problem (ZygOS)

13

Cores

Packet Queues

RSS

Page 14: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Work Stealing Can Alleviate The Problem (ZygOS)

14

Cores

Packet Queues

RSS

Page 15: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Work Stealing Can Alleviate The Problem (ZygOS)

15

Cores

Packet Queues

● High stealing overhead● Sub-optimal decisions● Intra-connection

head-of-line blocking● No preemption

RSS

Page 16: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Shinjuku

16

1. Centralized Queue2. Decoupling of Network Processing and Request Execution3. Scheduling Policies4. Preemption & Context Switch Mechanisms

Page 17: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Shinjuku

17

Page 18: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Scheduling Policies

18

Page 19: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Selecting a scheduling policy

[Wierman]

● The optimal scheduling policy in terms of tail latency depends on the distribution of the service time of the requests of a workload

● Heavy-Tailed vs Light-Tailed

19

Page 20: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Selecting a scheduling policy

[Wierman]

● The optimal scheduling policy in terms of tail latency depends on the distribution of the service time of the requests of a workload

● Heavy-Tailed vs Light-Tailed → Processor Sharing & FCFS

20

Page 21: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Scheduling Approach

● Centralized per request type queues● Goal → Each request stays within a target

latency

Dispatcher

Worker Threads21

Request Queues

Page 22: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Scheduling Approach

1) Which queue to select from?

Dispatcher

Worker Threads22

Request Queues

1

Page 23: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Scheduling Approach

1) Which queue to select from?2) How to set a time slice?

Dispatcher

Worker Threads23

Request Queues

2

1

Page 24: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Scheduling Approach

1) Which queue to select from?2) How to set a time slice?3) Where to enqueue an unfinished request?

Dispatcher

Worker Threads24

Request Queues

2

13

Page 25: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Which queue to select from?

Dispatcher

Worker Threads25

Request Queues

1

Page 26: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Which queue to select from?Idea 1: Queue with the largest normalized backlog

Dispatcher

Worker Threads26

Request Queues

1

Page 27: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Which queue to select from?Idea 1: Queue with the largest normalized backlog

Dispatcher

Worker Threads27

Request Queues

1

(-) Assumes knowledge of the exact execution time of all requests in the system(-) Does not take aging into account

Page 28: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Which queue to select from?Idea 2: Check the first request of each queue and select the one with the highest normalized waiting + remaining execution time

Dispatcher

Worker Threads28

Request Queues

1

Page 29: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Which queue to select from?Idea 2: Check the first request of each queue and select the one with the highest normalized waiting + remaining execution time

Dispatcher

Worker Threads29

Request Queues

1

(-) Assumes knowledge of the exact execution time of all requests in the system

Page 30: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Which queue to select from?

Final Idea: Check only the normalized waiting time of first request of each queue

Dispatcher

Worker Threads30

Request Queues

1

Page 31: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

How to set the time slice?

Dispatcher

Worker Threads31

Request Queues

2

Page 32: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

How to set the time slice?

Low overheads → 5-15us Time Slice

Dispatcher

Worker Threads32

Request Queues

2

Page 33: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Where to enqueue an unfinished request?

Dispatcher

Worker Threads33

Request Queues

3

Page 34: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Where to enqueue an unfinished request?

Idea: Approximate the optimal policy for each request type

Dispatcher

Worker Threads34

Request Queues

3

Page 35: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Where to enqueue an unfinished request?

Idea: Approximate the optimal policy for each request type

Dispatcher

Worker Threads35

Request Queues

● For light-tailed requests enqueue to the front of the queue to approximate FCFS

3

Page 36: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Where to enqueue an unfinished request?

Idea: Approximate the optimal policy for each request type

Dispatcher

Worker Threads36

Request Queues

● For light-tailed requests enqueue to the front of the queue to approximate FCFS

● For heavy-tailed requests enqueue to the back of the queue to approximate PS

3

Page 37: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Preemption & Context Switch Mechanisms

37

Page 38: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Preemption & Context Switch Mechanisms

38

Preemption Mechanism

Sender Overhead (us) Receiver Overhead (us) Total Latency (us)

Linux Signals 0.833 1.009 1.98

Optimized IPI 0.066 0.485 0.797

Page 39: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Preemption & Context Switch Mechanisms

39

Preemption Mechanism

Sender Overhead (us) Receiver Overhead (us) Total Latency (us)

Linux Signals 0.833 1.009 1.98

Optimized IPI 0.066 0.485 0.797

Mechanism Overhead (us)

swapcontext 0.394

Optimized Context Switching

0.044

Page 40: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

EVALUATION

40

Page 41: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Evaluation

41

● Compared Shinjuku to IX and ZygOS● Synthetic Microbenchmarks with different service time distributions● RocksDB - in-memory database● ZygOS, IX: 16 logical worker cores● Shinjuku:

○ 14 logical worker cores○ 1 LC for networking subsystem○ 1 LC for dispatcher

● Optimizing for 99% Latency as function of throughput

Page 42: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Synthetic Workload

42

Page 43: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Synthetic Workload

43

5x

Page 44: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Synthetic Workload

44Bimodal(50 - 1, 50 - 100) Trimodal(33 - 1, 33 - 10, 33 - 100)

Page 45: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Synthetic Workload

45Bimodal(50 - 1, 50 - 100) Trimodal(33 - 1, 33 - 10, 33 - 100)

30x 10x

Page 46: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Importance of Preemption

46Bimodal(50 - 1, 50 - 100)

Page 47: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

RocksDB

47

● Mix of GET and SCAN queries● Need to disable interrupts in “unsafe”

sections, e.g. memory allocation and locks

● Configured to be completely in-memory

Page 48: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

RocksDB

48

● Mix of GET and SCAN queries● Need to disable interrupts in “unsafe”

sections, e.g. memory allocation and locks

● Configured to be completely in-memory

99.5% GET - 0.5% SCAN(1000)

Page 49: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

RocksDB

49

● Mix of GET and SCAN queries● Need to disable interrupts in “unsafe”

sections, e.g. memory allocation and locks

● Configured to be completely in-memory

99.5% GET - 0.5% SCAN(1000)

8x

6.6x

Page 50: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

RocksDB

5050% GET - 50% SCAN(5000)

Page 51: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Future Directions

51

● Move the network subsystem to the NIC● Hardware acceleration of the dispatcher● Automatic inference of workload characteristics

Page 52: Shinjuku: Reconciling Low Tail Latency & Preemptive Scheduling Talks/2018...CPU SLO: 99% 200us@ 2000QPS SLO: 99% 500us@ 500QPS 3 Problem Overview The goal is to develop a system that

Thank you!

Questions?

52