flexsc: exception-less system calls - presented @ osdi 2010

FlexSCFlexSCFlexible System Call Scheduling with

Exception-Less System Calls

LivioLivio SoaresSoares and Michael Stumm

University of Toronto

2

Motivation

The synchronoussynchronous system call interface is a legacy from the single core era

FlexSC implements efficient and flexibleefficient and flexiblesystem calls for the multicore era

Expensive! Costs are:➔ direct: mode-switch➔ indirect: processor structure pollution

3

FlexSC overview

Two contributions: FlexSC and FlexSC-Threads

Results in:

1) MySQL throughput increase of up to 40%and latency reduction of 30%

2) Apache throughput increase of up to 115%and latency reduction of 50%

4

Performance impact of synchronous syscalls

➔ Xalan from SPEC CPU 2006➔ Virtually no time in the OS

➔ Linux on Intel Core i7 (Nehalem)➔ Injected exceptions with varying frequencies

➔ DirectDirect: emulate null system call➔ IndirectIndirect: emulate “write()” system call

➔ Measured only user-mode time➔ Kernel time ignored

Ideally, user-mode performance is unaltered

5

1K 2K 5K 10K 20K 50K 100K0%

10%

20%

30%

40%

50%

60%

70%Xalan (SPEC CPU 2006)

IndirectDirect

user-mode instructions between exceptions(log scale)

De

gra

da

tio

n (

low

er

is f

ast

er)

Degradation due to sync. syscalls

Apache

MySQL

System calls can halfhalf processor efficiency;indirectindirect cause is major contributor

6

Processor state pollution

➔Key source of performance impact

➔On a Linux write() call:➔ up to 2/3rd of the L1 data cache and data

TLB are evictedevicted

➔Kernel performance equally affected➔ Processor efficiency for OS code is also cut

in halfhalf

7

Synchronous system calls are expensive

User

Kernel

Traditional system calls are synchronousand use exceptions to cross domains

8

Alternative: side-step the boundary

User

Kernel

Exception-less syscallsException-less syscalls remove synchronicityby decoupling invocation from execution

9

Benefits of exception-less system calls

User

Kernel

➔Significantly reduce direct costs➔ Fewer mode switches

➔Allow for batching➔ Reduce indirect costs

➔Allow for dynamic multicore specialization➔ Further reduce direct and indirect costs

10

Exception-less interface: syscall page

write(fd, buf, 4096);

entry = free_syscall_entry();

/* write syscall *//* write syscall */entry->syscall = 1;entry->num_args = 3;entry->args[0] = fd;entry->args[1] = buf;entry->args[2] = 4096;entry->status = SUBMITSUBMIT;

whilewhile (entry->status != DONEDONE)do_something_else();

returnreturn entry->return_code;

11







SUBMITSUBMIT

12







DONEDONE

13

Syscall threads

➔Kernel-only threads➔ Part of application process

➔Execute requests from syscall page➔Schedulable on a per-core basis

14

System call batching

Request as many system calls as possible Switch to kernel-mode Start executing all posted system calls

Avoids direct and indirect costs,even on a single core

15

Dynamic multicore specialization

FlexSC makes specializing cores simple Dynamically adapts to workload needs

16

What programs can benefit from FlexSC?Event-driven servers (e.g., memcached, nginx webserver)

➔ Use asynchoronous calls, similar to FlexSC ➔ Can use FlexSC directly➔ Mix sync and exception-less system calls

Multi-threaded servers: FlexSC-ThreadsFlexSC-Threads➔ Thread library, compatible with Pthreads➔ No changes to app. code or recompilation required➔ Transparently converts legacy syscalls into

exception-less ones

17

FlexSC-Threads library

➔ Hybrid (M-on-N) threading model➔ One kernel visible thread per core➔ Many user threads per kernel-visible thread

➔ Redirects system calls (libc wrappers)➔ Posts exception-less syscall to syscall page➔ Switches to other user-level thread➔ Resumes thread upon syscall completion

Benefits of exception-less syscallswhile maintaining sequential syscall interface

18

FlexSC-Threads in action

User

19


On a syscall:

Post request to system call page Block user-level thread

20


Kernel

On a syscall:

Post request to system call page Block user-level thread Switch to next ready thread

21


User

Kernel

If all user-level threads become blocked:1) enter kernel2) wait for completion of at least 1 syscall

22

Evaluation➔Linux 2.6.33

➔Nehalem (Core i7) server, 2.3GHz➔ 4 cores on a chip

➔Clients connected on 1 Gbps network

➔Workloads➔ Sysbench on MySQL (80% user, 20% kernel)➔ ApacheBench on Apache (50% user, 50% kernel)

➔Default Linux NTPL (“syncsync”) vs.

FlexSC-Threads (“flexscflexsc”)

23

Sysbench: “OLTP” on MySQL (1 core)

0 50 100 150 200 250 3000

100

200

300

400

500

flexscsync

Request Concurrency

Th

rou

gh

pu

t(r

eq

ue

sts/

sec

.)

15% improvement

24

Sysbench: “OLTP” on MySQL (4 cores)

0 50 100 150 200 250 3000

200

400

600

800

1,000

flexscsync

Request Concurrency

Th

rou

gh

pu

t(r

eq

ue

sts/

sec

.)

40% improvement

25

MySQL latency per client request

sync flexsc sync flexsc sync flexsc0

100200300400500600700800900

1,000

256 connections

95th percentileaverage

La

ten

cy

(ms)

1 core 2 cores 4 cores

1900

Up to 30% reduction of averagerequest latencies

26

MySQL processor metrics

IPCL3

L2d-cache

i-cacheTLB

BranchIPC

L3L2

d-cachei-cache

TLBBranch

0

0.2

0.4

0.6

0.8

1

1.2

1.4SysBench (4 cores)

Re

lati

ve P

erf

orm

an

ce

(fle

xsc

/syn

c)

User Kernel

Performance improvements consequence ofmore efficient processor execution

27

ApacheBench throughput (1 core)

0 200 400 600 800 10000

5,00010,00015,00020,00025,00030,00035,00040,00045,000

flexscsync

Request Concurrency

Th

rou

gh

pu

t(r

eq

ue

sts/

sec

.)

80-90% improvement

28

ApacheBench throughput (4 cores)

0 200 400 600 800 10000

5,00010,00015,00020,00025,00030,00035,00040,00045,000

flexscsync

Request Concurrency

Th

rou

gh

pu

t(r

eq

ue

sts/

sec

.)

115% improvement

29

Apache latency per client request

sync flexsc sync flexsc sync flexsc0

5

10

15

20

25

30

256 concurrent requests

99th percentileaverage

La

ten

cy

(ms)

1 core 2 cores 4 cores

238

Up to 50% reduction of averagerequest latencies

30

Apache processor metrics

IPCL3

L2d-cache

i-cacheTLB

BranchIPC

L3L2

d-cachei-cache

TLBBranch

0

0.5

1

1.5

2

Apache (1 core)

Re

lati

ve P

erf

orm

an

ce

(fle

xsc

/syn

c)

User Kernel

Processor efficiency doubles for kerneland user-mode execution

31

Discussion

➔New OS architecture not necessary➔ Exception-less syscalls can coexist with legacy ones

➔Foundation for non-blocking system calls➔ select() / poll() in user-space➔ Interesting case of non-blocking free()

➔Multicore ultra-specialization➔ TCP Servers (Rutgers; Iftode et.al), FS Servers

➔Single-ISA asymmetric cores➔ OS-friendly cores (HP Labs; Mogul et. al)

32

Concluding Remarks➔ System calls degrade server performance

➔ Processor pollution is inherent to synchronous system calls

➔ Exception-less syscallsException-less syscalls➔ Flexible and efficient system call execution

➔ FlexSC-ThreadsFlexSC-Threads ➔ Leverages exception-less syscalls➔ No modifications to multi-threaded applications

➔ Throughput & latency gains➔ 2x throughput improvement for Apache and BIND➔ 1.4x throughput improvement for MySQL

FlexSCFlexSCFlexible System Call Scheduling with

Exception-Less System Calls

LivioLivio SoaresSoares and Michael Stumm

University of Toronto

flexsc: exception-less system calls - presented @ osdi 2010

Documents

syscall threads kernel

buf entryargs2

submitwhile entrystatus

fd entryargs1

elsereturn entryreturn

flexsc flexible system

user threads

synchronous system calls