FlexSC: Exception-Less System Calls, presented at OSDI 2010
FlexSC: Flexible System Call Scheduling with
Exception-Less System Calls
Livio Soares and Michael Stumm
University of Toronto
2
Motivation
The synchronous system call interface is a legacy from the single-core era
Expensive! Costs are:
➔ direct: mode-switch
➔ indirect: processor structure pollution
FlexSC implements efficient and flexible system calls for the multicore era
3
FlexSC overview
Two contributions: FlexSC and FlexSC-Threads
Results in:
1) MySQL throughput increase of up to 40% and latency reduction of 30%
2) Apache throughput increase of up to 115% and latency reduction of 50%
4
Performance impact of synchronous syscalls
➔ Xalan from SPEC CPU 2006
  ➔ Virtually no time in the OS
➔ Linux on Intel Core i7 (Nehalem)
  ➔ Injected exceptions with varying frequencies
➔ Direct: emulate null system call
➔ Indirect: emulate "write()" system call
➔ Measured only user-mode time
  ➔ Kernel time ignored
Ideally, user-mode performance is unaltered
5
Degradation due to sync. syscalls
[Figure: Xalan (SPEC CPU 2006). Degradation (0% to 70%, lower is faster) vs. user-mode instructions between exceptions (1K to 100K, log scale). Series: Indirect, Direct. Chart annotations: Apache, MySQL.]
System calls can halve processor efficiency; the indirect cause is the major contributor
6
Processor state pollution
➔ Key source of performance impact
➔ On a Linux write() call:
  ➔ up to 2/3 of the L1 data cache and data TLB are evicted
➔ Kernel performance equally affected
  ➔ Processor efficiency for OS code is also cut in half
7
Synchronous system calls are expensive
[Diagram: User / Kernel]
Traditional system calls are synchronous and use exceptions to cross domains
8
Alternative: side-step the boundary
[Diagram: User / Kernel]
Exception-less syscalls remove synchronicity by decoupling invocation from execution
9
Benefits of exception-less system calls
[Diagram: User / Kernel]
➔ Significantly reduce direct costs
  ➔ Fewer mode switches
➔ Allow for batching
  ➔ Reduce indirect costs
➔ Allow for dynamic multicore specialization
  ➔ Further reduce direct and indirect costs
10
Exception-less interface: syscall page
write(fd, buf, 4096);

entry = free_syscall_entry();

/* write syscall */
entry->syscall = 1;
entry->num_args = 3;
entry->args[0] = fd;
entry->args[1] = buf;
entry->args[2] = 4096;
entry->status = SUBMIT;

while (entry->status != DONE)
    do_something_else();

return entry->return_code;
11
Exception-less interface: syscall page
[Same code as on the previous slide; the entry in the syscall page is now marked SUBMIT]
12
Exception-less interface: syscall page
[Same code as on the previous slide; the entry in the syscall page is now marked DONE]
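The syscall page slides can be read as a small shared-memory protocol: user code fills in an entry, publishes it by setting the status to SUBMIT, and later collects the result once the status reads DONE. The following is a minimal user-side sketch of that idea, not the FlexSC implementation. The struct layout, the FREE state, the page size, and the helpers post_write() and try_complete() are assumptions made for illustration; only the field names and the SUBMIT and DONE states come from the slides. It also assumes a single user thread posts entries.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Entry states. SUBMIT and DONE appear on the slides; FREE is an assumption. */
enum { FREE = 0, SUBMIT, DONE };

#define NUM_ENTRIES 64 /* assumption: illustrative page capacity */
#define MAX_ARGS 6

struct syscall_entry {
    int syscall;        /* system call number */
    int num_args;
    long args[MAX_ARGS];
    _Atomic int status; /* FREE -> SUBMIT -> DONE -> FREE */
    long return_code;
};

/* Zero-initialised, so every entry starts out FREE. */
static struct syscall_entry syscall_page[NUM_ENTRIES];

/* Return the first FREE entry, or NULL if the page is full. */
static struct syscall_entry *free_syscall_entry(void)
{
    for (size_t i = 0; i < NUM_ENTRIES; i++)
        if (atomic_load(&syscall_page[i].status) == FREE)
            return &syscall_page[i];
    return NULL;
}

/* Hypothetical helper: post a write() request without entering the kernel. */
static struct syscall_entry *post_write(int fd, const void *buf, size_t len)
{
    struct syscall_entry *e = free_syscall_entry();
    if (e == NULL)
        return NULL;
    e->syscall = 1; /* write, as on the slide */
    e->num_args = 3;
    e->args[0] = fd;
    e->args[1] = (long)(intptr_t)buf;
    e->args[2] = (long)len;
    /* Release store: the arguments become visible before SUBMIT does. */
    atomic_store_explicit(&e->status, SUBMIT, memory_order_release);
    return e;
}

/* Hypothetical helper: non-blocking completion check.
 * Returns 1 and stores the result in *ret if the entry is DONE, else 0. */
static int try_complete(struct syscall_entry *e, long *ret)
{
    assert(ret != NULL);
    if (atomic_load_explicit(&e->status, memory_order_acquire) != DONE)
        return 0;
    *ret = e->return_code;
    atomic_store_explicit(&e->status, FREE, memory_order_release);
    return 1;
}
```

A caller would invoke post_write(fd, buf, 4096), keep doing other work, and call try_complete() periodically instead of spinning; the side that executes the request and sets DONE is left out of this sketch.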
13
Syscall threads
➔ Kernel-only threads
  ➔ Part of application process
➔ Execute requests from syscall page
➔ Schedulable on a per-core basis
14
System call batching
➔ Request as many system calls as possible
➔ Switch to kernel-mode
➔ Start executing all posted system calls
Avoids direct and indirect costs, even on a single core
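The batching steps above can be illustrated with a toy model that only counts user-to-kernel crossings. This is a sketch under stated assumptions, not FlexSC code: the mode_switches counter, the execute() stand-in (which pretends every call succeeds), and the helpers sync_call(), post(), and enter_kernel_and_run_batch() are all hypothetical names introduced here.

```c
#include <assert.h>

enum status { FREE = 0, SUBMIT, DONE };

struct entry {
    int syscall;
    long arg;
    enum status status;
    long return_code;
};

#define NUM_ENTRIES 8 /* assumption: illustrative page capacity */

static struct entry page[NUM_ENTRIES];
static int mode_switches; /* simulated user->kernel crossings */

/* Stand-in for real kernel work: pretends the call succeeded. */
static long execute(const struct entry *e)
{
    return e->arg;
}

/* Synchronous path: every call pays for one crossing. */
static long sync_call(int nr, long arg)
{
    struct entry e = { nr, arg, SUBMIT, 0 };
    mode_switches++;
    return execute(&e);
}

/* Exception-less path, step 1: post a request without a crossing.
 * Returns the entry index, or -1 if the page is full. */
static int post(int nr, long arg)
{
    for (int i = 0; i < NUM_ENTRIES; i++) {
        if (page[i].status == FREE) {
            page[i] = (struct entry){ nr, arg, SUBMIT, 0 };
            return i;
        }
    }
    return -1;
}

/* Steps 2 and 3: a single crossing executes every posted request.
 * Returns the number of requests executed. */
static int enter_kernel_and_run_batch(void)
{
    int executed = 0;
    mode_switches++;
    for (int i = 0; i < NUM_ENTRIES; i++) {
        if (page[i].status == SUBMIT) {
            page[i].return_code = execute(&page[i]);
            page[i].status = DONE;
            executed++;
        }
    }
    return executed;
}
```

In this model, five sync_call() invocations cost five crossings, while five post() calls followed by one enter_kernel_and_run_batch() cost one. The model says nothing about cache or TLB pollution; it only shows where the direct mode-switch cost is amortized.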
15
Dynamic multicore specialization
➔ FlexSC makes specializing cores simple
➔ Dynamically adapts to workload needs
16
What programs can benefit from FlexSC?
Event-driven servers (e.g., memcached, nginx web server)
➔ Use asynchronous calls, similar to FlexSC
➔ Can use FlexSC directly
➔ Mix sync and exception-less system calls
Multi-threaded servers: FlexSC-Threads
➔ Thread library, compatible with Pthreads
➔ No changes to app. code or recompilation required
➔ Transparently converts legacy syscalls into exception-less ones
17
FlexSC-Threads library
➔ Hybrid (M-on-N) threading model
  ➔ One kernel-visible thread per core
  ➔ Many user threads per kernel-visible thread
➔ Redirects system calls (libc wrappers)
  ➔ Posts exception-less syscall to syscall page
  ➔ Switches to other user-level thread
  ➔ Resumes thread upon syscall completion
Benefits of exception-less syscalls while maintaining sequential syscall interface
18
FlexSC-Threads in action
[Diagram: User]
19
FlexSC-Threads in action
On a syscall:
➔ Post request to system call page
➔ Block user-level thread
20
FlexSC-Threads in action
[Diagram: Kernel]
On a syscall:
➔ Post request to system call page
➔ Block user-level thread
➔ Switch to next ready thread
21
FlexSC-Threads in action
[Diagram: User / Kernel]
If all user-level threads become blocked:
1) enter kernel
2) wait for completion of at least 1 syscall
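The scheduling decision described on these slides can be sketched as a small state machine. This is a simulation under stated assumptions rather than the FlexSC-Threads library: there is no real context switching, posting the request to the syscall page is omitted, and wait_for_completion() is a hypothetical stand-in that simply marks one blocked thread as ready, where the real library would enter the kernel and sleep. All names here are illustrative.

```c
#include <assert.h>

enum uthread_state { READY = 0, BLOCKED };

#define NUM_UTHREADS 4 /* user-level threads on one kernel-visible thread */

static enum uthread_state uthreads[NUM_UTHREADS]; /* all READY at start */
static int current;      /* index of the running user-level thread */
static int kernel_waits; /* times the kernel had to be entered to wait */

/* Stand-in for "enter kernel, wait for completion of at least 1 syscall":
 * here the lowest-numbered blocked thread's call is treated as completed. */
static void wait_for_completion(void)
{
    kernel_waits++;
    for (int i = 0; i < NUM_UTHREADS; i++) {
        if (uthreads[i] == BLOCKED) {
            uthreads[i] = READY;
            return;
        }
    }
}

/* Round-robin search starting after the current thread; -1 if none ready. */
static int next_ready(void)
{
    for (int step = 1; step <= NUM_UTHREADS; step++) {
        int i = (current + step) % NUM_UTHREADS;
        if (uthreads[i] == READY)
            return i;
    }
    return -1;
}

/* Called where a redirected libc wrapper would post its request:
 * block the caller, then switch to the next ready thread. If every
 * thread is blocked, wait in the kernel first. Returns the index of
 * the thread that runs next. */
static int on_syscall(void)
{
    uthreads[current] = BLOCKED;
    int next = next_ready();
    if (next < 0) {
        wait_for_completion();
        next = next_ready();
    }
    current = next;
    return current;
}
```

With four threads that each issue one system call in turn, the first three calls switch threads entirely in user mode; only the fourth, which leaves no thread ready, enters the kernel to wait.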
22
Evaluation
➔ Linux 2.6.33
➔ Nehalem (Core i7) server, 2.3 GHz
  ➔ 4 cores on a chip
➔ Clients connected on 1 Gbps network
➔ Workloads
  ➔ Sysbench on MySQL (80% user, 20% kernel)
  ➔ ApacheBench on Apache (50% user, 50% kernel)
➔ Default Linux NPTL ("sync") vs. FlexSC-Threads ("flexsc")
23
Sysbench: “OLTP” on MySQL (1 core)
[Figure: throughput (requests/sec., 0 to 500) vs. request concurrency (0 to 300). Series: flexsc, sync.]
15% improvement
24
Sysbench: “OLTP” on MySQL (4 cores)
[Figure: throughput (requests/sec., 0 to 1,000) vs. request concurrency (0 to 300). Series: flexsc, sync.]
40% improvement
25
MySQL latency per client request
[Figure: latency (ms, 0 to 1,000) for sync and flexsc on 1 core, 2 cores, and 4 cores, with 256 connections. Bars: 95th percentile, average. One off-scale bar is labeled 1900.]
Up to 30% reduction of average request latencies
26
MySQL processor metrics
[Figure: SysBench (4 cores). Relative performance (flexsc/sync, 0 to 1.4) for User and Kernel. Metrics: IPC, L3, L2, d-cache, i-cache, TLB, Branch.]
Performance improvements are a consequence of more efficient processor execution
27
ApacheBench throughput (1 core)
[Figure: throughput (requests/sec., 0 to 45,000) vs. request concurrency (0 to 1000). Series: flexsc, sync.]
80-90% improvement
28
ApacheBench throughput (4 cores)
[Figure: throughput (requests/sec., 0 to 45,000) vs. request concurrency (0 to 1000). Series: flexsc, sync.]
115% improvement
29
Apache latency per client request
[Figure: latency (ms, 0 to 30) for sync and flexsc on 1 core, 2 cores, and 4 cores, with 256 concurrent requests. Bars: 99th percentile, average. One off-scale bar is labeled 238.]
Up to 50% reduction of average request latencies
30
Apache processor metrics
[Figure: Apache (1 core). Relative performance (flexsc/sync, 0 to 2) for User and Kernel. Metrics: IPC, L3, L2, d-cache, i-cache, TLB, Branch.]
Processor efficiency doubles for kernel and user-mode execution
31
Discussion
➔ New OS architecture not necessary
  ➔ Exception-less syscalls can coexist with legacy ones
➔ Foundation for non-blocking system calls
  ➔ select() / poll() in user-space
  ➔ Interesting case of non-blocking free()
➔ Multicore ultra-specialization
  ➔ TCP Servers (Rutgers; Iftode et al.), FS Servers
➔ Single-ISA asymmetric cores
  ➔ OS-friendly cores (HP Labs; Mogul et al.)
32
Concluding Remarks
➔ System calls degrade server performance
  ➔ Processor pollution is inherent to synchronous system calls
➔ Exception-less syscalls
  ➔ Flexible and efficient system call execution
➔ FlexSC-Threads
  ➔ Leverages exception-less syscalls
  ➔ No modifications to multi-threaded applications
➔ Throughput & latency gains
  ➔ 2x throughput improvement for Apache and BIND
  ➔ 1.4x throughput improvement for MySQL