DLibOS - Performance and Protection with a Network-on-Chip
Stephen Mallon, Vincent Gramoli, Guillaume Jourjon
University of Sydney, Data61
November 21, 2017
Outline: Background, Approach, Evaluation, Future work, Conclusion
Performance vs Protection

Features                       Kernel        Kernel bypass
Packet rates                   < 1M pkt/s    10-100M pkts/s
99.9th latency (raw packets)   > 20 us       1 us
Memcached tail latency         > 20 ms       < 500 us
Isolates IO from application   yes           no
Source of mismatch
The IO stack was designed around the assumption that IO takes milliseconds.
▶ System calls and memory copying to perform IO safely in kernel context
▶ Excessive locking within the kernel - Memcached was profiled to spend 25% of its time in kernel spinlocks
▶ Lack of core locality: the app may process on different cores from where packets arrive
▶ Batching and queueing used to amortise context-switch cost greatly impact tail latency
Modern Networking hardware mismatch
Modern hardware offers features that allow building a more efficient network stack.
▶ SR-IOV: virtualisation allows partitioned, fast and safe access to NICs
▶ RSS: the receiving core is selected from hash(src{ip,port}, dst{ip,port}), so cores do not need to share networking state (see the sketch below)
▶ Packets are placed directly into the L3 cache, avoiding batching/queueing
¹ ATC’12
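A minimal C sketch of the RSS idea above (illustrative only: real NICs use a Toeplitz hash programmed by the driver, and flow_hash() here is just a stand-in, not a hardware or DLibOS API):

    /* Illustrative sketch of RSS core selection: hash the flow's 4-tuple and
     * map it to a network core, so the same TCP flow always lands on the same
     * core and per-core state needs no locking.  Real NICs use a Toeplitz
     * hash; flow_hash() is a simple stand-in. */
    #include <stdint.h>

    struct flow {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
    };

    static uint32_t flow_hash(const struct flow *f)
    {
        uint32_t h = f->src_ip ^ f->dst_ip;
        h ^= ((uint32_t)f->src_port << 16) | f->dst_port;
        return h * 2654435761u;                /* multiplicative mixing */
    }

    static unsigned rss_core(const struct flow *f, unsigned num_net_cores)
    {
        return flow_hash(f) % num_net_cores;   /* same flow -> same core */
    }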
Berkeley Sockets - Example: Receiving
[Figure: receive timeline with Berkeley sockets - a packet arrives and is queued in the kernel; the application issues a poll syscall, then a read syscall that copies the data into user space, and only then processes it.]
Existing approach - Bypassing the OS
Avoid copies and context switches by processing the network in userspace.
▶ Existing approaches require dedicating the NIC to the application
▶ No memory isolation between application and network
▶ The application is trusted not to corrupt the network or send invalid packets
Run To Completion
Run to Completion (RTC) processes packets on the same core to ’completion’ without queueing (sketched below).
▶ Best performance according to conventional wisdom
▶ Best temporal locality (packets are already in the CPU cache)
▶ No timing guarantees that TCP timeouts/acks are handled within an appropriate time bound
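As a rough sketch of run-to-completion (poll_nic(), tcp_input() and app_process() are hypothetical stand-ins, not the DLibOS API):

    /* Run-to-completion sketch: every packet is pulled from the NIC and
     * carried through protocol and application processing on the same core,
     * with no queueing or hand-off. */
    struct packet;                            /* opaque for this sketch */
    struct packet *poll_nic(void);
    void tcp_input(struct packet *pkt);
    void app_process(struct packet *pkt);

    void rtc_loop(void)
    {
        for (;;) {
            struct packet *pkt = poll_nic();  /* busy-poll, no interrupts */
            if (!pkt)
                continue;
            tcp_input(pkt);                   /* protocol work on this core */
            app_process(pkt);                 /* app work on the same core; while
                                                 it runs, no TCP timers or acks
                                                 are serviced */
        }
    }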
Run To Completion
[Figure: run-to-completion timeline - a packet arrives from the network and is processed by the stack and the application on the same core, with no queueing.]
Manycore Architectures
▶ Increased core count at the expense of per-core performance
▶ Promises improved energy efficiency
▶ Power consumption and cooling are a limiting factor in data centres
▶ Manycore should offer better aggregate performance
Many Core TileGX
Cores communicate over network on chip instead of shared memory
                  Latency
Shared memory     > 40 ns
NoC messages      1 cycle per hop
Current Approaches
▶ Current approach: net privileged, app unprivileged - slow
▶ Net and app in same process (unprivileged) - fastest, no isolation
▶ Net in VM (semi-privileged), app unprivileged - isolation, fast
Sandstorm
Userspace TCP/IP stack and webserver on top of netmap.
▶ Presegmented webpages and stored them in memory
▶ Order of magnitude improvement over Nginx + Linux
Problems:
▶ Doesn’t work for different MTUs or changing window sizes
▶ Hardware offloads make presegmentation unnecessary
▶ No memory isolation between netstack and application
▶ No timing guarantees (no preemption)
IX
▶ Run network stack in VM ring 0, application in VM ring 3
▶ Replace Berkeley sockets with a more efficient interface
▶ Use context switching to enforce memory isolation
▶ Batching to amortise context-switching cost
▶ Interrupts and preemption to handle timers/ack processing
▶ Requires dedicating the entire NIC to IX (the kernel cannot share access)
IX improves on the problems of Sandstorm, but is it possible to achieve memory isolation and time guarantees without context switching?
Design - New Approach
[Figure: cores partitioned into dedicated Net cores and App cores.]
▶ Run network processing in a separate user process from the application.
▶ Some cores are dedicated to network processing, some to application processing.
▶ Explicitly share only some memory between net, app and NIC to enforce memory isolation.
▶ Communicate using the Network on Chip - faster than context switching.
▶ Scheduling becomes a layout problem instead of a time-multiplexing problem.
Design
[Figure: (c) DLibOS (shaded area); (d) traditional user-level libraries, run-to-completion (RTC).]
Advantages
▶ Better use of per-core resources (cache, TLBs, instruction cache)
▶ Timing/correctness guarantees - acks/timeouts handled within a time bound, even when the application is stuck
▶ Low latency and memory isolation without context switching
▶ Doesn’t interfere with the Linux kernel (the kernel stack uses a separate MAC address on the same NIC)
▶ Can run multiple stacks - a unique MAC address per stack
Disadvantages
▶ Complexity - the application must hold onto memory until it is remotely acknowledged
▶ Limited by hardware resources (NIC, cores, memory)
▶ On-chip network: care needs to be taken to avoid deadlocks and stalling
▶ Fewer cores for app and network than the traditional approach, which can lead to over/under-provisioning
Shared Memory
[Figure: shared memory layout - app cores, net cores and the NIC share the App Mem, TX Mem and RX Mem regions, with each party given either read/write or read-only access per region.]
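A minimal sketch of mapping such regions with per-party permissions; the region names, sizes and the POSIX shm mechanism are assumptions for illustration and are not the DLibOS setup path:

    /* Sketch only: each party maps the shared regions it needs with the
     * access it is allowed. */
    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>

    #define TX_MEM_SIZE (16u << 20)
    #define RX_MEM_SIZE (16u << 20)

    static void *map_region(const char *name, size_t len, int prot)
    {
        int fd = shm_open(name, O_RDWR, 0600);   /* pre-created shared region */
        if (fd < 0)
            return NULL;
        return mmap(NULL, len, prot, MAP_SHARED, fd, 0);
    }

    int main(void)
    {
        /* app side: may fill TX buffers, but only read received data */
        void *tx = map_region("/dlibos-tx", TX_MEM_SIZE, PROT_READ | PROT_WRITE);
        void *rx = map_region("/dlibos-rx", RX_MEM_SIZE, PROT_READ);
        (void)tx; (void)rx;
        return 0;
    }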
Scatter Gather IO
[Figure: scatter-gather send - the HTTP headers ("HTTP/1.1 200 OK ...") and the web page data ("<!DOCTYPE html><html ...") are passed as Send((ptr, len), (ptr, len)); each outgoing packet carries its own packet header plus references into those buffers.]
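A sketch of what the Send((ptr, len), (ptr, len)) call in the figure might look like; send_sg() and struct sg_entry are hypothetical names, not the DLibOS socket layer:

    /* Scatter-gather sketch: the HTTP headers and the page body stay in their
     * own buffers in shared TX memory; the send call only carries references,
     * so nothing is copied. */
    #include <stddef.h>

    struct sg_entry {
        const void *ptr;        /* buffer in shared TX memory */
        size_t      len;
    };

    int send_sg(int conn, const struct sg_entry *sg, int n);   /* hypothetical */

    int send_response(int conn, const char *hdr, size_t hdr_len,
                      const char *page, size_t page_len)
    {
        struct sg_entry sg[2] = {
            { hdr,  hdr_len  },   /* "HTTP/1.1 200 OK ..." */
            { page, page_len },   /* "<!DOCTYPE html><html ..." */
        };
        /* the net core builds packet headers plus references to these buffers;
         * the NIC later DMAs the payload directly ("zero touch") */
        return send_sg(conn, sg, 2);
    }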
Zero touch
▶ The webserver saves webpages to the shared memory region
▶ A reference to the HTTP response is passed to the net core
▶ The net core passes this to hardware for DMA, without ever examining the contents of memory
▶ Web page memory never touches the processor cache (it is DMAd directly)
▶ Checksums are offloaded to hardware
Share Nothing Design
▶ Run multiple network cores
▶ Consistent flow hashing means we run a separate network stack on every net core
▶ The stacks do not share any data (ARP, flow tables)
▶ Allows network processing to scale linearly with the number of net cores
▶ Connections are identified by (netcore, id) instead of an fd - no need to synchronise an fd table (see the sketch below)
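A sketch of what a (netcore, id) handle could look like; the field widths are assumptions:

    /* Sketch of the (netcore, id) connection handle from the slide.  Because
     * ids are allocated per net core, no global fd table needs to be shared
     * or locked. */
    #include <stdint.h>

    struct conn_id {
        uint8_t  netcore;   /* net core that owns the flow (chosen by RSS) */
        uint32_t id;        /* connection index local to that core */
    };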
Intercore Messaging
Events (net -> app)    Commands (app -> net)
new_conn               accept
new_data               recv_done
data_acked             send
remote_closed          shutdown
cleanup                close
▶ The net core sends events over the on-chip network
▶ The app sends commands to the net core
▶ recv_done and data_acked are needed because neither net nor app can free a buffer until the other side is finished (see the sketch below)
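A sketch of the message vocabulary in the table above; only the event and command names come from the slide, while the enum values and message layout are assumptions:

    /* Intercore messages travel over the on-chip network rather than via
     * system calls. */
    #include <stdint.h>

    enum net_event   { EV_NEW_CONN, EV_NEW_DATA, EV_DATA_ACKED,
                       EV_REMOTE_CLOSED, EV_CLEANUP };          /* net -> app */
    enum app_command { CMD_ACCEPT, CMD_RECV_DONE, CMD_SEND,
                       CMD_SHUTDOWN, CMD_CLOSE };               /* app -> net */

    struct noc_msg {
        uint32_t conn;      /* (netcore, id) packed into one word */
        uint16_t type;      /* a net_event or an app_command */
        uint16_t len;       /* payload length, if any */
        uint64_t buf;       /* offset of the data buffer in shared memory */
    };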
Core layout
▶ Can adjust the number/ratio of network and application cores based on the workload
▶ Spread network cores out to minimise hops
▶ Restrict net/app pairs by policy, e.g. only within the same quadrant - reduces path contention (see the sketch below)
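A toy check for the quadrant policy on a 6x6 mesh (36 cores, matching the TileGX36); the core numbering scheme is an assumption made for illustration:

    /* Pairing app and net cores within one quadrant keeps NoC paths short
     * and reduces contention between flows. */
    #define MESH_DIM 6

    static int quadrant(int core)
    {
        int x = core % MESH_DIM;
        int y = core / MESH_DIM;
        return (x >= MESH_DIM / 2) + 2 * (y >= MESH_DIM / 2);   /* 0..3 */
    }

    static int may_pair(int app_core, int net_core)
    {
        return quadrant(app_core) == quadrant(net_core);
    }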
Core Layout
[Figure: core layouts - (e) all-to-all 12:24 (d12-all); (f) quadrant 12:24 (d12-q).]
Contribution
▶ Driver + network stack (eth, arp, ip, icmp, tcp) + a custom socket layer (not compatible with Berkeley sockets)
▶ The network stack can run either run-to-completion or on dedicated cores, transparently to the application
▶ Ported Memcached to run on the stack (both run-to-completion and dedicated cores)
▶ An in-memory webserver and some microbenchmarks
▶ Benchmarking results
Evaluation
▶ Evaluated the performance of our solution ‘DLibOS’ against ‘RTC’ - the same code in run-to-completion mode
▶ RTC calls functions directly instead of messaging
▶ We expect RTC to perform better, but it has no isolation
Testbed
[Figure: testbed - a TileGX36 connected to a 10 Gbps switch by 4x 10 Gbps SFP+ cables (bonded); 8 x86 clients, with one 10 Gbps SFP+ cable each, generate load and take latency measurements.]
NetPIPE
▶ Tests throughput and latency of a single connection
▶ Used one TileGX socket as a client, another as a server
▶ Also evaluated NetPIPE without performing a copy per request
▶ 6 us one-way latency for a 64-byte TCP payload
▶ Linux: 31 us one-way latency
NetPIPE
[Figure: (j) NetPIPE performance and (k) NetPIPE (no copy) - goodput (Gbps) vs message size (KB) for Linux, RTC and DLibOS. DLibOS outperforms RTC when more work per request is involved.]
Webserver
▶ Simple in-memory webserver - only supports GET
▶ NGINX used as the baseline, serving files from a ramdisk with logging disabled
▶ Run-to-completion as an upper bound
▶ wrk used to generate the workload, 2048 connections
▶ Keepalives enabled, pipelining disabled
Webserver Requests per Second
[Figure: requests per second (M req/s) vs page size (bytes) for rtc, d12-all, d12-quad, d18-all, d18-quad, d24-all, d24-quad and nginx.]
Webserver throughput
[Figure: goodput (Gbps) vs page size (bytes) for rtc, d12-all, d12-quad, d18-all, d18-quad, d24-all, d24-quad and nginx.]
Memcached
▶ ‘Mutilate’ benchmark client running on x86 machines, using the kernel stack
▶ Facebook Memcached workloads¹
▶ Baseline: Memcached running on the kernel netstack
▶ Modified Memcached to enable ‘zero-touch’ (only IO, no improvements to the core)
▶ Set a 500 us upper bound on the 99th percentile - find the highest throughput at that bound
▶ Linux was unable to meet the SLA under any load, so we also present a 5000 us 99th percentile
▶ 1392 client connections, 32 of them measuring latency
¹ SIGMETRICS’12
Memcached results
[Figure: Memcached latency (µs) vs throughput (RPS x 10³) - average and 99th-percentile latency for d12-quad, d12-all and rtc.]
RTC vs DLibOS
▶ We expect RTC to outperform specialised cores (significantly less work)
▶ RTC is better for the webserver, but worse for Memcached
▶ The webserver has almost no application work and no lock contention
▶ Before optimising Memcached, RTC performed even worse - increased lock contention increases tail latency
▶ RTC suffers a greater impact from lock contention because it uses 50% more cores
Memcached Profiling
[Figure: CPU time (%) spent in item_get and in network processing for rtc-nohp, quad-nohp, rtc-hp and quad-hp.]
▶ Profiling of RTC vs DLibOS shows a fairness issue: network processing is starved of CPU time by application processing (no pre-emption)
▶ Lock contention dominates the workload
Future Work - X86
Port the netstack to x86:
▶ Use lock-free ring buffers instead of the Network on Chip
▶ DPDK instead of a custom driver
▶ Test the effect of NUMA architectures
Storage
▶ NVMe SSDs have a similar userspace polling mechanism (see Intel SPDK)
▶ Could run net, storage and app as three classes of cores, or combine net/storage into an IO core
▶ Port a database to the IO stack
▶ Optimise for DB workloads, e.g. an append-only write-ahead log
Conclusion
▶ Invalidates the commonly held belief that user-level IO cannot achieve both protection and performance
▶ It is possible to achieve memory isolation and correct time guarantees without context switches or preemption
▶ Significantly outperforms the Linux kernel
▶ Run-to-completion can be outperformed by specialised cores - contradicting conventional wisdom