DLibOS - Performance and Protection with a Network-on-Chip
Stephen Mallon, Vincent Gramoli, Guillaume Jourjon
University of Sydney, Data61
November 21, 2017
Outline: Background, Approach, Evaluation, Future work, Conclusion
Performance vs Protection

Features                       Kernel        Kernel bypass
Packet rates                   < 1M pkt/s    10-100M pkts/s
99.9th latency (raw packets)   > 20 us       1 us
Memcached tail latency         > 20 ms       < 500 us
Isolates IO from application   yes           no
Source of mismatch
The IO stack was designed around the assumption that IO takes milliseconds.
▶ System calls and memory copying to perform IO safely in kernel context
▶ Excessive locking within the kernel - Memcached was profiled to spend 25% of its time in kernel spinlocks
▶ Lack of core locality: the app may process on different cores from where packets arrive
▶ Batching and queueing used to amortise context-switch cost greatly impact tail latency
Modern Networking hardware mismatch
Modern hardware offers features that allow building a more efficient network stack.
▶ SR-IOV: virtualisation allows partitioned, fast and safe access to NICs
▶ RSS: the receiving core is selected from hash(src{ip,port}, dst{ip,port}), so cores do not need to share networking state (see the sketch below)
▶ Packets are placed directly into the L3 cache, avoiding batching/queueing
¹ ATC’12
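A minimal C sketch of the RSS idea above (illustrative only: real NICs use a Toeplitz hash programmed by the driver, and flow_hash() here is just a stand-in, not a hardware or DLibOS API):

    /* Illustrative sketch of RSS core selection: hash the flow's 4-tuple and
     * map it to a network core, so the same TCP flow always lands on the same
     * core and per-core state needs no locking.  Real NICs use a Toeplitz
     * hash; flow_hash() is a simple stand-in. */
    #include <stdint.h>

    struct flow {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
    };

    static uint32_t flow_hash(const struct flow *f)
    {
        uint32_t h = f->src_ip ^ f->dst_ip;
        h ^= ((uint32_t)f->src_port << 16) | f->dst_port;
        return h * 2654435761u;                /* multiplicative mixing */
    }

    static unsigned rss_core(const struct flow *f, unsigned num_net_cores)
    {
        return flow_hash(f) % num_net_cores;   /* same flow -> same core */
    }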
Berkeley Sockets - Example: Receiving
[Figure: receive timeline with Berkeley sockets - a packet arrives and is queued in the kernel; the application issues a poll syscall, then a read syscall that copies the data into user space, and only then processes it.]
Existing approach - Bypassing the OS
Avoid copies and context switches by processing the network in userspace.
▶ Existing approaches require dedicating the NIC to the application
▶ No memory isolation between application and network
▶ The application is trusted not to corrupt the network or send invalid packets
Run To Completion
Run to Completion (RTC) processes packets on the same core to ’completion’ without queueing (sketched below).
▶ Best performance according to conventional wisdom
▶ Best temporal locality (packets are already in the CPU cache)
▶ No timing guarantees that TCP timeouts/acks are handled within an appropriate time bound
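As a rough sketch of run-to-completion (poll_nic(), tcp_input() and app_process() are hypothetical stand-ins, not the DLibOS API):

    /* Run-to-completion sketch: every packet is pulled from the NIC and
     * carried through protocol and application processing on the same core,
     * with no queueing or hand-off. */
    struct packet;                            /* opaque for this sketch */
    struct packet *poll_nic(void);
    void tcp_input(struct packet *pkt);
    void app_process(struct packet *pkt);

    void rtc_loop(void)
    {
        for (;;) {
            struct packet *pkt = poll_nic();  /* busy-poll, no interrupts */
            if (!pkt)
                continue;
            tcp_input(pkt);                   /* protocol work on this core */
            app_process(pkt);                 /* app work on the same core; while
                                                 it runs, no TCP timers or acks
                                                 are serviced */
        }
    }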
Run To Completion
[Figure: run-to-completion timeline - a packet arrives from the network and is processed by the stack and the application on the same core, with no queueing.]
Manycore Architectures
▶ Increased core count at the expense of per-core performance
▶ Promises improved energy efficiency
▶ Power consumption and cooling are a limiting factor in data centres
▶ Manycore should offer better aggregate performance
Many Core TileGX
Cores communicate over network on chip instead of shared memory
                  Latency
Shared memory     > 40 ns
NoC messages      1 cycle per hop
Current Approaches
▶ Current approach: net privileged, app unprivileged - slow
▶ Net and app in same process (unprivileged) - fastest, no isolation
▶ Net in VM (semi-privileged), app unprivileged - isolation, fast
Sandstorm
Userspace TCP/IP stack and webserver on top of netmap.
▶ Presegmented webpages and stored them in memory
▶ Order of magnitude improvement over Nginx + Linux
Problems:
▶ Doesn’t work for different MTUs or changing window sizes
▶ Hardware offloads make presegmentation unnecessary
▶ No memory isolation between netstack and application
▶ No timing guarantees (no preemption)
IX
▶ Run network stack in VM ring 0, application in VM ring 3
▶ Replace Berkeley sockets with a more efficient interface
▶ Use context switching to enforce memory isolation
▶ Batching to amortise context-switching cost
▶ Interrupts and preemption to handle timers/ack processing
▶ Requires dedicating the entire NIC to IX (the kernel cannot share access)
IX improves on the problems of Sandstorm, but is it possible to achieve memory isolation and time guarantees without context switching?
Design - New Approach
[Figure: cores partitioned into dedicated Net cores and App cores.]
▶ Run network processing in a separate user process from the application.
▶ Some cores are dedicated to network processing, some to application processing.
▶ Explicitly share only some memory between net, app and NIC to enforce memory isolation.
▶ Communicate using the Network on Chip - faster than context switching.
▶ Scheduling becomes a layout problem instead of a time-multiplexing problem.
Design
[Figure: (c) DLibOS (shaded area); (d) traditional user-level libraries, run-to-completion (RTC).]
Advantages
▶ Better use of per-core resources (cache, TLBs, instruction cache)
▶ Timing/correctness guarantees - acks/timeouts handled within a time bound, even when the application is stuck
▶ Low latency and memory isolation without context switching
▶ Doesn’t interfere with the Linux kernel (the kernel stack uses a separate MAC address on the same NIC)
▶ Can run multiple stacks - a unique MAC address per stack
Disadvantages
▶ Complexity - the application must hold onto memory until it is remotely acknowledged
▶ Limited by hardware resources (NIC, cores, memory)
▶ On-chip network: care needs to be taken to avoid deadlocks and stalling
▶ Fewer cores for app and network than the traditional approach, which can lead to over/under-provisioning
Shared Memory
[Figure: shared memory layout - app cores, net cores and the NIC share the App Mem, TX Mem and RX Mem regions, with each party given either read/write or read-only access per region.]
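A minimal sketch of mapping such regions with per-party permissions; the region names, sizes and the POSIX shm mechanism are assumptions for illustration and are not the DLibOS setup path:

    /* Sketch only: each party maps the shared regions it needs with the
     * access it is allowed. */
    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>

    #define TX_MEM_SIZE (16u << 20)
    #define RX_MEM_SIZE (16u << 20)

    static void *map_region(const char *name, size_t len, int prot)
    {
        int fd = shm_open(name, O_RDWR, 0600);   /* pre-created shared region */
        if (fd < 0)
            return NULL;
        return mmap(NULL, len, prot, MAP_SHARED, fd, 0);
    }

    int main(void)
    {
        /* app side: may fill TX buffers, but only read received data */
        void *tx = map_region("/dlibos-tx", TX_MEM_SIZE, PROT_READ | PROT_WRITE);
        void *rx = map_region("/dlibos-rx", RX_MEM_SIZE, PROT_READ);
        (void)tx; (void)rx;
        return 0;
    }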
Scatter Gather IO
[Figure: scatter-gather send - the HTTP headers ("HTTP/1.1 200 OK ...") and the web page data ("<!DOCTYPE html><html ...") are passed as Send((ptr, len), (ptr, len)); each outgoing packet carries its own packet header plus references into those buffers.]
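A sketch of what the Send((ptr, len), (ptr, len)) call in the figure might look like; send_sg() and struct sg_entry are hypothetical names, not the DLibOS socket layer:

    /* Scatter-gather sketch: the HTTP headers and the page body stay in their
     * own buffers in shared TX memory; the send call only carries references,
     * so nothing is copied. */
    #include <stddef.h>

    struct sg_entry {
        const void *ptr;        /* buffer in shared TX memory */
        size_t      len;
    };

    int send_sg(int conn, const struct sg_entry *sg, int n);   /* hypothetical */

    int send_response(int conn, const char *hdr, size_t hdr_len,
                      const char *page, size_t page_len)
    {
        struct sg_entry sg[2] = {
            { hdr,  hdr_len  },   /* "HTTP/1.1 200 OK ..." */
            { page, page_len },   /* "<!DOCTYPE html><html ..." */
        };
        /* the net core builds packet headers plus references to these buffers;
         * the NIC later DMAs the payload directly ("zero touch") */
        return send_sg(conn, sg, 2);
    }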
Zero touch
▶ The webserver saves webpages to the shared memory region
▶ A reference to the HTTP response is passed to the net core
▶ The net core passes this to hardware for DMA, without ever examining the contents of memory
▶ Web page memory never touches the processor cache (it is DMAd directly)
▶ Checksums are offloaded to hardware
Share Nothing Design
▶ Run multiple network cores
▶ Consistent flow hashing means we run a separate network stack on every net core
▶ The stacks do not share any data (ARP, flow tables)
▶ Allows network processing to scale linearly with the number of net cores
▶ Connections are identified by (netcore, id) instead of an fd - no need to synchronise an fd table (see the sketch below)
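A sketch of what a (netcore, id) handle could look like; the field widths are assumptions:

    /* Sketch of the (netcore, id) connection handle from the slide.  Because
     * ids are allocated per net core, no global fd table needs to be shared
     * or locked. */
    #include <stdint.h>

    struct conn_id {
        uint8_t  netcore;   /* net core that owns the flow (chosen by RSS) */
        uint32_t id;        /* connection index local to that core */
    };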
Intercore Messaging
Events (net -> app)    Commands (app -> net)
new_conn               accept
new_data               recv_done
data_acked             send
remote_closed          shutdown
cleanup                close
▶ The net core sends events over the on-chip network
▶ The app sends commands to the net core
▶ recv_done and data_acked are needed because neither net nor app can free a buffer until the other side is finished (see the sketch below)
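A sketch of the message vocabulary in the table above; only the event and command names come from the slide, while the enum values and message layout are assumptions:

    /* Intercore messages travel over the on-chip network rather than via
     * system calls. */
    #include <stdint.h>

    enum net_event   { EV_NEW_CONN, EV_NEW_DATA, EV_DATA_ACKED,
                       EV_REMOTE_CLOSED, EV_CLEANUP };          /* net -> app */
    enum app_command { CMD_ACCEPT, CMD_RECV_DONE, CMD_SEND,
                       CMD_SHUTDOWN, CMD_CLOSE };               /* app -> net */

    struct noc_msg {
        uint32_t conn;      /* (netcore, id) packed into one word */
        uint16_t type;      /* a net_event or an app_command */
        uint16_t len;       /* payload length, if any */
        uint64_t buf;       /* offset of the data buffer in shared memory */
    };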
Core layout
▶ Can adjust the number/ratio of network and application cores based on the workload
▶ Spread network cores out to minimise hops
▶ Restrict net/app pairs by policy, e.g. only within the same quadrant - reduces path contention (see the sketch below)
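A toy check for the quadrant policy on a 6x6 mesh (36 cores, matching the TileGX36); the core numbering scheme is an assumption made for illustration:

    /* Pairing app and net cores within one quadrant keeps NoC paths short
     * and reduces contention between flows. */
    #define MESH_DIM 6

    static int quadrant(int core)
    {
        int x = core % MESH_DIM;
        int y = core / MESH_DIM;
        return (x >= MESH_DIM / 2) + 2 * (y >= MESH_DIM / 2);   /* 0..3 */
    }

    static int may_pair(int app_core, int net_core)
    {
        return quadrant(app_core) == quadrant(net_core);
    }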
Core Layout
[Figure: core layouts - (e) all-to-all 12:24 (d12-all); (f) quadrant 12:24 (d12-q).]
Contribution
▶ Driver + network stack (eth, arp, ip, icmp, tcp) + a custom socket layer (not compatible with Berkeley sockets)
▶ The network stack can run either run-to-completion or on dedicated cores, transparently to the application
▶ Ported Memcached to run on the stack (both run-to-completion and dedicated cores)
▶ An in-memory webserver and some microbenchmarks
▶ Benchmarking results
Evaluation
▶ Evaluated the performance of our solution ‘DLibOS’ against ‘RTC’ - the same code in run-to-completion mode
▶ RTC calls functions directly instead of messaging
▶ We expect RTC to perform better, but it has no isolation
Testbed
[Figure: testbed - a TileGX36 connected to a 10 Gbps switch by 4x 10 Gbps SFP+ cables (bonded); 8 x86 clients, with one 10 Gbps SFP+ cable each, generate load and take latency measurements.]
NetPIPE
▶ Tests throughput and latency of a single connection
▶ Used one TileGX socket as a client, another as a server
▶ Also evaluated NetPIPE without performing a copy per request
▶ 6 us one-way latency for a 64-byte TCP payload
▶ Linux: 31 us one-way latency
NetPIPE
[Figure: (j) NetPIPE performance and (k) NetPIPE (no copy) - goodput (Gbps) vs message size (KB) for Linux, RTC and DLibOS. DLibOS outperforms RTC when more work per request is involved.]
Webserver
▶ Simple in-memory webserver - only supports GET
▶ NGINX used as the baseline, serving files from a ramdisk with logging disabled
▶ Run-to-completion as an upper bound
▶ wrk used to generate the workload, 2048 connections
▶ Keepalives enabled, pipelining disabled
Webserver Requests per Second
[Figure: requests per second (M req/s) vs page size (bytes) for rtc, d12-all, d12-quad, d18-all, d18-quad, d24-all, d24-quad and nginx.]
Webserver throughput
[Figure: goodput (Gbps) vs page size (bytes) for rtc, d12-all, d12-quad, d18-all, d18-quad, d24-all, d24-quad and nginx.]
Memcached
▶ ‘Mutilate’ benchmark client running on x86 machines, using the kernel stack
▶ Facebook Memcached workloads¹
▶ Baseline: Memcached running on the kernel netstack
▶ Modified Memcached to enable ‘zero-touch’ (only IO, no improvements to the core)
▶ Set a 500 us upper bound on the 99th percentile - find the highest throughput at that bound
▶ Linux was unable to meet the SLA under any load, so we also present a 5000 us 99th percentile
▶ 1392 client connections, 32 of them measuring latency
¹ SIGMETRICS’12
Memcached results
[Figure: Memcached latency (µs) vs throughput (RPS x 10³) - average and 99th-percentile latency for d12-quad, d12-all and rtc.]
RTC vs DLibOS
▶ We expect RTC to outperform specialised cores (significantly less work)
▶ RTC is better for the webserver, but worse for Memcached
▶ The webserver has almost no application work and no lock contention
▶ Before optimising Memcached, RTC performed even worse - increased lock contention increases tail latency
▶ RTC suffers a greater impact from lock contention because it uses 50% more cores
Memcached Profiling
[Figure: CPU time (%) spent in item_get and in network processing for rtc-nohp, quad-nohp, rtc-hp and quad-hp.]
▶ Profiling of RTC vs DLibOS shows a fairness issue: network processing is starved of CPU time by application processing (no pre-emption)
▶ Lock contention dominates the workload
Future Work - X86
Port the netstack to x86:
▶ Use lock-free ring buffers instead of the Network on Chip
▶ DPDK instead of a custom driver
▶ Test the effect of NUMA architectures
Storage
▶ NVMe SSDs have a similar userspace polling mechanism (see Intel SPDK)
▶ Could run net, storage and app as three classes of cores, or combine net/storage into an IO core
▶ Port a database to the IO stack
▶ Optimise for DB workloads, e.g. an append-only write-ahead log
Conclusion
▶ Invalidates the commonly held belief that user-level IO cannot achieve both protection and performance
▶ It is possible to achieve memory isolation and correct time guarantees without context switches or preemption
▶ Significantly outperforms the Linux kernel
▶ Run-to-completion can be outperformed by specialised cores - contradicting conventional wisdom