
Page 1:

Crimson

A new OSD for the age of persistent memory and fast NVMe storage

Page 2:

Storage Keeps Getting Faster

Page 3:

CPUs, not so much

With a CPU clocked at 3 GHz you can afford:

- HDD: ~20 million cycles/IO
- SSD: ~300,000 cycles/IO
- NVMe: ~6,000 cycles/IO
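These budgets are just the clock rate divided by the device's IOPS. A minimal arithmetic sketch, assuming representative throughputs of ~150 IOPS (HDD), ~10K IOPS (SATA SSD), and ~500K IOPS (NVMe); the throughput figures are illustrative assumptions, not from the slide:

    #include <cstdio>

    int main() {
        const double clock_hz = 3e9;  // 3 GHz core, as above
        // Assumed representative device throughputs (illustrative only):
        struct { const char* dev; double iops; } devices[] = {
            {"HDD", 150}, {"SATA SSD", 10000}, {"NVMe", 500000},
        };
        for (const auto& d : devices)
            std::printf("%-9s ~%.0f cycles/IO\n", d.dev, clock_hz / d.iops);
        // Prints roughly 20 million, 300,000, and 6,000 cycles/IO respectively.
    }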

Page 4:

Ceph Architecture

Ceph Components

[Diagram: client types (APP, HOST/VM, CLIENT) over RGW / RBD / CEPHFS, all built on LIBRADOS and RADOS]

- RGW: web services gateway for object storage, compatible with S3 and Swift
- RBD: reliable, fully-distributed block device with cloud platform integration
- CEPHFS: distributed file system with POSIX semantics and scale-out metadata management
- LIBRADOS: client library allowing apps to access RADOS (C, C++, Java, Python, Ruby, PHP)
- RADOS: software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors

Page 5:

RADOS

[Diagram: an APPLICATION talking directly to a RADOS CLUSTER; M denotes the monitors]

Page 6:

OSD Threading Architecture

Messenger (worker threads)
  -> osd op queue ->
PG (worker threads)
  -> kv queue ->
ObjectStore (worker threads)
  -> out_queue ->
Messenger (worker threads)

Pages 7-10:

Classic OSD Threads

[Diagram sequence stepping through the classic OSD thread handoffs; figures not captured in the transcript]

Page 11:

Crimson!

crimson-osd aims to be a replacement OSD daemon with the following goals:

- Minimize CPU overhead
  - Minimize cycles/IOP
  - Minimize cross-core communication
  - Minimize copies
  - Bypass the kernel, avoid context switches
- Enable emerging storage technologies
  - Zoned Namespaces
  - Persistent Memory
  - Fast NVMe

Page 12:

Seastar

- Single thread per CPU
- Non-blocking IO
- Scheduling done in user space
- Includes direct support for DPDK, a high-performance library for userspace networking
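To make the model concrete, here is a minimal Seastar program sketch: one reactor thread per core, with waiting expressed as a future rather than a blocked thread (assumes a working Seastar installation):

    #include <seastar/core/app-template.hh>
    #include <seastar/core/sleep.hh>
    #include <iostream>

    int main(int argc, char** argv) {
        seastar::app_template app;
        return app.run(argc, argv, [] {
            using namespace std::chrono_literals;
            // The reactor keeps running other tasks during this "wait".
            return seastar::sleep(1s).then([] {
                std::cout << "hello from the reactor\n";
            });
        });
    }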

Page 13:

Programming Model

A traditional blocking loop:

    for (;;) {
        auto req = connection.read_request();
        auto result = handle_request(req);
        connection.send_response(result);
    }

The same loop with Seastar futures:

    repeat([conn, this] {
        return conn.read_request().then([this](auto req) {
            return handle_request(req);
        }).then([conn](auto result) {
            return conn.send_response(result);
        });
    });
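The slide's futures version is simplified; Seastar's actual seastar::repeat expects the body to yield a stop_iteration value. A slightly fuller sketch, where connection, read_request, and handle_request remain the slide's hypothetical names:

    #include <seastar/core/future.hh>
    #include <seastar/core/loop.hh>  // seastar::repeat (future-util.hh in older releases)

    seastar::future<> serve(connection& conn) {
        return seastar::repeat([&conn] {
            return conn.read_request().then([](auto req) {
                return handle_request(std::move(req));
            }).then([&conn](auto result) {
                return conn.send_response(std::move(result));
            }).then([] {
                return seastar::stop_iteration::no;  // loop again
            });
        });
    }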

Page 14:

OSD Threading Architecture

Messenger (worker threads)
  -> osd op queue ->
PG (worker threads)
  -> kv queue ->
ObjectStore (worker threads)
  -> out_queue ->
Messenger (worker threads)

Page 15:

Crimson Threading Architecture

[Diagram: several Seastar reactor threads, each running its own Messenger, PG, and ObjectStore stages connected by async waits]

Page 16:

Crimson Threading Architecture

[Diagram: PGs (1.1, 1.2, 1.3, 1.f, 2.4, 2.7, 2.9, 2.d) partitioned across three Seastar reactor threads]

Page 17:

Crimson Threading Architecture

[Diagram: the same layout of PGs across reactors; an incoming message arrives: which reactor should handle it?]

Page 18:

Crimson Threading Architecture

[Diagram: the incoming message targets PG 2.4 and is delivered to the reactor holding 2.4]

Page 19:

Crimson Threading Architecture

[Diagram: the message for PG 2.4 is routed to the owning reactor; the three reactors are labeled 10001, 10002, 10003]

Page 20:

Crimson Threading Architecture

[Diagram: the same PG layout shown across OSD 1, OSD 2, and OSD 3; the incoming message for PG 2.4 goes to the OSDs holding that PG]
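The routing step these diagrams illustrate can be sketched with Seastar's cross-shard call, seastar::smp::submit_to; pg_t, osd_op, pg_to_shard, and handle_op are hypothetical stand-ins, not Crimson's actual names:

    #include <seastar/core/smp.hh>

    // Deliver an incoming op to the reactor that owns its PG, so each PG's
    // state is only ever touched from a single core.
    seastar::future<> dispatch(pg_t pgid, osd_op op) {
        unsigned shard = pg_to_shard(pgid);  // e.g. hash(pgid) % seastar::smp::count
        return seastar::smp::submit_to(shard, [op = std::move(op)]() mutable {
            return handle_op(std::move(op));  // runs on the owning reactor
        });
    }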

Page 21:

Current Status

- Working:
  - Peering
  - Replication
  - Support for RBD-based benchmarking
  - Memory-only ObjectStore shim
- In progress:
  - Recovery/backfill
  - BlueStore support (via alien) -- almost ready to merge
  - Seastore implementation

Pages 22-24:

Performance

[Benchmark charts; figures not captured in the transcript]

Page 25:

BlueStore - Alien

[Diagram: an OSD with PGs (1.0, 1.2, 1.4, 1.6, 1.8); the Seastar reactor delegates BlueStore operations to alien threads]

Page 26:

BlueStore - Seastar Native

[Diagram: BlueStore components ported to Seastar: the Allocator, data and metadata paths, RocksDB on a SeastarEnv, BlueFS*, all backed by Seastar AIO]

Page 27:

Seastore

- Targets next-generation storage technologies:
  - Zoned Namespaces
  - Persistent Memory
- Independent metadata and allocation structures for each reactor to avoid synchronization

Page 28:

Seastore - ZNS

- New NVMe specification
- Intended to address challenges with conventional FTL designs:
  - High write amplification, bad for QLC flash
  - Background garbage collection tends to impact tail latencies
- Different interface:
  - Drive divided into zones
  - Zones can only be opened, written sequentially, closed, and released (see the sketch after this list).
- As it happens, this kind of write pattern tends to be good for conventional SSDs as well.
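A toy model of the zone constraints described above; this is an illustrative sketch, not a real ZNS API:

    #include <cstdint>

    struct zone {
        enum class state { empty, open, full };
        uint64_t capacity = 0;
        uint64_t write_pointer = 0;  // next writable offset within the zone
        state st = state::empty;

        // Writes may only land at the write pointer: sequential append only.
        bool append(uint64_t len) {
            if (st == state::full || write_pointer + len > capacity)
                return false;
            st = state::open;
            write_pointer += len;
            if (write_pointer == capacity)
                st = state::full;
            return true;
        }

        // Releasing (resetting) a zone discards its contents for reuse.
        void reset() { write_pointer = 0; st = state::empty; }
    };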

Page 29:

ObjectStore

- Transactional
- Composed of a flat object namespace
- Object names may be large (>1k)
- Each object contains a key->value mapping (string->bytes) as well as a data payload
- Supports COW object clones
- Supports ordered listing of both the omap and the object namespace
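A heavily simplified interface sketch matching the description above; the names here are hypothetical, and the real Ceph ObjectStore interface is much larger:

    #include <seastar/core/future.hh>
    #include <cstdint>
    #include <string>
    #include <vector>

    using object_name = std::string;   // flat namespace; names may exceed 1k
    using bytes = std::string;

    struct object_store {
        // All mutations are grouped into transactions (elided here).
        virtual seastar::future<bytes> read(object_name o, uint64_t off, uint64_t len) = 0;
        virtual seastar::future<> write(object_name o, uint64_t off, bytes data) = 0;
        // Per-object key->value mapping (the omap).
        virtual seastar::future<bytes> omap_get(object_name o, std::string key) = 0;
        // Copy-on-write object clone.
        virtual seastar::future<> clone(object_name src, object_name dst) = 0;
        // Ordered listing of the object namespace.
        virtual seastar::future<std::vector<object_name>> list(object_name after, unsigned max) = 0;
        virtual ~object_store() = default;
    };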

Page 30:

Seastore - Logical Structure

Root
- onode_by_hobject (hobject_t -> onode_t)
  - onode
    - omap_tree
    - extent_tree
- lba_tree (lba_t -> block_info_t)
  - block_info

Page 31:

Seastore - Why use an LBA indirection?

[Diagram: blocks A, B, C, with C rewritten as C']

When GCing C, we need to do two things:
1. Find all incoming references (B in this case)
2. Write out a transaction updating (and dirtying) those references as well as writing out the new block C'

Using direct references means we still need to maintain some means of finding the parent references, and we pay the cost of updating the relatively low-fanout onode and omap trees.

By contrast, using an LBA indirection requires extra reads in the lookup path, but potentially decreases read and write amplification during GC.
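A tiny sketch of the GC benefit under the indirection, with hypothetical types: relocating a physical block rewrites only its LBA tree entry, while the onode/omap parents keep holding the unchanged logical address:

    #include <cstdint>
    #include <map>

    using lba_t = uint64_t;     // logical block address held by parents
    using paddr_t = uint64_t;   // physical location on the device
    struct block_info_t { paddr_t paddr; };

    std::map<lba_t, block_info_t> lba_tree;

    // GC moves block C to C' by updating one mapping; B's reference (an
    // lba_t) is untouched, so B does not need to be read or rewritten.
    void gc_relocate(lba_t lba, paddr_t new_location) {
        lba_tree[lba].paddr = new_location;
    }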

Page 32:

Seastore - Layout

Journal Segments

[Diagram: a journal record containing a header, deltas D' and E', a logical block B, and a physical block A; LBA trees shown before and after applying the record]
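A hypothetical sketch of the record layout in the diagram: a record carries full blocks plus deltas to be applied against existing logical blocks on replay; all type names here are illustrative assumptions:

    #include <cstdint>
    #include <string>
    #include <vector>

    using lba_t = uint64_t;
    using bytes = std::string;

    struct delta_t  { lba_t target; bytes patch; };  // e.g. deltas D', E'
    struct block_t  { lba_t lba;    bytes data;  };  // e.g. blocks A, B
    struct header_t { uint64_t seq; uint32_t len; };

    // One journal record, written sequentially into the current segment.
    struct record_t {
        header_t             header;
        std::vector<delta_t> deltas;
        std::vector<block_t> blocks;
    };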

Page 33:

Seastore - ZNS

[Diagram: the same journal-segment layout as on page 32, shown in the context of ZNS zones]

Page 34:

Seastore - Persistent Memory

- Optane now available
  - Almost DRAM-like read latencies
  - Write latency drastically lower than flash
- Seems like a good fit for caching data and metadata!
- Reads from persistent memory can simply return a ready future without waiting at all (see the sketch below).
- Likely to be integrated into Seastore as a replacement for metadata structures/journal deltas
  - In particular, could be used to replace the LBA mapping.
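A sketch of the ready-future point above; buffer, extent_id, pmem_lookup, and read_from_disk are hypothetical names, not Seastore's:

    #include <seastar/core/future.hh>

    seastar::future<buffer> read_extent(extent_id id) {
        // If the extent lives in persistent memory, it is directly
        // addressable: no device IO, so return an already-resolved future.
        if (const buffer* b = pmem_lookup(id)) {
            return seastar::make_ready_future<buffer>(*b);
        }
        // Otherwise fall back to an asynchronous NVMe read.
        return read_from_disk(id);
    }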

Page 35:

- Roadmap: https://github.com/ceph/ceph-notes/blob/master/crimson/status.rst

- Project Tracker: https://github.com/ceph/ceph/projects/2

- Documentation: https://github.com/ceph/ceph/blob/master/doc/dev/crimson.rst

- Seastar tutorial: https://github.com/scylladb/seastar/wiki/Seastar-Tutorial

Questions?