Ceph Snapshots: Diving into Deep Waters
Greg Farnum – Red Hat – Vault 2017.03.23
● RADOS, RBD, CephFS: (Lightning) overview and how writes happen
● The (self-managed) snapshots interface
● A diversion into pool snapshots
● Snapshots in RBD, CephFS
● RADOS/OSD Snapshot implementation, pain points
Outline
Ceph’s Past & Present
● Then: UC Santa Cruz Storage Research Systems Center
● Long-term research project in petabyte-scale storage
● Trying to develop a Lustre successor
● Now: Red Hat, a commercial open-source software & support provider you might have heard of :)
(Mirantis, SuSE, Canonical, 42on, Hastexo, ...)
● Building a business; customers in virtual block devices and object storage
● ...and reaching for filesystem users!
Ceph Projects
RGW: S3- and Swift-compatible object storage with object versioning, multi-site federation, and replication
LIBRADOS: A library allowing apps direct access to RADOS (C, C++, Java, Python, Ruby, PHP)
RADOS: A software-based, reliable, autonomic, distributed object store comprised of self-healing, self-managing, intelligent storage nodes (OSDs) and lightweight monitors (Mons)
RBD: A virtual block device with snapshots, copy-on-write clones, and multi-site replication
CEPHFS: A distributed POSIX file system with coherent caches and snapshots on any directory
RADOS: Overview
RADOS Components
● OSDs: 10s to 10,000s in a cluster
– One per disk (or one per SSD, RAID group, …)
– Serve stored objects to clients
– Intelligently peer for replication & recovery
● Monitors: maintain cluster membership and state
– Provide consensus for distributed decision-making
– Small, odd number
– These do not serve stored objects to clients
Object Storage Daemons
[Diagram: each OSD daemon runs on top of a local filesystem (FS) on its disk; monitors (M) run alongside]
CRUSH: Dynamic Data Placement
● Pseudo-random placement algorithm
– Fast calculation, no lookup
– Repeatable, deterministic
● Statistically uniform distribution
● Stable mapping
– Limited data migration on change
● Rule-based configuration
– Infrastructure topology aware
– Adjustable replication
– Weighting
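CRUSH itself is more elaborate, but its headline properties — deterministic calculation with no lookup table, and limited data migration on change — can be sketched with a toy rendezvous-hash placement. Everything below is illustrative Python, not Ceph code.

```python
import hashlib

def score(obj: str, osd: str) -> int:
    # Deterministic pseudo-random score for an (object, OSD) pair
    return int.from_bytes(hashlib.sha256(f"{obj}/{osd}".encode()).digest()[:8], "big")

def place(obj: str, osds: list[str], replicas: int = 3) -> list[str]:
    # Rendezvous (highest-random-weight) hashing: pick the top-scoring OSDs.
    # Repeatable and lookup-free, like CRUSH: any client computes the same answer.
    return sorted(osds, key=lambda o: score(obj, o), reverse=True)[:replicas]

osds = [f"osd.{i}" for i in range(10)]
objs = [f"obj-{i}" for i in range(1000)]
before = {o: place(o, osds) for o in objs}

# Adding one OSD only remaps a fraction of placements (stable mapping)
after = {o: place(o, osds + ["osd.10"]) for o in objs}
moved = sum(before[o] != after[o] for o in objs)
print(f"placements changed after adding an OSD: {moved}/1000")
```

Roughly 3/11 of placements change when the 11th OSD joins — the rest of the cluster's mapping stays put, which is the property CRUSH relies on for cheap rebalancing.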
DATA IS ORGANIZED INTO POOLS
[Diagram: objects are hashed into placement groups (PGs), which live in pools A, B, C, and D across the cluster]
librados: RADOS Access for Apps
● LIBRADOS: Direct access to RADOS for applications
– C, C++, Python, PHP, Java, Erlang
– Direct access to storage nodes; no HTTP overhead
● Rich object API
– Bytes, attributes, key/value data
– Partial overwrite of existing data
– Single-object compound atomic operations
– RADOS classes (stored procedures)
aio_write(const object_t &oid, AioCompletionImpl *c,
const bufferlist& bl, size_t len, uint64_t off);
c->wait_for_safe();
write(const std::string& oid, bufferlist& bl, size_t len, uint64_t off)
RADOS: The Write Path (user)
RADOS: The Write Path (Network)
[Diagram: the client sends the write to the primary OSD, which forwards it to the replicas; commit acknowledgements flow back]
● Queue write for PG
● Lock PG
● Assign order to write op
● Package it for persistent storage
– Find current object state, etc.
● Send op to replicas
● Send to local persistent storage
● Unlock PG
● Wait for commits from persistent storage and replicas
● Send commit back to client
RADOS: The Write Path (OSD)
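A toy sketch of the ordering above — hypothetical names, not OSD code: each op gets a sequence number while the PG is locked, and the client is only acked once local storage and every replica have committed.

```python
from dataclasses import dataclass, field

@dataclass
class PG:
    next_seq: int = 0                       # order assigned under the "PG lock"
    log: list = field(default_factory=list)

    def queue_write(self, oid: str, data: bytes, replicas: int) -> int:
        # Assign a total order to the op while the PG is locked
        seq = self.next_seq
        self.next_seq += 1
        self.log.append((seq, oid))
        # Package and "send" to local persistent storage plus each replica,
        # then gather commits before acknowledging the client
        acks_needed = 1 + replicas
        acks = sum(self._commit(seq) for _ in range(acks_needed))
        assert acks == acks_needed          # all commits in before acking client
        return seq

    def _commit(self, seq: int) -> int:
        return 1    # stand-in for a durable commit acknowledgement

pg = PG()
print(pg.queue_write("foo", b"hello", replicas=2))  # -> 0
print(pg.queue_write("foo", b"world", replicas=2))  # -> 1
```

The point of the sequence number is that replicas apply ops in the same order the primary chose, so all copies of the PG log agree.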
RBD: Overview
STORING VIRTUAL DISKS
[Diagram: a VM's hypervisor uses librbd to store the virtual disk in the RADOS cluster]
RBD STORES VIRTUAL DISKS
● RADOS BLOCK DEVICE:
– Storage of disk images in RADOS
– Decouples VMs from host
– Images are striped across the cluster (pool)
– Snapshots
– Copy-on-write clones
● Support in:
– Mainline Linux kernel (2.6.39+)
– Qemu/KVM; native Xen coming soon
– OpenStack, CloudStack, OpenNebula, Proxmox
ssize_t Image::write(uint64_t ofs, size_t len, bufferlist& bl)
int Image::aio_write(uint64_t off, size_t len, bufferlist& bl,
RBD::AioCompletion *c)
RBD: The Write Path
CephFS: Overview
[Diagram: a Linux host mounts CephFS via the kernel module (or ceph-fuse, Samba, Ganesha); data and metadata flow to the RADOS cluster]
extern "C" int ceph_write(struct ceph_mount_info *cmount, int fd, const char *buf, int64_t size, int64_t offset)
CephFS: The Write Path (User)
CephFS: The Write Path (Network)
[Diagram: the client exchanges capability messages with the MDS and sends file data directly to the OSDs]
● Request the write capability (“cap”) from the MDS if not already present
● Write new data to the “ObjectCacher”
● Inline, or later when flushing:
– Send write to OSD
– Receive commit from OSD
● Return to caller
CephFS: The Write Path
The Origin of Snapshots
[john@schist backups]$ touch history
[john@schist backups]$ cd .snap
[john@schist .snap]$ mkdir snap1
[john@schist .snap]$ cd ..
[john@schist backups]$ rm -f history
[john@schist backups]$ ls
[john@schist backups]$ ls .snap/snap1
history
# Deleted file still there in the snapshot!
● For CephFS
– Arbitrary subtrees: lots of seemingly-unrelated objects snapshotting together
● Must be cheap to create
● We have external storage for any desired snapshot metadata
Snapshot Design: Goals & Limits
● Snapshots are per-object
● Driven on object write
– So snaps which logically apply to an object don’t touch it unless it’s written
● Very skinny data
– Per-object list of existing snaps
– Global list of deleted snaps
Snapshot Design: Outcome
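The “very skinny data” can be sketched in a few lines — illustrative structures, not Ceph’s: each object carries its own snap list, the pool carries one global deleted-snaps set, and the clones still worth keeping are just the difference.

```python
# Per-object list of snaps under which this object has been cloned
object_snaps = {
    "foo": [1, 3, 10],
    "bar": [3],
}

# Single global (per-pool) set of snapids that have been deleted
deleted_snaps = {3}

def live_clones(oid: str) -> list[int]:
    # A clone survives only while some undeleted snap still references it
    return [s for s in object_snaps[oid] if s not in deleted_snaps]

print(live_clones("foo"))  # -> [1, 10]
print(live_clones("bar"))  # -> [] : clone data can be trimmed
```

Note how deleting a snapshot touches only the one global set; the per-object cleanup happens lazily as objects are examined.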
RADOS: “Self-managed” snapshots
int set_snap_write_context(snapid_t seq, vector<snapid_t>& snaps);
int selfmanaged_snap_create(uint64_t *snapid);
void aio_selfmanaged_snap_create(uint64_t *snapid, AioCompletionImpl *c);
int selfmanaged_snap_remove(uint64_t snapid);
void aio_selfmanaged_snap_remove(uint64_t snapid, AioCompletionImpl *c);
int selfmanaged_snap_rollback_object(const object_t& oid, ::SnapContext& snapc, uint64_t snapid);
Librados snaps interface
“snapids” are allocated by incrementing the “snapid” and “snap_seq” members of the per-pool “pg_pool_t” OSDMap struct
Allocating Self-managed Snapshots
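In effect the monitor hands out snapids from a per-pool counter. A toy version of that bookkeeping — the two field names are borrowed from pg_pool_t, everything else is illustrative:

```python
from dataclasses import dataclass

@dataclass
class PoolSnaps:
    # Simplified mirror of the two pg_pool_t members mentioned above
    snapid: int = 0     # highest snapid ever allocated in this pool
    snap_seq: int = 0   # sequence number carried in the pool's SnapContext

    def selfmanaged_snap_create(self) -> int:
        # Allocation is just a monotonic increment; the real monitor
        # commits the new OSDMap via Paxos before the client sees the id.
        self.snapid += 1
        self.snap_seq = self.snapid
        return self.snapid

pool = PoolSnaps()
print(pool.selfmanaged_snap_create())  # -> 1
print(pool.selfmanaged_snap_create())  # -> 2
```

Because the counter only moves forward, snapids double as a global ordering: a larger snapid always denotes a later snapshot within the pool.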
Allocating Self-managed Snapshots
[Diagram: the client asks the monitor leader for a new snapid; the leader proposes it to the peon monitors, waits for the disk commit, then replies to the client]
...or just make them up yourself (CephFS does so in the MDS)
Writing With Snapshots
[Diagram: as before, the write flows client → primary → replicas, now carrying the SnapContext]
write(const std::string& oid, bufferlist& bl, size_t len, uint64_t off)
● Queue write for PG
● Lock PG
● Assign order to write op
● Package it for persistent storage
– Find current object state, etc.
– make_writeable()
● Send op to replicas
● Send to local persistent storage
● Wait for commits from persistent storage and replicas
● Send commit back to client
Snapshots: The OSD Path
● The PrimaryLogPG::make_writeable() function
– Is the “SnapContext” newer than the object already has on disk?
– (Create a transaction to) clone the existing object
– Update the stats and clone range overlap information
● PG::append_log() calls update_snap_map()
– Updates the “SnapMapper”, which maintains LevelDB entries mapping:
● snapid → object
● object → snapid
Snapshots: The OSD Path
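The clone-or-overwrite decision in make_writeable() can be sketched like this — a simplification that ignores the clone-overlap and size tracking, with illustrative names throughout:

```python
def make_writeable(snapset_seq: int, write_snapc_seq: int, write_snaps: list[int]):
    """Decide whether a write must clone the object first.

    snapset_seq:     newest snap the on-disk object already accounts for
    write_snapc_seq: seq of the SnapContext arriving with the write
    write_snaps:     snapids in that SnapContext (descending)
    """
    if write_snapc_seq > snapset_seq:
        # Snaps exist that this object hasn't seen yet: preserve the current
        # contents as a clone covering those snapids, then apply the write.
        new_snaps = [s for s in write_snaps if s > snapset_seq]
        return ("clone", new_snaps, write_snapc_seq)
    # Object is already up to date with every snap: overwrite in place
    return ("overwrite", [], snapset_seq)

print(make_writeable(3, 10, [10, 7, 3, 1]))   # -> ('clone', [10, 7], 10)
print(make_writeable(10, 10, [10, 7, 3, 1]))  # -> ('overwrite', [], 10)
```

This is why snapshots are “driven on object write”: a snap that never sees a subsequent write never costs a clone.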
struct SnapSet {
snapid_t seq;
bool head_exists;
vector<snapid_t> snaps; // descending
vector<snapid_t> clones; // ascending
map<snapid_t, interval_set<uint64_t> > clone_overlap;
map<snapid_t, uint64_t> clone_size;
}
● This is attached to the “HEAD” object in an xattr
Snapshots: OSD Data Structures
RADOS: Pool Snapshots :(
● Make snapshots “easy” for admins
● Leverage the existing per-object implementation
– Overlay the correct SnapContext automatically on writes
– Spread that SnapContext via the OSDMap
Pool Snaps: Desire
int snap_list(vector<uint64_t> *snaps);
int snap_lookup(const char *name, uint64_t *snapid);
int snap_get_name(uint64_t snapid, std::string *s);
int snap_get_stamp(uint64_t snapid, time_t *t);
int snap_create(const char* snapname);
int snap_remove(const char* snapname);
int rollback(const object_t& oid, const char *snapName);
– Note how that’s still per-object!
Librados pool snaps interface
● “Spread that SnapContext via the OSDMap”
– It’s not a point-in-time snapshot
– SnapContext spread virally as OSDMaps get pushed out
– No guaranteed temporal order between two different RBD volumes in the pool – even when attached to the same VM :(
● Inflates the OSDMap size:
per-pool map<snapid_t, pool_snap_info_t> snaps;
struct pool_snap_info_t { snapid_t snapid; utime_t stamp; string name; }
● Unlikely to solve a problem you actually have
Pool Snaps: Reality
● “Overlay the correct SnapContext automatically on writes”
– No sensible way to merge that with a self-managed SnapContext
– ...so we don’t: pick one or the other for a pool
All in all, pool snapshots are unlikely to usefully solve any problems.
Pool Snaps: Reality
RBD: Snapshot Structures
RBD Snapshots: Data Structures
struct cls_rbd_snap {
snapid_t id;
string name;
uint64_t image_size;
uint64_t features;
uint8_t protection_status;
cls_rbd_parent parent;
uint64_t flags;
utime_t timestamp;
cls::rbd::SnapshotNamespaceOnDisk snapshot_namespace;
}
RBD Snapshots: Data Structures
● cls_rbd_snap for every snapshot
● Stored in “omap” (read: LevelDB) key-value space on the RBD volume’s header object
● RBD object class exposes get_snapcontext() function, called on mount
● RBD clients “watch” on the header, get “notify” when a new snap is created to update themselves
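A sketch of what get_snapcontext() conceptually computes from the cls_rbd_snap records in the header’s omap — illustrative structures, not the object-class code:

```python
# cls_rbd_snap records as they might sit in the header object's omap,
# keyed by snapid (values trimmed to the one field used here)
header_omap = {
    4: {"name": "before-upgrade"},
    9: {"name": "nightly"},
    12: {"name": "pre-resize"},
}

def get_snapcontext(omap: dict) -> tuple[int, list[int]]:
    # The SnapContext an RBD client writes with: the newest snapid as seq,
    # plus every existing snapid in descending order
    snaps = sorted(omap, reverse=True)
    seq = snaps[0] if snaps else 0
    return seq, snaps

print(get_snapcontext(header_omap))  # -> (12, [12, 9, 4])
```

The watch/notify on the header exists precisely so that a client holding a stale copy of this list refreshes it before its next write.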
CephFS: Snapshot Structures
● For CephFS
– Arbitrary subtrees: lots of seemingly-unrelated objects snapshotting together
● Must be cheap to create
● We have external storage for any desired snapshot metadata
CephFS Snapshots: Goals & Limits
CephFS Snapshots: Memory
● Directory “CInodes” have “SnapRealms”
● Important elements:
snapid_t seq; // a version/seq # for changes to _this_ realm.
snapid_t created; // when this realm was created.
snapid_t last_created; // last snap created in _this_ realm.
snapid_t last_destroyed; // seq for last removal
snapid_t current_parent_since;
map<snapid_t, SnapInfo> snaps;
map<snapid_t, snaplink_t> past_parents; // key is "last" (or NOSNAP)
[Diagram: a directory tree /home containing greg, sage, and ric; a SnapRealm on a directory covers its whole subtree]
CephFS Snapshots: SnapRealms
CephFS Snapshots: SnapRealms
● Directory “CInodes” have “SnapRealms”
● Important elements:
snapid_t seq; // a version/seq # for changes to _this_ realm.
snapid_t created; // when this realm was created.
snapid_t last_created; // last snap created in _this_ realm.
snapid_t last_destroyed; // seq for last removal
snapid_t current_parent_since;
map<snapid_t, SnapInfo> snaps;
map<snapid_t, snaplink_t> past_parents; // key is "last" (or NOSNAP)
Construct the SnapContext!
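Roughly, constructing the SnapContext means merging snapids from the realm itself and every current and past ancestor realm — a hedged sketch of the traversal, not MDS code:

```python
class SnapRealm:
    def __init__(self, seq, snaps, parent=None, past_parents=()):
        self.seq = seq                          # version of this realm
        self.snaps = set(snaps)                 # snaps created in this realm
        self.parent = parent                    # current ancestor realm
        self.past_parents = list(past_parents)  # realms this subtree used to live under

    def build_snap_context(self):
        # Collect snapids from this realm plus every current and past ancestor
        snaps, seq, stack = set(), 0, [self]
        while stack:
            realm = stack.pop()
            snaps |= realm.snaps
            seq = max(seq, realm.seq)
            if realm.parent:
                stack.append(realm.parent)
            stack.extend(realm.past_parents)
        return seq, sorted(snaps, reverse=True)

root = SnapRealm(seq=2, snaps={1})
old_home = SnapRealm(seq=5, snaps={3}, parent=root)   # where the dir used to live
home = SnapRealm(seq=9, snaps={7}, parent=root, past_parents=[old_home])
print(home.build_snap_context())  # -> (9, [7, 3, 1])
```

The past_parents link is what drags snapid 3 into the context even though the directory has since moved — and chasing those links is exactly the pain point discussed next.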
CephFS Snapshots: Memory
● All “CInodes” have an “old_inode_t” map representing their past states for snapshots
struct old_inode_t {
snapid_t first;
inode_t inode;
std::map<string,bufferptr> xattrs;
}
CephFS Snapshots: Disk
● SnapRealms are encoded as part of inode
● Snapshotted metadata stored as old_inode_t map in memory/disk
● Snapshotted data stored in RADOS object self-managed snapshots
Example path: /<v2>/home<v5>/greg<v9>/foo
Directory entries: foo -> ino 1342, 4 MB, snaps [<1>,<3>,<10>]; bar -> ino 1001, 1024 KB; baz -> ino 1242, 2 MB
Directory object: /<ino 0,v2>/home<ino 1,v5>/greg<ino 5,v9>/ (Mydir[01], total size 7 MB)
File data object: 1342.0/HEAD
CephFS Snapshots
● Arbitrary sub-tree snapshots of the hierarchy
● Metadata stored as old_inode_t map in memory/disk
● Data stored in RADOS object snapshots
/<v2>/home<v5>/greg<v9>/foo
1342.0/HEAD
/<v1>/home<v3>/greg<v7>/foo
1342.0/1
CephFS: Snapshot Pain
CephFS Pain: Opening past parents
● Directory “CInodes” have “SnapRealms”
● Important elements:
snapid_t seq; // a version/seq # for changes to _this_ realm.
snapid_t created; // when this realm was created.
snapid_t last_created; // last snap created in _this_ realm.
snapid_t last_destroyed; // seq for last removal
snapid_t current_parent_since;
map<snapid_t, SnapInfo> snaps;
map<snapid_t, snaplink_t> past_parents;
CephFS Pain: Opening past parents
● To construct the SnapContext for a write, we need all the SnapRealms the file has ever participated in
– Because it could have been logically snapshotted in an old location but not written to since, and a new write must reflect that old location’s snapid
● So we must open all the directories the file has been a member of!
– With a single MDS, this isn’t too hard
– With multi-MDS, this can be very difficult in some scenarios
● We may not know who is “authoritative” for a directory under all failure and recovery scenarios
● If there’s been a disaster, metadata may be inaccessible, and we don’t have mechanisms for holding operations and retrying when “unrelated” metadata is inaccessible
CephFS Pain: Opening past parents
● Directory “CInodes” have “SnapRealms”
● Important elements:
snapid_t seq; // a version/seq # for changes to _this_ realm.
snapid_t created; // when this realm was created.
snapid_t last_created; // last snap created in _this_ realm.
snapid_t last_destroyed; // seq for last removal
snapid_t current_parent_since;
map<snapid_t, SnapInfo> snaps;
map<snapid_t, snaplink_t> past_parents;
Why not store snaps in all descendants instead of maintaining ancestor links?
CephFS Pain: Eliminating past parents
● The MDS opens an inode for any operation performed on it
– This includes its SnapRealm
● So we can merge snapid lists down whenever we open an inode that has a new SnapRealm
● So if we rename a directory or file into a new location, its SnapRealm already contains all the right snapids, and we don’t need a link to the past!
● I got this almost completely finished
– Reduced code line count
– Much simpler snapshot tracking code
– But….
![Page 62: Ceph Snapshots: Diving into Deep Waters · Ceph Projects RGW S3 and Swift compatible object storage with object versioning, multi-site federation, and replication LIBRADOS A library](https://reader033.vdocuments.mx/reader033/viewer/2022060521/605060a3fe5d6705f514cb34/html5/thumbnails/62.jpg)
62
CephFS Pain: Hard links
● Hard links and snapshots do not interact :(
● They should!
● That means we need to merge SnapRealms from all the linked parents of an inode
– And this is the exact same problem we have with past_parents
– Since we need to open “remote” inodes correctly, avoiding it in the common case doesn’t help us
● So, back to debugging and more edge cases
![Page 63: Ceph Snapshots: Diving into Deep Waters · Ceph Projects RGW S3 and Swift compatible object storage with object versioning, multi-site federation, and replication LIBRADOS A library](https://reader033.vdocuments.mx/reader033/viewer/2022060521/605060a3fe5d6705f514cb34/html5/thumbnails/63.jpg)
RADOS: Deleting Snapshots
![Page 64: Ceph Snapshots: Diving into Deep Waters · Ceph Projects RGW S3 and Swift compatible object storage with object versioning, multi-site federation, and replication LIBRADOS A library](https://reader033.vdocuments.mx/reader033/viewer/2022060521/605060a3fe5d6705f514cb34/html5/thumbnails/64.jpg)
64
int set_snap_write_context(snapid_t seq, vector<snapid_t>& snaps);
int selfmanaged_snap_create(uint64_t *snapid);
void aio_selfmanaged_snap_create(uint64_t *snapid, AioCompletionImpl *c);
int selfmanaged_snap_remove(uint64_t snapid);
void aio_selfmanaged_snap_remove(uint64_t snapid, AioCompletionImpl *c);
int selfmanaged_snap_rollback_object(const object_t& oid, ::SnapContext& snapc, uint64_t snapid);
Librados snaps interface
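The interface above works because every write carries a SnapContext (a seq plus the list of existing snapids, set via `set_snap_write_context`); when the OSD sees a write whose seq is newer than the object's recorded seq, it preserves the old head as a clone. The following is a toy model of that copy-on-write behavior, not OSD code; the `Obj` class and `write` helper are invented for the sketch.

```python
# Toy model of how a write's SnapContext drives copy-on-write cloning.
# Names (Obj, write) are illustrative, not from librados.

class Obj:
    def __init__(self):
        self.head = b""
        self.seq = 0        # SnapContext seq of the last write applied
        self.clones = {}    # snapid -> preserved pre-snapshot data

def write(obj, data, snap_seq, snaps):
    # snaps: existing snapids, newest first, as the client supplies them
    if snap_seq > obj.seq and obj.head:
        # head predates the newest snapshot: preserve it as a clone
        obj.clones[snaps[0]] = obj.head
    obj.seq = max(obj.seq, snap_seq)
    obj.head = data

o = Obj()
write(o, b"v1", snap_seq=0, snaps=[])
# client calls selfmanaged_snap_create -> say it returns snapid 4
write(o, b"v2", snap_seq=4, snaps=[4])
print(o.head, o.clones)  # head is v2; v1 survives as the clone for snap 4
```

Note the key property: snapshot creation itself touches no objects; the clone is made lazily, on the first write after the snapshot.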
![Page 65: Ceph Snapshots: Diving into Deep Waters · Ceph Projects RGW S3 and Swift compatible object storage with object versioning, multi-site federation, and replication LIBRADOS A library](https://reader033.vdocuments.mx/reader033/viewer/2022060521/605060a3fe5d6705f514cb34/html5/thumbnails/65.jpg)
65
“Deleting” Snapshots (Client)
Client Monitor
M M
Peons
Disk commit
![Page 66: Ceph Snapshots: Diving into Deep Waters · Ceph Projects RGW S3 and Swift compatible object storage with object versioning, multi-site federation, and replication LIBRADOS A library](https://reader033.vdocuments.mx/reader033/viewer/2022060521/605060a3fe5d6705f514cb34/html5/thumbnails/66.jpg)
66
● Generate new OSDMap updating pg_pool_t
interval_set<snapid_t> removed_snaps;
● This is really space-efficient if you consistently delete your oldest snapshots!
– Rather less so if you keep every other one forever
● ...and this looks sort of like some sensible RBD snapshot strategies (daily for a week, weekly for a month, monthly for a year)
Deleting Snapshots (Monitor)
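The space behavior of `interval_set<snapid_t>` is easy to demonstrate: contiguous snapids collapse into one interval, while alternating ones cannot merge at all. A minimal Python stand-in (the `to_runs` helper is invented for illustration):

```python
def to_runs(snapids):
    """Collapse a set of snapids into sorted [start, length] intervals,
    roughly how interval_set<snapid_t> stores removed_snaps."""
    runs = []
    for s in sorted(snapids):
        if runs and runs[-1][0] + runs[-1][1] == s:
            runs[-1][1] += 1          # adjacent: extend the previous run
        else:
            runs.append([s, 1])       # gap: start a new interval
    return runs

oldest_first = to_runs(range(1, 101))        # deleted snaps 1..100 in order
every_other  = to_runs(range(1, 101, 2))     # kept every other one forever
print(len(oldest_first), len(every_other))   # 1 interval vs. 50 intervals
```

Since this structure lives in `pg_pool_t` inside the OSDMap, a fragmented deletion pattern inflates every map epoch shipped around the cluster.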
![Page 67: Ceph Snapshots: Diving into Deep Waters · Ceph Projects RGW S3 and Swift compatible object storage with object versioning, multi-site federation, and replication LIBRADOS A library](https://reader033.vdocuments.mx/reader033/viewer/2022060521/605060a3fe5d6705f514cb34/html5/thumbnails/67.jpg)
67
● OSD advances its OSDMap
● Asynchronously:
– List objects with that snapshot via “SnapMapper”
● int get_next_objects_to_trim( snapid_t snap, unsigned max, vector<hobject_t> *out);
– For each object:
● “unlink” object clone for that snapshot – 1 coalescable IO
– Sometimes clones belong to multiple snaps so we might not delete right away
● Update object HEAD’s “SnapSet” xattr – 1+ unique IO
● Remove SnapMapper’s LevelDB entries for that object/snap pair – 1 coalescable IO
● Write down “PGLog” entry recording clone removal – 1 coalescable IO
– Note that if you trim a bunch of snaps, you do this for each one – no coalescing it down to one pass on each object :(
Deleting Snapshots (OSD)
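A simplified model of the trim walk above makes the per-snap cost concrete. Here `snap_to_objs` stands in for SnapMapper's snap→object index, and the IO tally mirrors the per-object costs on the slide; all names are invented for the sketch.

```python
# Simplified model of the OSD snap-trim loop (not OSD code).
# snap_to_objs plays the role of SnapMapper's snap -> objects index.

def trim_snap(snap_to_objs, snapsets, snap):
    ios = 0
    for obj in snap_to_objs.pop(snap, []):
        snapsets[obj].discard(snap)      # update HEAD's SnapSet xattr
        ios += 1                         # 1+ unique IO
        # unlink clone + drop SnapMapper keys + PGLog entry (coalescable)
        ios += 3
    return ios

snap_to_objs = {7: ["a", "b"], 8: ["a"]}
snapsets = {"a": {7, 8}, "b": {7}}
total = trim_snap(snap_to_objs, snapsets, 7) + trim_snap(snap_to_objs, snapsets, 8)
print(total)  # object "a" is visited twice, once per snap: no coalescing
```

Trimming snaps 7 and 8 touches object "a" in two separate passes, which is exactly the lack of coalescing the next slides complain about.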
![Page 68: Ceph Snapshots: Diving into Deep Waters · Ceph Projects RGW S3 and Swift compatible object storage with object versioning, multi-site federation, and replication LIBRADOS A library](https://reader033.vdocuments.mx/reader033/viewer/2022060521/605060a3fe5d6705f514cb34/html5/thumbnails/68.jpg)
68
● So that’s at least 1 IO per object in a snap
– potentially a lot more if we needed to fetch KV data off disk, didn’t have directories cached, etc
– This will be a lot better in BlueStore! It’s just coalescable metadata ops
● Ouch!
● Even worse: throttling is hard
– Why it’s hard is a whole talk on its own
– It’s very difficult to not overwhelm clusters if you do a lot of trimming at once
Deleting Snapshots (OSD)
![Page 69: Ceph Snapshots: Diving into Deep Waters · Ceph Projects RGW S3 and Swift compatible object storage with object versioning, multi-site federation, and replication LIBRADOS A library](https://reader033.vdocuments.mx/reader033/viewer/2022060521/605060a3fe5d6705f514cb34/html5/thumbnails/69.jpg)
RADOS: Alternate Approaches
![Page 70: Ceph Snapshots: Diving into Deep Waters · Ceph Projects RGW S3 and Swift compatible object storage with object versioning, multi-site federation, and replication LIBRADOS A library](https://reader033.vdocuments.mx/reader033/viewer/2022060521/605060a3fe5d6705f514cb34/html5/thumbnails/70.jpg)
70
● Maintain a per snapid directory with hard links!
– Every clone is linked into (all) its snapid directory(s)
– Just list the directory to identify them, then
● Update the object’s SnapSet
● Unlink from all relevant directories
● Turns out this destroys locality, in addition to being icky code
Past: Deleting Snapshots
![Page 71: Ceph Snapshots: Diving into Deep Waters · Ceph Projects RGW S3 and Swift compatible object storage with object versioning, multi-site federation, and replication LIBRADOS A library](https://reader033.vdocuments.mx/reader033/viewer/2022060521/605060a3fe5d6705f514cb34/html5/thumbnails/71.jpg)
71
● For instance, LVM snapshots?
● We don’t want to snapshot everything on an OSD at once
– No implicit “consistency groups” across RBD volumes, for instance
● So we ultimately need a snap→object mapping, since each snap touches so few objects
Present: Why per-object?
![Page 72: Ceph Snapshots: Diving into Deep Waters · Ceph Projects RGW S3 and Swift compatible object storage with object versioning, multi-site federation, and replication LIBRADOS A library](https://reader033.vdocuments.mx/reader033/viewer/2022060521/605060a3fe5d6705f514cb34/html5/thumbnails/72.jpg)
72
● Update internal interfaces for more coalescing
– There’s no architectural reason we need to scan each object per-snapshot
– Instead, maintain iterators for each snapshot we are still purging and advance them through the keyspace in step so we can do all snapshots of a particular object in one go
● Change the deleted-snapshots representation so it doesn’t inflate OSDMaps
– “deleting_snapshots” instead, which can be trimmed once all OSDs report they’ve been removed
– Store the full list of deleted snapshots in config-keys or similar, handle it with ceph-mgr
Future: Available enhancements
![Page 73: Ceph Snapshots: Diving into Deep Waters · Ceph Projects RGW S3 and Swift compatible object storage with object versioning, multi-site federation, and replication LIBRADOS A library](https://reader033.vdocuments.mx/reader033/viewer/2022060521/605060a3fe5d6705f514cb34/html5/thumbnails/73.jpg)
73
● BlueStore: It makes everything better
– Stop having to map our semantics onto a filesystem
– Snapshot deletes still require the snapid→object mapping, but the actual delete is a few key updates rolled into RocksDB – easy to coalesce
● Better throttling for users
– Short-term: hacks to enable sleeps so we don’t overwhelm the local FS
– Long-term: proper cost estimates for BlueStore that we can budget correctly (not really possible on filesystems since they don’t expose the number of IOs needed to flush the current dirty state)
Future: Available enhancements