Ceph Day Nov 2012 - Sage Weil
![Page 1: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/1.jpg)
a unified distributed storage system
sage weil
ceph day – november 2, 2012
![Page 2: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/2.jpg)
outline
● why you should care
● what is it, what it's for
● how it works
  ● architecture
● how you can use it
  ● librados
  ● radosgw
  ● RBD, the ceph block device
  ● distributed file system
● roadmap
● why we do this, who we are
![Page 3: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/3.jpg)
why should you care about another storage system?
![Page 4: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/4.jpg)
requirements
● diverse storage needs
  ● object storage
  ● block devices (for VMs) with snapshots, cloning
  ● shared file system with POSIX, coherent caches
  ● structured data... files, block devices, or objects?
● scale
  ● terabytes, petabytes, exabytes
  ● heterogeneous hardware
  ● reliability and fault tolerance
![Page 5: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/5.jpg)
time
● ease of administration
● no manual data migration, load balancing
● painless scaling
  ● expansion and contraction
  ● seamless migration
![Page 6: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/6.jpg)
cost
● linear function of size or performance
● incremental expansion
  ● no fork-lift upgrades
● no vendor lock-in
  ● choice of hardware
  ● choice of software
    ● open
![Page 7: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/7.jpg)
what is ceph?
![Page 8: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/8.jpg)
unified storage system
● objects
  ● native
  ● RESTful
● block
  ● thin provisioning, snapshots, cloning
● file
  ● strong consistency, snapshots
![Page 9: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/9.jpg)
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift
APP    APP    HOST/VM    CLIENT
![Page 10: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/10.jpg)
distributed storage system
● data center scale
  ● 10s to 10,000s of machines
  ● terabytes to exabytes
● fault tolerant
  ● no single point of failure
  ● commodity hardware
● self-managing, self-healing
![Page 11: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/11.jpg)
ceph object model
● pools
  ● 1s to 100s
  ● independent namespaces or object collections
  ● replication level, placement policy
● objects
  ● bazillions
  ● blob of data (bytes to gigabytes)
  ● attributes (e.g., “version=12”; bytes to kilobytes)
  ● key/value bundle (bytes to gigabytes)
![Page 12: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/12.jpg)
why start with objects?
● more useful than (disk) blocks
  ● names in a single flat namespace
  ● variable size
  ● simple API with rich semantics
● more scalable than files
  ● no hard-to-distribute hierarchy
  ● update semantics do not span objects
  ● workload is trivially parallel
![Page 13: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/13.jpg)
[Diagram: one HUMAN, one COMPUTER, several DISKs]
![Page 14: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/14.jpg)
[Diagram: several HUMANs, one COMPUTER, several DISKs]
![Page 15: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/15.jpg)
[Diagram: many HUMANs, one (COMPUTER), many DISKs]
(actually more like this…)
![Page 16: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/16.jpg)
[Diagram: a few HUMANs in front of many COMPUTERs, each COMPUTER paired with a DISK]
![Page 17: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/17.jpg)
[Diagram: each DISK carries a local file system (FS: btrfs, xfs, or ext4) with an OSD on top; three monitors (M) sit alongside]
![Page 18: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/18.jpg)
Monitors:
• Maintain cluster membership and state
• Provide consensus for distributed decision-making
• Small, odd number
• These do not serve stored objects to clients
Object Storage Daemons (OSDs):
• At least three in a cluster
• One per disk or RAID group
• Serve stored objects to clients
• Intelligently peer to perform replication tasks
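The "small, odd number" guidance for monitors can be illustrated with a little arithmetic. The sketch below assumes the monitors reach consensus by strict majority vote; the slide only says "consensus", so the majority rule and the function names are illustrative assumptions rather than a description of the monitor code.

```python
def quorum_size(num_monitors: int) -> int:
    """Smallest strict majority of the monitor set (assumed voting rule)."""
    if num_monitors < 1:
        raise ValueError("need at least one monitor")
    return num_monitors // 2 + 1


def tolerated_failures(num_monitors: int) -> int:
    """Monitors that can fail while a strict majority still survives."""
    return num_monitors - quorum_size(num_monitors)


def summarize(max_monitors: int = 7) -> list:
    """One (monitors, quorum, tolerated failures) row per cluster size."""
    return [(n, quorum_size(n), tolerated_failures(n))
            for n in range(1, max_monitors + 1)]
```

Under this assumption, going from three monitors to four raises the quorum size without raising the number of failures survived, which is one way to read the preference for odd counts.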
![Page 19: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/19.jpg)
[Diagram: a HUMAN and three monitors (M)]
![Page 20: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/20.jpg)
data distribution
● all objects are replicated N times
● objects are automatically placed, balanced, migrated in a dynamic cluster
● must consider physical infrastructure
  ● ceph-osds on hosts in racks in rows in data centers
● three approaches
  ● pick a spot; remember where you put it
  ● pick a spot; write down where you put it
  ● calculate where to put it, where to find it
![Page 21: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/21.jpg)
CRUSH
• Pseudo-random placement algorithm
  • Fast calculation, no lookup
  • Repeatable, deterministic
• Ensures even distribution
• Stable mapping
  • Limited data migration
• Rule-based configuration
  • specifiable replication
  • infrastructure topology aware
  • allows weighting
![Page 22: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/22.jpg)
[Diagram: objects are grouped into placement groups (PGs), which are then mapped onto OSDs]
hash(object name) % num pg
CRUSH(pg, cluster state, policy)
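The two-step lookup above (hash the object name into a placement group, then compute where that group lives) can be sketched in a few lines. This is not the CRUSH algorithm: step two is replaced here by weighted rendezvous hashing, a much simpler technique that shares the properties listed on the CRUSH slide (no lookup table, repeatable, weight-aware, limited movement when the cluster changes) but knows nothing about racks, rows, or rules. All function names and the `osd.N` labels are illustrative assumptions.

```python
import hashlib
import math


def stable_hash(text: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def object_to_pg(object_name: str, num_pgs: int) -> int:
    """Step 1 from the slide: hash(object name) % num pg."""
    return stable_hash(object_name) % num_pgs


def place_pg(pg: int, osd_weights: dict, replicas: int) -> list:
    """Step 2 stand-in: weighted rendezvous hashing, NOT real CRUSH."""
    scored = []
    for osd, weight in osd_weights.items():
        if weight <= 0:
            continue  # treat weight 0 as "not eligible for data"
        # Map the hash to a float strictly between 0 and 1.
        u = ((stable_hash(f"{pg}:{osd}") >> 12) + 0.5) / 2**52
        # log(u) is negative, so the score is positive and grows with weight.
        scored.append((-weight / math.log(u), osd))
    scored.sort(reverse=True)
    return [osd for _, osd in scored[:replicas]]


def locate(object_name: str, num_pgs: int, osd_weights: dict, replicas: int = 3):
    """Any client holding the same inputs computes the same answer."""
    pg = object_to_pg(object_name, num_pgs)
    return pg, place_pg(pg, osd_weights, replicas)
```

Because the result depends only on the object name, the PG count, and the weight map, removing an OSD that does not hold a given PG leaves that PG's placement untouched, which is the "stable mapping" idea in miniature.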
![Page 23: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/23.jpg)
![Page 24: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/24.jpg)
RADOS
● monitors publish osd map that describes cluster state
  ● ceph-osd node status (up/down, weight, IP)
  ● CRUSH function specifying desired data distribution
● object storage daemons (OSDs)
  ● safely replicate and store object
  ● migrate data as the cluster changes over time
  ● coordinate based on shared view of reality
● decentralized, distributed approach allows
  ● massive scales (10,000s of servers or more)
  ● the illusion of a single copy with consistent behavior
![Page 25: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/25.jpg)
[Diagram: a CLIENT facing the cluster, marked "??"]
![Page 26: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/26.jpg)
![Page 27: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/27.jpg)
![Page 28: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/28.jpg)
[Diagram: a CLIENT facing the cluster, marked "??"]
![Page 29: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/29.jpg)
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift
APP    APP    HOST/VM    CLIENT
![Page 30: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/30.jpg)
[Diagram: an APP linked with LIBRADOS talks to the cluster, including its monitors (M), over the native protocol]
![Page 31: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/31.jpg)
LIBRADOS
• Provides direct access to RADOS for applications
• C, C++, Python, PHP, Java
• No HTTP overhead
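As a rough idea of what direct access looks like from Python, the sketch below writes an object, tags it with an attribute, and reads it back through the `rados` bindings. The configuration path, the pool name `data`, and the object name are assumptions for illustration, and `main()` needs a reachable cluster; the helper `store_and_fetch` is a hypothetical wrapper, not part of the library.

```python
def store_and_fetch(ioctx, name: str, data: bytes) -> bytes:
    """Write an object, tag it with an attribute, and read it back."""
    ioctx.write_full(name, data)             # replace the object's contents
    ioctx.set_xattr(name, "version", b"12")  # small per-object attribute
    return ioctx.read(name)                  # read from the start of the object


def main() -> None:
    import rados  # Python bindings for librados; requires a running cluster

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")  # assumed path
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx("data")  # assumed pool name
        try:
            print(store_and_fetch(ioctx, "greeting", b"hello rados"))
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
```

Calling `main()` on a host with cluster access and credentials would exercise the whole path; there is no HTTP layer in between, only the library and the cluster.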
![Page 32: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/32.jpg)
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift
APP    APP    HOST/VM    CLIENT
![Page 33: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/33.jpg)
[Diagram: APPs speak REST to RADOSGW instances; each RADOSGW uses LIBRADOS to talk natively to the cluster, including its monitors (M)]
![Page 34: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/34.jpg)
RADOS Gateway:
• REST-based interface to RADOS
• Supports buckets, accounting
• Compatible with S3 and Swift applications
![Page 35: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/35.jpg)
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift
APP    APP    HOST/VM    CLIENT
RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
![Page 36: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/36.jpg)
[Diagram: a cluster of COMPUTERs, each paired with a DISK]
![Page 37: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/37.jpg)
[Diagram: VMs on top of a cluster of COMPUTERs, each paired with a DISK]
![Page 38: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/38.jpg)
[Diagram: a VM runs inside a VIRTUALIZATION CONTAINER that uses LIBRBD on top of LIBRADOS to reach the cluster, including its monitors (M)]
![Page 39: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/39.jpg)
[Diagram: two CONTAINERs, each with LIBRBD on top of LIBRADOS, attached to the same cluster; a VM sits in one of them]
![Page 40: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/40.jpg)
[Diagram: a HOST reaches the cluster, including its monitors (M), through KRBD (KERNEL MODULE); the LIBRADOS label also appears]
![Page 41: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/41.jpg)
RADOS Block Device:
• Storage of virtual disks in RADOS
• Decouples VMs and containers
  • Live migration!
• Images are striped across the cluster
• Snapshots!
• Support in
  • Qemu/KVM
  • OpenStack, CloudStack
  • Mainline Linux kernel
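"Images are striped across the cluster" means a virtual disk is cut into many objects, each of which is placed independently. The sketch below shows the arithmetic under stated assumptions: a fixed 4 MiB object size and a made-up naming scheme (prefix plus zero-padded hex index). Neither detail comes from the slides, and the real on-disk naming may differ.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # assumed object size (4 MiB)


def object_name(image_prefix: str, index: int) -> str:
    """Hypothetical naming scheme: image prefix plus a zero-padded hex index."""
    return f"{image_prefix}.{index:012x}"


def map_extent(image_prefix: str, offset: int, length: int,
               object_size: int = OBJECT_SIZE) -> list:
    """Split an image byte range into (object name, offset in object, length)."""
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    pieces = []
    end = offset + length
    while offset < end:
        index, within = divmod(offset, object_size)
        chunk = min(object_size - within, end - offset)
        pieces.append((object_name(image_prefix, index), within, chunk))
        offset += chunk
    return pieces
```

A read or write that crosses an object boundary turns into several independent object operations, so a busy image spreads its I/O over many OSDs instead of one.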
![Page 42: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/42.jpg)
HOW DO YOU SPIN UP THOUSANDS OF VMs INSTANTLY AND EFFICIENTLY?
![Page 43: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/43.jpg)
[Diagram: instant copy. One image of size 144 plus four clones of size 0: 144 + 0 + 0 + 0 + 0 = 144]
![Page 44: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/44.jpg)
[Diagram: CLIENT writes land in a clone, which grows to 4: 144 + 4 = 148]
![Page 45: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/45.jpg)
[Diagram: CLIENT reads from the clone; total stored is still 144 + 4 = 148]
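The clone arithmetic on these slides (144 stored for the original, 0 for each fresh clone, 148 after four units are written) is the usual copy-on-write pattern. The toy model below is a sketch of that idea only: the `Image` class is hypothetical, it tracks whole blocks in a dictionary, and it assumes the parent is never modified after cloning. It does not describe how RBD implements layering.

```python
class Image:
    """A block image that stores only the blocks written to it directly."""

    def __init__(self, parent=None):
        self.parent = parent  # image this one was cloned from, if any
        self.blocks = {}      # block number -> bytes written to this image

    def clone(self):
        """Instant copy: the clone starts empty and points at its parent."""
        return Image(parent=self)

    def write(self, block: int, data: bytes) -> None:
        """Writes land in this image only; the parent is left untouched."""
        self.blocks[block] = data

    def read(self, block: int):
        """Read from this image, falling back to ancestors for missing blocks."""
        image = self
        while image is not None:
            if block in image.blocks:
                return image.blocks[block]
            image = image.parent
        return None  # never written anywhere in the chain

    def stored_blocks(self) -> int:
        """Blocks physically held by this image (excluding ancestors)."""
        return len(self.blocks)
```

With a 144-block parent and four clones, total storage stays at 144 until a clone is written; four block writes to one clone bring it to 148, and the other clones still read the parent's data.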
![Page 46: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/46.jpg)
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift
APP    APP    HOST/VM    CLIENT
![Page 47: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/47.jpg)
[Diagram: a CLIENT exchanges data and metadata with the cluster over separate paths; monitors (M) shown alongside]
![Page 48: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/48.jpg)
[Diagram: the cluster with its three monitors (M)]
![Page 49: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/49.jpg)
Metadata Server
• Manages metadata for a POSIX-compliant shared filesystem
  • Directory hierarchy
  • File metadata (owner, timestamps, mode, etc.)
• Stores metadata in RADOS
• Does not serve file data to clients
• Only required for shared filesystem
![Page 50: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/50.jpg)
one tree
three metadata servers
??
![Page 51: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/51.jpg)
![Page 52: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/52.jpg)
![Page 53: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/53.jpg)
![Page 54: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/54.jpg)
![Page 55: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/55.jpg)
DYNAMIC SUBTREE PARTITIONING
![Page 56: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/56.jpg)
recursive accounting
● ceph-mds tracks recursive directory stats
  ● file sizes
  ● file and directory counts
  ● modification time
● virtual xattrs present full stats
● efficient
$ ls -alSh | head
total 0
drwxr-xr-x 1 root root 9.7T 2011-02-04 15:51 .
drwxr-xr-x 1 root root 9.7T 2010-12-16 15:06 ..
drwxr-xr-x 1 pomceph pg4194980 9.6T 2011-02-24 08:25 pomceph
drwxr-xr-x 1 mcg_test1 pg2419992 23G 2011-02-02 08:57 mcg_test1
drwx--x--- 1 luko adm 19G 2011-01-21 12:17 luko
drwx--x--- 1 eest adm 14G 2011-02-04 16:29 eest
drwxr-xr-x 1 mcg_test2 pg2419992 3.0G 2011-02-02 09:34 mcg_test2
drwx--x--- 1 fuzyceph adm 1.5G 2011-01-18 10:46 fuzyceph
drwxr-xr-x 1 dallasceph pg275 596M 2011-01-14 10:06 dallasceph
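The directory sizes in the listing are totals for everything beneath each directory. A minimal in-memory sketch of the idea follows, assuming totals are pushed up to every ancestor as soon as something changes; the slides do not say how ceph-mds maintains them, so the eager update, the `Dir` class, and its method names are illustrative. The field names are modeled on the `ceph.dir.rbytes`, `ceph.dir.rfiles`, and `ceph.dir.rsubdirs` virtual xattrs.

```python
class Dir:
    """A directory that tracks recursive totals for everything beneath it."""

    def __init__(self, name: str, parent=None):
        self.name = name
        self.parent = parent
        self.rbytes = 0    # bytes in all files at or below this directory
        self.rfiles = 0    # number of files at or below this directory
        self.rsubdirs = 0  # number of directories below this directory
        if parent is not None:
            parent._propagate(0, 0, 1)  # every ancestor gains one subdirectory

    def _propagate(self, dbytes: int, dfiles: int, dsubdirs: int) -> None:
        """Apply a change to this directory and every ancestor up to the root."""
        node = self
        while node is not None:
            node.rbytes += dbytes
            node.rfiles += dfiles
            node.rsubdirs += dsubdirs
            node = node.parent

    def add_file(self, size: int) -> None:
        """Record a new file of the given size directly in this directory."""
        self._propagate(size, 1, 0)
```

With totals kept this way, answering "how big is this subtree?" is a single read rather than a full walk, which is why a plain `ls` can show them. On a CephFS mount the stats are read as extended attributes, for example with `os.getxattr(path, "ceph.dir.rbytes")`.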
![Page 57: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/57.jpg)
snapshots
● volume or subvolume snapshots unusable at petabyte scale
  ● snapshot arbitrary subdirectories
● simple interface
  ● hidden '.snap' directory
  ● no special tools
$ mkdir foo/.snap/one # create snapshot
$ ls foo/.snap
one
$ ls foo/bar/.snap
_one_1099511627776 # parent's snap name is mangled
$ rm foo/myfile
$ ls -F foo
bar/
$ ls -F foo/.snap/one
myfile bar/
$ rmdir foo/.snap/one # remove snapshot
![Page 58: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/58.jpg)
multiple protocols, implementations
● Linux kernel client
  ● mount -t ceph 1.2.3.4:/ /mnt
  ● export (NFS), Samba (CIFS)
● ceph-fuse
● libcephfs.so
  ● your app
  ● Samba (CIFS)
  ● Ganesha (NFS)
  ● Hadoop (map/reduce)
[Diagram: client stacks side by side: the kernel client; ceph-fuse on libcephfs; your app on libcephfs; Samba on libcephfs serving SMB/CIFS; Ganesha on libcephfs serving NFS; Hadoop on libcephfs]
![Page 59: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/59.jpg)
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift
APP    APP    HOST/VM    CLIENT
CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
NEARLY AWESOME
AWESOME
AWESOME
AWESOME
AWESOME
![Page 60: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/60.jpg)
current status
● argonaut stable release v0.48
  ● rados, RBD, radosgw
● bobtail stable release v0.55
  ● RBD cloning
  ● improved performance, scaling, failure behavior
  ● radosgw API, performance improvements
  ● freeze in ~1 week, release in ~4 weeks
![Page 61: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/61.jpg)
roadmap
● file system
  ● pivot in engineering focus
  ● CIFS (Samba), NFS (Ganesha), Hadoop
● RBD
  ● Xen integration, iSCSI
● radosgw
  ● Keystone integration
● RADOS
  ● geo-replication
  ● PG split
![Page 62: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/62.jpg)
why we do this
● limited options for scalable open source storage
● proprietary solutions
  ● expensive
  ● don't scale (well or out)
  ● marry hardware and software
● users hungry for alternatives
  ● scalability
  ● cost
  ● features
![Page 63: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/63.jpg)
two fields
● green: cloud, big data
  ● incumbents don't have a viable solution
  ● most players can't afford to build their own
  ● strong demand for open source solutions
● brown: traditional SAN, NAS; enterprise
  ● incumbents struggle to scale out
  ● can't compete on price with open solutions
![Page 64: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/64.jpg)
licensing
● <yawn>
● promote adoption
● enable community development
● prevent ceph from becoming proprietary
● allow organic commercialization
![Page 65: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/65.jpg)
ceph license
● LGPLv2
  ● “copyleft”
    – free distribution
    – allow derivative works
    – changes you distribute/sell must be shared
  ● ok to link to proprietary code
    – allow proprietary products to include and build on ceph
    – does not allow proprietary derivatives of ceph
![Page 66: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/66.jpg)
fragmented copyright
● we do not require copyright assignment from contributors
  ● no single person or entity owns all of ceph
  ● no single entity can make ceph proprietary
● strong community
  ● many players make ceph a safe technology bet
  ● project can outlive any single business
![Page 67: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/67.jpg)
why it's important
● ceph is an ingredient
  ● we need to play nice in a larger ecosystem
  ● community will be key to ceph's success
● truly open source solutions are disruptive
  ● open is a competitive advantage
    – frictionless integration with projects, platforms, tools
    – freedom to innovate on protocols
    – leverage community testing, development resources
    – open collaboration is an efficient way to build technology
![Page 68: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/68.jpg)
who we are
● Ceph created at UC Santa Cruz (2004-2007)
● supported by DreamHost (2008-2011)
● Inktank (2012)
  ● Los Angeles, Sunnyvale, San Francisco, remote
● growing user and developer community
  ● Linux distros, users, cloud stacks, SIs, OEMs
http://ceph.com/
![Page 71: Ceph Day Nov 2012 - Sage Weil](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54809b0db4795941578b46dd/html5/thumbnails/71.jpg)
why we like btrfs
● pervasive checksumming
● snapshots, copy-on-write
● efficient metadata (xattrs)
● inline data for small files
● transparent compression
● integrated volume management
  ● software RAID, mirroring, error recovery
  ● SSD-aware
● online fsck
● active development community