Global deduplication for Ceph
Myoungwon Oh
SW-Defined Storage Lab
SK Telecom
Agenda
1. Why do we need global dedup?
2. Ceph deduplication design
3. Ceph extensible tier (implementation)
4. Upstream
5. Plan & issues
SK with software-defined storage
• Drivers: 5G, UHD 4K; flash devices bring high performance, low latency, and SLAs
• Ceph: scalable, available, reliable, unified interface, open platform
• High performance, low latency: all-flash Ceph!
• Contribution: QoS, deduplication, etc.
• Storage solution for: private cloud for developers, virtual desktop infrastructure
Why do we need global dedup?

A) Design comparison
[Diagram: 1) local deduplication removes duplicate "Data A" only within each OSD (A, B, C, D); 2) global deduplication removes duplicate "Data A" across all OSDs]

B) FIO workload with a deduplication ratio of 50% (32 KB block size)

               4 OSD    8 OSD    12 OSD   16 OSD
Local Dedup    15.5%    8.1%     5.5%     4.1%
Global Dedup   50%      50%      50%      50%

• Up to 40% of total storage space can be saved via deduplication (in our private cloud)
• Local dedup (at the block-device level) cannot achieve the full data reduction in cluster-wide terms
Design challenges
• Which implementation is the most appropriate for shared-nothing scale-out storage?
  § Applicable to existing source
  § Transparent to the application
  § Efficient metadata management
• How to manage dedup metadata?
• What is the most appropriate dedup method (e.g., inline or post)?
  § Performance
  § I/O cost
Design 1: Double distribution hash
• Do we need a new MDS (metadata server) for dedup?
  § A shared-nothing filesystem is scalable because there is no MDS.
  § An MDS does not fit the shared-nothing design.
  § An MDS needs additional I/Os to complete I/O requests
    § E.g., a metadata query to the MDS
  § How do we rebalance if we add an MDS?
  § Synchronization between MDSs
Design 1: Double distribution hash
• Can we implement dedup without an MDS?
  § Yes, with a CAS (content-addressable storage) pool and a double distribution hash!
  § (OID, Data) -> chunking and fingerprinting -> (OID, Offsets[], FPs[]); the FP becomes the new OID -> (FP, Data)
  § Pros
    § No central MDS
    § Applicable to existing source without major modifications
    § Transparent to the application
    § Efficient metadata management
    § Reuses the existing architecture
      § DH, recovery, rebalance, data placement
  § Cons
    § I/O redirection (needs a translation layer)
Design 2: Self-contained object for deduplication
• An external metadata structure needs additional, complex linking between the deduplication metadata and the existing scale-out storage system
• A self-contained object can be the answer
  § Dedup metadata is included in the original object

[Diagram: data client -> base tier (metadata pool) -> dedup tier (chunk pool)]
• Object foo: object size = 4 MB, chunk size = 1 MB
• foo-object (has manifest): { 0–32K: fxc039, 32–64K: Dxc045, 64–128K: fZc0y9 }
• Chunked objects in the chunk pool: OID fxc039 (reference count: 1), OID Dxc045 (reference count: 2), OID fZc0y9 (reference count: 4)
Post-processing (a single chunk)
1. Find the dirty metadata object which contains dirty chunks from the dirty object ID list.
2. Find the dirty chunk ID in the dirty metadata object's chunk map.
3. The deduplication engine generates a chunk object and sends it to the chunk pool.
4. In the chunk pool, the chunk object generated in step 3 is placed.
5. Add reference count information to the object.
6. When the chunk write at the chunk pool ends, update the metadata object's chunk map.
Implementation: Extensible tier
• The key structure for the extensible tier (stored in the object's object_info_t):

  struct object_manifest_t {
    enum {
      TYPE_NONE = 0,
      TYPE_REDIRECT = 1,
      TYPE_CHUNKED = 2,
      TYPE_DEDUP = 3,
    };
    uint8_t type;                                  // redirect, chunked, ...
    ghobject_t redirect_target;
    map<uint64_t /*offset*/, chunk_info_t> chunk_map;
  };

• Operations
  § Proxy read, write
  § Flush, promote
Implementation: write path

RBD client: Write(foo, offset, size)

Base tier (handle a write request):
1. Chunk and write the object.
2. Update the chunk_map (clean -> dirty).
• Eviction limit (chunk case): if the eviction limit is reached (all chunks are dirty), the write request is blocked until the dirty object is flushed; otherwise, the write request is handled.

Base tier (post-processing):
1. Get the dirty chunk list.
2. While the chunk list has dirty chunks: if the chunk has an old reference, decrement the old chunk's reference; then objecter->write(Chunklist[i]); i++.
3. Receive all of the acks and update the chunks' state (dirty -> clean).

Lower tier: write the chunk data and increment the reference count.

Base tier (set-redirect or set-chunk): if it is a chunk object, set_chunk(source, target); else if it is a redirect object, set_redirect(source, target). The lower tier handles set_chunk or set_redirect.
Upstream
• Proposal
  § http://marc.info/?l=ceph-devel&m=148172886923985&w=2
• Design
  § http://marc.info/?l=ceph-devel&m=148646542200947&w=2
  § Pad documents (with Sage Weil)
    • http://pad.ceph.com/p/deduplication_how_dedup_manifists
    • http://pad.ceph.com/p/deduplication_how_do_we_store_chunk
    • http://pad.ceph.com/p/deduplication_how_do_we_chunk
    • http://pad.ceph.com/p/deduplication_how_to_drive_dedup_process
• Progress
  § osd, librados: add manifest, redirect: https://github.com/ceph/ceph/pull/14894
  § osd, librados: add manifest, operations for chunked object: https://github.com/ceph/ceph/pull/15482
  § osd: flush operations for chunked objects: https://github.com/ceph/ceph/pull/19294
  § osd, librados: add a rados op (TIER_PROMOTE): https://github.com/ceph/ceph/pull/19362
  § WIP: osd: refcount for manifest object (redirect, chunked): https://github.com/ceph/ceph/pull/19935
Plan & issues
• Plan (to do)
  § Reference counting methods and data types for redirect and chunk
  § Offline fingerprinting, then storing the dedup chunked manifest (whole object or parts of it)
  § Dedup processing
    § Background dedup worker
    § Refcount manager and methods for dedup (http://pad.ceph.com/p/deduplication_how_do_we_store_chunk)
    § Fixed-size backpointers
  § Scrub
  § Test cases
• Issues
  § Small chunks (< 64 KB)
  § Minimizing performance degradation
  § Dedup methods (inline?)
  § CDC (content-defined chunking)