![Page 1: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/1.jpg)
CSCI 333 – Spring 2020, Williams College
Deduplication: Overview & Case Studies
![Page 2: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/2.jpg)
Lecture Outline
Content Addressable Storage (CAS)
Deduplication
Chunking
The Index
Background
Other CAS applications
![Page 4: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/4.jpg)
Content Addressable Storage (CAS)
Deduplication systems often rely on Content Addressable Storage (CAS)
Data is indexed by some content identifier
The content identifier is determined by some function over the data itself - often a cryptographically strong hash function
![Page 5: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/5.jpg)
CAS
Example: I send a document to be stored remotely on some content addressable storage
![Page 6: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/6.jpg)
CAS
Example: The server receives the document, and calculates a unique identifier called the data's fingerprint
![Page 7: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/7.jpg)
CAS
The fingerprint should be:
unique to the data - NO collisions
one-way - hard to invert
![Page 8: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/8.jpg)
CAS
The fingerprint should be:
unique to the data - NO collisions
one-way - hard to invert
SHA-1:
20 bytes (160 bits)
$P(\mathrm{collision}(a,b)) = (1/2)^{160}$
$\mathrm{coll}(N, 2^{160}) = \binom{N}{2} (1/2)^{160}$
About $10^{24}$ objects before it is more likely than not that a collision has occurred
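As a rough check on that object count, the sketch below evaluates the pairwise estimate $\binom{N}{2}(1/2)^{160}$ with exact integers. It is a union-bound style approximation rather than the exact birthday probability, and the helper name is hypothetical.

```python
import math

BITS = 160  # SHA-1 digest size in bits

def pairwise_collision_estimate(n: int, bits: int = BITS) -> float:
    """Approximate P(some collision among n objects) as C(n, 2) / 2**bits."""
    return math.comb(n, 2) / 2**bits

# The estimate reaches 1/2 when n * (n - 1) = 2**bits, i.e. n is about 2**80.
n_half = math.isqrt(2**BITS) + 1

print(f"{n_half:.3e}")                      # on the order of 10**24
print(pairwise_collision_estimate(n_half))  # about one half
```

Halving the digest length to 80 bits would move the threshold to about 2**40 objects, which is why short fingerprints are not used for addressing.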
![Page 9: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/9.jpg)
CAS
Example: SHA-1(homework.txt) = de9f2c7fd25e1b3a...
[Figure: the name homework.txt maps to fingerprint de9f2c7fd25e1b3a..., which maps to the data]
![Page 10: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/10.jpg)
CAS
Example: I submit my homework, and my “buddy” Harold also submits my homework...
![Page 11: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/11.jpg)
CAS
Example: Same contents, same fingerprint.
[Figure: both submissions map to fingerprint de9f2c7fd25e1b3a..., which maps to the data]
![Page 12: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/12.jpg)
CAS
Example: Same contents, same fingerprint.
The data is only stored once!
[Figure: both submissions map to fingerprint de9f2c7fd25e1b3a..., which maps to a single copy of the data]
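The homework example can be sketched as a toy in-memory store. This is a minimal illustration, assuming SHA-1 over the whole file; the class name, file names and file contents are hypothetical.

```python
import hashlib

class ContentStore:
    """Toy in-memory content-addressable store keyed by SHA-1 fingerprint."""

    def __init__(self):
        self.blobs = {}   # fingerprint (hex) -> data
        self.names = {}   # file name -> fingerprint

    def put(self, name: str, data: bytes) -> str:
        fp = hashlib.sha1(data).hexdigest()
        self.blobs.setdefault(fp, data)   # stored only if not already present
        self.names[name] = fp
        return fp

    def get(self, name: str) -> bytes:
        return self.blobs[self.names[name]]

store = ContentStore()
homework = b"Problem 1: ...\n"
fp_bill = store.put("hw-bill.txt", homework)
fp_harold = store.put("hw-harold.txt", homework)                    # same contents
fp_edited = store.put("hw-harold-v2.txt", b"Harold\n" + homework)   # one line added

print(fp_bill == fp_harold, len(store.blobs))
```

The two identical submissions share one stored blob, while the edited copy gets an unrelated fingerprint and a second blob.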
![Page 14: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/14.jpg)
CAS
Example: Now suppose Harold writes his name at the top of my document.
![Page 15: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/15.jpg)
CAS
Example: The fingerprints are completely different, despite the (mostly) identical contents.
[Figure: my document maps to fingerprint de9f2c7fd25e1b3a... and data; Harold's maps to fingerprint fad3e85a0bd17d9b... and data']
![Page 16: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/16.jpg)
CAS
Problem Statement:
What is the appropriate granularity to address our data?
What are the tradeoffs associated with this choice?
![Page 18: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/18.jpg)
Deduplication
Chunking breaks a data stream into segments
SHA1(DATA)
becomes
SHA1(CK1) + SHA1(CK2) + SHA1(CK3)
How do we divide a data stream?
How do we reassemble a data stream?
![Page 19: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/19.jpg)
Deduplication
Division.
Option 1: fixed-size blocks
- Every (?)KB, start a new chunk
Option 2: variable-size chunks
- Chunk boundaries dependent on chunk contents
![Page 20: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/20.jpg)
Deduplication
Division: fixed-size blocks
[Figure: hw-bill.txt and hw-harold.txt split into fixed-size blocks; each pair of corresponding blocks is identical (=)]
![Page 21: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/21.jpg)
Deduplication
Division: fixed-size blocks
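Suppose Harold adds his name to the top of my homework
[Figure: hw-bill.txt and hw-harold.txt split into fixed-size blocks, with "Harold" inserted at the top of hw-harold.txt; every pair of corresponding blocks now differs (≠)]
This is called the boundary shifting problem.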
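The boundary shifting problem is easy to reproduce. The sketch below is an illustration only: it assumes 4096-byte blocks and pseudo-random file contents, and the helper names are hypothetical.

```python
import hashlib
import random

BLOCK = 4096  # assumed block size in bytes

def fixed_chunks(data: bytes, size: int = BLOCK):
    """Start a new chunk every `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def fingerprints(chunks):
    return {hashlib.sha1(c).hexdigest() for c in chunks}

rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(4 * BLOCK))
exact_copy = bytes(original)
shifted_copy = b"Harold\n" + original   # a few bytes inserted at the top

fps = fingerprints(fixed_chunks(original))
shared_exact = len(fps & fingerprints(fixed_chunks(exact_copy)))
shared_shifted = len(fps & fingerprints(fixed_chunks(shifted_copy)))
print(shared_exact, shared_shifted)
```

Every block of the exact copy deduplicates, but the short insertion shifts all later block boundaries, so no block of the edited copy matches.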
![Page 22: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/22.jpg)
Deduplication
Division.
Option 1: fixed-size blocks
- Every 4KB, start a new chunk
Option 2: variable-size chunks
- Chunk boundaries dependent on chunk contents
![Page 23: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/23.jpg)
Deduplication
Division: variable-size chunks
parameters:
- Window of width w
- Target pattern t
- Slide the window byte by byte across the data, and compute a window fingerprint at each position.
- If the fingerprint matches the target, t, then we have a fingerprint match (a chunk boundary) at that position
![Page 24: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/24.jpg)
Deduplication
Division: variable-size chunks
- Slide the window byte by byte across the data, and compute a window fingerprint at each position.
- If the fingerprint matches the target, t, then we have a fingerprint match at that position
![Page 25: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/25.jpg)
Deduplication
Division: variable-size chunks
hw-wkj.txt hw-harold.txt
![Page 26: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/26.jpg)
Deduplication
Division: variable-size chunks
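Suppose Harold adds his name to the top of my homework
[Figure: hw-wkj.txt and hw-harold.txt split into variable-size chunks, with "Harold" inserted at the top of hw-harold.txt; only the first pair of chunks differs (≠)]
Only introduce one new chunk to storage.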
![Page 27: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/27.jpg)
Deduplication
Division: variable-size chunks
Sliding window properties:
- collisions are OK, but
- average chunk size should be configurable
- reuse overlapping window calculations
Rabin fingerprints
Window w, target t - expect a chunk every $2^t - 1 + w$ bytes
LBFS: w=48, t=13 - expect a chunk every 8KB
![Page 28: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/28.jpg)
Deduplication
Division: variable-size chunks
Rabin fingerprint: preselect divisor D, and an irreducible polynomial
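Arbitrary window of width w:
$R(b_1, b_2, \ldots, b_w) = (b_1 p^{w-1} + b_2 p^{w-2} + \cdots + b_w) \bmod D$
Sliding the window by one byte:
$R(b_i, \ldots, b_{i+w-1}) = \big( (R(b_{i-1}, \ldots, b_{i+w-2}) - b_{i-1} p^{w-1})\, p + b_{i+w-1} \big) \bmod D$
where $R(b_{i-1}, \ldots, b_{i+w-2})$ is the previous window calculation and $b_{i-1} p^{w-1}$ is the previous first term.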
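A minimal sketch of the sliding-window recurrence driving content-defined chunking. It assumes plain integer arithmetic with a hypothetical base `P` and divisor `D`; true Rabin fingerprints use polynomial arithmetic over GF(2), so this swaps in a simpler polynomial rolling hash of the same shape. The boundary rule (low `t_bits` bits of the fingerprint equal `target`) is also an assumption about how "matches the target" is tested.

```python
import random

P = 257               # hypothetical base
D = (1 << 61) - 1     # hypothetical divisor

def direct_fp(window: bytes) -> int:
    """(b_1 p^(w-1) + ... + b_w) mod D, computed from scratch."""
    fp = 0
    for b in window:
        fp = (fp * P + b) % D
    return fp

def rolling_fps(data: bytes, w: int):
    """Yield (end_index, fingerprint) for each full window, reusing the previous value."""
    top = pow(P, w - 1, D)                      # p^(w-1) mod D
    fp = 0
    for i, b in enumerate(data):
        if i >= w:
            fp = (fp - data[i - w] * top) % D   # subtract the previous first term
        fp = (fp * P + b) % D                   # shift by p and add the new byte
        if i >= w - 1:
            yield i, fp

def cdc_chunks(data: bytes, w: int = 48, t_bits: int = 13, target: int = 0):
    """Cut after every position whose window fingerprint has low t_bits equal to target."""
    mask = (1 << t_bits) - 1
    chunks, start = [], 0
    for end, fp in rolling_fps(data, w):
        if (fp & mask) == target:
            chunks.append(data[start:end + 1])
            start = end + 1
    if start < len(data):
        chunks.append(data[start:])             # trailing bytes after the last boundary
    return chunks

rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(1 << 16))
edited = b"Harold\n" + original
old = cdc_chunks(original, t_bits=8)   # small t_bits so this short demo yields many chunks
new = cdc_chunks(edited, t_bits=8)
print(len(old), len(set(new) - set(old)))   # total chunks vs. chunks new to storage
```

Because a boundary depends only on the bytes inside the window, every boundary of the original reappears in the edited copy, and only the chunks before the first such boundary are new.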
![Page 29: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/29.jpg)
Deduplication
Recap:
Chunking breaks a data stream into smaller segments
→ What do we gain from chunking?
→ What are the tradeoffs?
+ Finer granularity of sharing
+ Finer granularity of addressing
- Fingerprinting is an expensive operation
- Not suitable for all data patterns
- Index overhead
![Page 30: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/30.jpg)
Deduplication
Reassembling chunks:
Recipes provide directions for reconstructing files from chunks
![Page 31: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/31.jpg)
Deduplication
Reassembling chunks:
Recipes provide directions for reconstructing files from chunks
[Figure: a recipe holds metadata followed by a list of <SHA1> fingerprints, each pointing to a data block]
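A recipe can be sketched as an ordered list of fingerprints. This is a minimal illustration, assuming tiny fixed-size chunks for readability; `chunk_store`, `store_file` and `restore_file` are hypothetical names.

```python
import hashlib

chunk_store = {}  # fingerprint -> chunk bytes (stand-in for the chunk store)

def store_file(data: bytes, size: int = 4) -> list:
    """Split data into chunks, store each unique chunk once, return the recipe."""
    recipe = []
    for i in range(0, len(data), size):
        chunk = data[i:i + size]
        fp = hashlib.sha1(chunk).digest()
        chunk_store.setdefault(fp, chunk)
        recipe.append(fp)          # the recipe keeps order and repeats
    return recipe

def restore_file(recipe: list) -> bytes:
    """Follow the recipe: look up each fingerprint and concatenate the chunks."""
    return b"".join(chunk_store[fp] for fp in recipe)

data = b"abcdabcdabcdxyz"
recipe = store_file(data)
print(len(recipe), len(chunk_store))   # recipe entries vs. unique chunks stored
```

The recipe has one entry per chunk of the file, while the store holds only the distinct chunks.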
![Page 32: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/32.jpg)
CAS
Example:
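[Figure: the name homework.txt maps to fingerprint de9f2c7fd25e1b3a..., which now maps to a recipe (metadata plus a list of <SHA1> entries) or to data; the function that produces the recipe's identifier is left open: ???( )]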
![Page 34: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/34.jpg)
Deduplication
SHA-1 fingerprint uniquely identifies data, but the index translates fingerprints to chunks.
The Index:
<sha-1_1> → <chunk_1>
<sha-1_2> → <chunk_2>
<sha-1_3> → <chunk_3>
…
<sha-1_n> → <chunk_n>
<chunk_i> = {location, size?, refcount?, compressed?, ...}
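One possible shape for an index entry, as a sketch only. The field names follow the slide; treating `location` as a byte offset into an append-only log is an assumption, and `add_chunk` / `read_chunk` are hypothetical helpers.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class ChunkEntry:
    location: int            # assumed: byte offset of the chunk in an append-only log
    size: int
    refcount: int = 1
    compressed: bool = False

index = {}          # SHA-1 digest -> ChunkEntry
log = bytearray()   # stand-in for on-disk chunk storage

def add_chunk(chunk: bytes) -> bytes:
    """Store a chunk if it is new; otherwise just bump its reference count."""
    fp = hashlib.sha1(chunk).digest()
    entry = index.get(fp)
    if entry is None:
        index[fp] = ChunkEntry(location=len(log), size=len(chunk))
        log.extend(chunk)
    else:
        entry.refcount += 1
    return fp

def read_chunk(fp: bytes) -> bytes:
    """Translate a fingerprint to its chunk via the index."""
    e = index[fp]
    return bytes(log[e.location:e.location + e.size])

a = add_chunk(b"hello")
b = add_chunk(b"world")
c = add_chunk(b"hello")   # duplicate: no new bytes written
print(index[a].refcount, len(log))
```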
![Page 35: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/35.jpg)
Deduplication
The Index:
For small chunk stores:
- database, hash table, tree
For a large index, legacy data structures won't fit in main memory
- each index query requires a disk seek
- why? SHA-1 fingerprints are independent and randomly distributed
- no locality
Known as the index disk bottleneck
![Page 36: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/36.jpg)
Deduplication
The Index:
Back of the envelope:
Average chunk size: 4KB
Fingerprint: 20B
20TB unique data = 100GB SHA-1 fingerprints
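The back-of-the-envelope figure can be checked directly. The sketch assumes decimal units (1 KB = 1000 B) and counts only the fingerprints, not the per-chunk metadata stored alongside them.

```python
KB, GB, TB = 10**3, 10**9, 10**12   # decimal units (an assumption)

unique_data = 20 * TB
chunk_size = 4 * KB
fingerprint_size = 20               # bytes per SHA-1 digest

num_chunks = unique_data // chunk_size
index_bytes = num_chunks * fingerprint_size

print(num_chunks, index_bytes / GB)   # chunks, and fingerprint bytes in GB
```

With binary units (KiB, GiB, TiB) the ratio is the same, giving 100 GiB of fingerprints for 20 TiB of data.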
![Page 37: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/37.jpg)
Deduplication
Disk bottleneck:
Data Domain strategy:
- filter unnecessary lookups
- piggyback useful work onto the disk lookups that are necessary
[Figure: Summary Vector, Stream Informed Segment Layout (Containers), and Locality Preserving Cache, placed across Memory and Disk]
![Page 38: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/38.jpg)
Deduplication
Disk bottleneck:
Summary vector
- Bloom filter (any AMQ data structure works)
Filter properties:
● No false negatives
  ● if a fingerprint is in the index, it is in the summary vector
● Tuneable false positive rate
  ● We can trade memory for accuracy
[Figure: a bit array (1 0 0 1 0 1 0 1 1 1 0 1 1 0 1 1 ...) with hash functions h1, h2, h3 each selecting a bit]
Note: on a false positive, we are no worse off - We just do the disk seek we would have done anyway
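A minimal Bloom filter sketch showing both filter properties. The sizes and the way the k bit positions are derived (two values sliced from one SHA-256 digest, combined by double hashing) are assumptions made for illustration, not the Data Domain design.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 3):
        self.m = num_bits
        self.k = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: bytes):
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1   # odd step so positions vary
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: bytes) -> bool:
        # True means "maybe present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

summary = BloomFilter()
stored = [f"chunk-{i}".encode() for i in range(1000)]
for fp in stored:
    summary.add(fp)

absent = [f"other-{i}".encode() for i in range(1000)]
false_positives = sum(fp in summary for fp in absent)
print(all(fp in summary for fp in stored), false_positives)
```

Making `num_bits` larger lowers the false positive count at the cost of memory, which is the trade noted on the slide.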
![Page 39: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/39.jpg)
Deduplication
Disk bottleneck:
Data Domain strategy:
- filter unnecessary lookups
- piggyback useful work onto the disk lookups that are necessary
[Figure: Summary Vector (labeled "Bloom Filter"), Stream Informed Segment Layout (Containers), and Locality Preserving Cache, placed across Memory and Disk]
![Page 40: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/40.jpg)
Deduplication
Disk bottleneck:
Stream informed segment layout (SISL)
- variable sized chunks written to fixed size containers
- chunk descriptors are stored in a list at the head
→ “temporal locality” for hashes within a container
Principle:
- backup workloads exhibit chunk locality
![Page 41: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/41.jpg)
Deduplication
Disk bottleneck:
Data Domain strategy:
- filter unnecessary lookups
- piggyback useful work onto the disk lookups that are necessary
[Figure: Summary Vector (labeled "Bloom Filter"), Stream Informed Segment Layout (Containers) (labeled "Group Fingerprints: Temporal Locality"), and Locality Preserving Cache, placed across Memory and Disk]
![Page 42: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/42.jpg)
Deduplication
Disk bottleneck:
Locality Preserving Cache (LPC)
- LRU cache of candidate fingerprint groups
Principle:
- if you must go to disk, make it worth your while
[Figure: cached groups of descriptors (CD1-CD4, CD43-CD46, CD9-CD12, ...) shown alongside an on-disk container]
![Page 43: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/43.jpg)
Deduplication
Disk bottleneck:
START: Read request for chunk fingerprint
→ Fingerprint in Bloom filter?
  No → No lookup necessary
  Yes → Fingerprint in LPC?
    Yes → Read data from target container. END
    No → On-disk fingerprint index lookup: get container location → Prefetch fingerprints from head of target data container → Read data from target container. END
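The lookup path can be sketched with in-memory stand-ins. Everything here is a simplification under stated assumptions: a Python set plays the role of the Bloom filter (so it has no false positives), dictionaries stand in for the on-disk index and containers, and the class and method names are hypothetical.

```python
from collections import OrderedDict

class FingerprintLookup:
    def __init__(self, containers: dict, cache_slots: int = 2):
        self.containers = containers   # container id -> fingerprints listed at its head
        self.disk_index = {fp: cid for cid, fps in containers.items() for fp in fps}
        self.summary = set(self.disk_index)   # stand-in for the summary vector
        self.lpc = OrderedDict()              # container id -> fingerprint group, LRU order
        self.cache_slots = cache_slots
        self.index_lookups = 0                # counts simulated on-disk index lookups

    def locate(self, fp):
        """Return the container holding fp, or None if fp is not stored."""
        if fp not in self.summary:
            return None                       # filter says no: no lookup necessary
        hit = next((cid for cid, group in self.lpc.items() if fp in group), None)
        if hit is not None:
            self.lpc.move_to_end(hit)         # refresh LRU position
            return hit
        self.index_lookups += 1               # the expensive step
        cid = self.disk_index.get(fp)
        if cid is None:
            return None                       # would be a filter false positive
        self.lpc[cid] = set(self.containers[cid])   # prefetch the whole group
        if len(self.lpc) > self.cache_slots:
            self.lpc.popitem(last=False)      # evict the least recently used group
        return cid

lookup = FingerprintLookup({0: ["a", "b", "c", "d"], 1: ["e", "f", "g", "h"]})
found = [lookup.locate(fp) for fp in "abcd"]
print(found, lookup.index_lookups)
```

Four lookups that fall in the same container cost a single simulated index lookup, because the first miss prefetches that container's whole fingerprint group.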
![Page 44: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/44.jpg)
Deduplication
Dedup Goal: eliminate repeat instances of identical data
What (granularity) to dedup?
Where to dedup?
When to dedup?
Why dedup?
Summary: Dedup and the 4 W's
![Page 45: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/45.jpg)
Deduplication
What (granularity) to dedup?
Summary: Dedup and the 4 W's
| | Whole-file | Fixed-size | Content-defined |
|---|---|---|---|
| Chunking overheads | N/A | offsets | Sliding window fingerprinting |
| Dedup ratio | All-or-nothing | Boundary shifting problem | Best |
| Other notes | Low index overhead, compressed/encrypted/media | (Whole-file)+ Ease of implementation, selective caching, synchronization | Latency, CPU intensive |

Hybrid? Context-aware.
![Page 46: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/46.jpg)
Deduplication
Where to dedup?
Summary: Dedup and the 4 W's
source: Dedup before sending data over the network
+ save bandwidth
- client complexity
- trust clients?
destination: Dedup at storage server
+ server more powerful
- centralized data structures
hybrid: Client index checks membership, Server index stores location
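A sketch of source-side dedup as a two-step exchange: the client sends fingerprints, the server answers with the ones it lacks, and only those chunks cross the network. The class and function names are hypothetical, and the "network" is just a method call. The server rehashes what it receives rather than trusting the client's fingerprints.

```python
import hashlib

def sha1(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

class Server:
    def __init__(self):
        self.chunks = {}   # fingerprint -> chunk

    def missing(self, fps):
        """Membership check: which of these fingerprints are not stored yet?"""
        return [fp for fp in fps if fp not in self.chunks]

    def upload(self, chunks):
        for chunk in chunks:
            self.chunks[sha1(chunk)] = chunk   # recompute instead of trusting the client

def client_backup(server: Server, chunks) -> int:
    """Send only chunks the server lacks; return the number of payload bytes sent."""
    by_fp = {sha1(c): c for c in chunks}       # dedup within this backup too
    needed = server.missing(list(by_fp))
    server.upload([by_fp[fp] for fp in needed])
    return sum(len(by_fp[fp]) for fp in needed)

server = Server()
first = client_backup(server, [b"aaaa", b"bbbb", b"aaaa"])
second = client_backup(server, [b"aaaa", b"bbbb", b"cccc"])
print(first, second)   # payload bytes sent by each backup
```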
![Page 47: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/47.jpg)
Deduplication
When to dedup?
Summary: Dedup and the 4 W's
inline: Data → Dedup → Disk
+ never store duplicate data
- slower → index lookup per chunk
+ faster → save I/O for duplicate data
post-process: Data → Disk, then Dedup
- temporarily wasted storage
+ faster → stream long writes, reclaim in the background
- may create (even more) fragmentation
hybrid
→ post-processing faster for initial commits
→ switch to inline to take advantage of I/O savings
![Page 48: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/48.jpg)
Deduplication
Perhaps you have a loooooot of data...
- enterprise backups
Or data that is particularly amenable to deduplication...
- small or incremental changes
- data that is not encrypted or compressed
Or that changes infrequently.
- blocks are immutable → no such thing as a “block modify”
- rate of change determines container chunk locality
Why dedup?
Ideal use case: “Cold Storage”
![Page 49: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/49.jpg)
Deduplication
Why dedup?
Perhaps your bottleneck isn't the CPU
- Use dedup if you can favorably trade other resources
Example: Protocol Independent Technique for Eliminating Redundant Network Traffic
[Figure: two endpoints joined by a Bandwidth Constrained Link; each side shows a Shared Cache, a Packet Store (FIFO), and a Fingerprint Index]
![Page 51: Deduplication: Overview & Case Studies · Chunking overheads N/A offsets Sliding window fingerprinting Dedup Ratio All-or-nothing Boundary shifting problem Best Other notes Low index](https://reader033.vdocuments.mx/reader033/viewer/2022042712/5f8afdb01a2e9c7e0c3aad8e/html5/thumbnails/51.jpg)
Other CAS Applications
Data verification
Insight: Fingerprints uniquely identify data
- hash before storing data, and save the fp locally
- rehash data and compare fps upon receipt
CAS can be used to build tamper evident storage. Suppose that:
- you can't fix a compromised server,
- but you never want to be fooled by one
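The verification idea can be sketched in a few lines: keep the fingerprint locally, then rehash whatever the server returns. The server class here is a hypothetical stand-in, and SHA-1 is used only because it is the hash used throughout these slides.

```python
import hashlib

def fingerprint(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

class UntrustedServer:
    """Stand-in for remote storage that may be compromised."""
    def __init__(self):
        self.blobs = {}

    def put(self, fp: str, data: bytes) -> None:
        self.blobs[fp] = data

    def get(self, fp: str) -> bytes:
        return self.blobs[fp]

def verified_get(server: UntrustedServer, fp: str) -> bytes:
    """Fetch by fingerprint and refuse data that does not hash back to it."""
    data = server.get(fp)
    if fingerprint(data) != fp:
        raise ValueError("data does not match its fingerprint")
    return data

server = UntrustedServer()
document = b"my homework"
local_fp = fingerprint(document)     # hash before storing, keep the fp locally
server.put(local_fp, document)
print(verified_get(server, local_fp) == document)

server.blobs[local_fp] = b"someone else's homework"   # simulate tampering
try:
    verified_get(server, local_fp)
    tamper_detected = False
except ValueError:
    tamper_detected = True
print(tamper_detected)
```

The client cannot repair the server, but it can always tell when the returned bytes are not the ones it stored.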