-
2011 Storage Developer Conference. © Dutch Meyer. All Rights Reserved.
A study of practical deduplication
Dutch T. Meyer, University of British Columbia (Microsoft Research intern)
William Bolosky, Microsoft Research
-
Why Dutch is Not Here
-
Why study deduplication?
$0.046 per GB
9ms per seek
$0.039 XXXX
-
When do we exploit duplicates?
It depends.
How much can you get back from deduping?
How does fragmenting files affect performance?
How often will you access the data?
-
Outline
Intro
Methodology
“There’s more here than dedup” teaser
(intermission)
Deduplication Background
Deduplication Analysis
Conclusion
-
Methodology
Per-file record: MD5(name), Metadata, MD5(data)
Scanned once per week for 4 weeks.
~875 file systems, ~40 TB, ~200M files
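The slide names three fields per file but not how the scanner collected them. As a minimal sketch only, assuming a plain directory walk and using size and modification time as stand-ins for "Metadata", the per-file record could be gathered like this (scan_tree and hash_file are hypothetical names, not the study's tool):

```python
import hashlib
import os


def hash_file(path, block_size=1 << 20):
    """Return the MD5 hex digest of a file's contents, read in blocks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()


def scan_tree(root):
    """Walk `root` and emit one record per file: MD5(name), metadata, MD5(data).

    Assumption: "metadata" is reduced to size and mtime for illustration.
    """
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            records.append({
                "name_md5": hashlib.md5(name.encode("utf-8")).hexdigest(),
                "size": st.st_size,
                "mtime": st.st_mtime,
                "data_md5": hash_file(path),
            })
    return records
```

Hashing the name rather than storing it keeps the record comparable across machines without exposing the file name itself.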
-
There’s more here than dedup!
We update and extend file system metadata findings from 2000 and 2004.
File system complexity is growing.
Read the paper to answer questions like: Are my files bigger now than they used to be?
-
Teaser: Histogram of file size
[Figure: histogram of file sizes. X-axis: File Size (bytes), power-of-two bins, 0 to 128M. Y-axis: 0% to 14%. Series: 2000, 2004, 2009. Annotation: 4K since 1981!]
-
There’s more here than dedup!
How fragmented are my files?
-
Teaser: Layout and Organization
High linearity: only 4% of files are fragmented in practice.
Most Windows machines defrag weekly.
One quarter of fragmented files have at least 170 fragments.
-
Intermission
Intro
Methodology
“There’s more here than dedup” teaser
(intermission)
Deduplication Background
Deduplication Analysis
Conclusion
-
Dedup Background
foo 01101010….. ….110010101
bar 01101010….. ….110010101
Whole file Deduplication
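To make the foo/bar picture concrete: whole-file deduplication stores a file's bytes only once per distinct content hash. The sketch below is illustrative only; whole_file_dedup is a hypothetical helper working on in-memory bytes, and MD5 is used because the methodology slide mentions it.

```python
import hashlib


def whole_file_dedup(files):
    """files: dict mapping name -> bytes.

    Returns (total_bytes, stored_bytes), where stored_bytes counts each
    distinct file content once.
    """
    seen = set()
    total = stored = 0
    for data in files.values():
        total += len(data)
        digest = hashlib.md5(data).digest()
        if digest not in seen:
            seen.add(digest)
            stored += len(data)
    return total, stored


files = {"foo": b"abc" * 100, "bar": b"abc" * 100, "baz": b"xyz"}
total, stored = whole_file_dedup(files)
print(f"space deduplicated: {1 - stored / total:.0%}")
```

Two files that differ in even one byte hash differently, so this approach recovers nothing from near-duplicates.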
-
Dedup Background
foo 01101010….. ….110010101
bar 01101010….. ….110010101
Fixed Chunk Deduplication
1
01101010…..
01101010…..
….110010101
….1100101011
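A rough sketch of fixed-chunk deduplication, under the assumption that files are cut at fixed byte offsets and each chunk is identified by its MD5 digest (fixed_chunks and fixed_chunk_dedup are hypothetical names):

```python
import hashlib


def fixed_chunks(data, chunk_size):
    """Split data into consecutive chunks of chunk_size bytes (last may be short)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def fixed_chunk_dedup(files, chunk_size):
    """files: dict mapping name -> bytes. Returns (total_bytes, stored_bytes)."""
    seen = set()
    total = stored = 0
    for data in files.values():
        for chunk in fixed_chunks(data, chunk_size):
            total += len(chunk)
            digest = hashlib.md5(chunk).digest()
            if digest not in seen:
                seen.add(digest)
                stored += len(chunk)
    return total, stored


foo = b"A" * 8 + b"B" * 8
# Same first chunk, different second chunk: the shared chunk is stored once.
print(fixed_chunk_dedup({"foo": foo, "bar": b"A" * 8 + b"C" * 8}, 8))
# One byte inserted at the front shifts every boundary: nothing is shared.
print(fixed_chunk_dedup({"foo": foo, "bar": b"X" + foo}, 8))
```

The second call shows the weakness of fixed offsets: an insertion near the start of a file misaligns every later chunk.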
-
Dedup Background
foo 01101010….. ….110010101
bar 01101010….. ….110010101
Rabin Fingerprinting
1
110101 101010 010100
101101010…..
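The idea behind Rabin-style chunking is that chunk boundaries are chosen from the content of a small sliding window, so they move with the data when bytes are inserted. The sketch below is not a true Rabin fingerprint: it substitutes a simple polynomial rolling hash, and the window size, mask, base, and modulus are arbitrary illustrative choices (cdc_chunks is a hypothetical name).

```python
import random


def cdc_chunks(data, window=16, mask=0x3F, base=257, mod=(1 << 31) - 1):
    """Content-defined chunking with a polynomial rolling hash.

    A boundary is declared after byte i whenever the hash of the last
    `window` bytes has its low bits (selected by `mask`) all zero.
    """
    top = pow(base, window - 1, mod)  # weight of the byte leaving the window
    chunks = []
    start = 0
    h = 0
    for i, b in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) % mod  # drop the oldest byte
        h = (h * base + b) % mod                    # add the newest byte
        if i >= window - 1 and (h & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(4096))
shifted = b"\x01" + data  # one byte inserted at the front

original = cdc_chunks(data)
after_insert = set(cdc_chunks(shifted))
shared = sum(1 for c in original if c in after_insert)
print(f"{shared} of {len(original)} chunks survive the insertion")
```

Because each boundary depends only on the bytes in its window, every chunk after the first boundary reappears unchanged in the shifted copy; only the chunk containing the inserted byte is new.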
-
The Deduplication Space
Algorithm           Parameters          Cost                              Deduplication effectiveness
Whole-file          (none)              Low                               Lowest
Fixed Chunk         Chunk Size          Seeks, CPU, Complexity            Middle
Rabin fingerprints  Average Chunk Size  Seeks, More CPU, More Complexity  Highest
-
What is the relative deduplication rate of the algorithms?
-
Dedup by method and chunk size
[Figure: Space Deduplicated (0% to 100%) vs. Chunk Size (64K, 32K, 16K, 8K). Series: Whole File, Fixed-Chunk, Rabin.]
-
What if I was doing full weekly backups?
-
Backup dedup over 4 weeks
[Figure: Deduplicated Space (0% to 90%) for Whole File, Whole File + Sparse, and 8K Rabin.]
-
How does the number of filesystems influence deduplication?
-
Dedup by filesystem count
[Figure: Space Deduplicated (0% to 100%) vs. Deduplication Domain Size (file systems): 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, Whole Set. Series: Whole File, 64KB Fixed, 8KB Fixed, 64KB Rabin, 8KB Rabin.]
-
So what is filling up all this space?
-
Bytes by containing file size
[Figure: Percentage of Total Bytes (0% to 12%) vs. Containing File Size (bytes), power-of-2 bins, 1K to 256G. Series: 2000, 2004, 2009.]
-
What types of files take up disk space?
-
Disk consumption by file type
[Figure: disk consumption by file extension (0% to 60%) for 2000, 2004, and 2009. Extensions shown: dll, pdb, vhd, exe, lib, pst, pch, wma, mp3, cab, chm, iso, ø.]
-
Which of these types deduplicate well?
-
Whole-file duplicates
Extension  % of Duplicate Space  Mean File Size (bytes)  % of Total Space
dll        20%                   521K                    10%
lib        11%                   1080K                   7%
pdb        11%                   2M                      7%
(blank)    7%                    277K                    13%
exe        6%                    572K                    4%
cab        4%                    4M                      2%
msp        3%                    15M                     2%
msi        3%                    5M                      1%
iso        2%                    436M                    2%
(blank)    1%                    604K                    (blank)
-
What files make up the 20% difference between whole-file dedup with sparse-file support and more aggressive deduplication?
-
Where does fine granularity help?
[Figure: Percentage of difference vs. whole file + sparse (0% to 70%), by file extension, for 8K Fixed and 8K Rabin. Extensions shown: vhd, pch, lib, dll, obj, pdb, wma, iso, pst, ø, avhd, mo3, wim.]
-
Last plea to read the whole paper
~4x more results in the paper!
Real-world file system analysis is hard:
Eight machine-months in query processing
Requires careful simplifying assumptions
Requires heavy optimization
-
Conclusion
The benefit of fine-grained dedup is < 20%.
Potentially just a fraction of that.
Fragmentation is a manageable problem.
Read the paper for more metadata results.
We’re releasing this dataset (when I finish the anonymization).