Rangoli: Space Management in Deduped Environments
P.C. Nagesh and Atish Kathpal
Advanced Technology Group, NetApp, India
Outline
What is the space management problem?
Intuition behind our solutions
Evaluation and Summary
Space management objectives
[Figure: cluster architecture depiction from OpenStack, with back-end volumes as data containers; one volume is flagged "Low Free Space"]
Ensure adequate free space on volumes
Space management problem
[Figure: logical view vs. physical view of a deduped volume, with volume metadata mapping each data block to its reference count (e.g. 6, 9, 10, 12)]
Freeing up a deduped volume is hard!
Illustrative example: how do you reclaim 50 GB of free space?
[Figure: logical view of six files (1-6) sharing deduped blocks]

Which files to move?   Space reclamation
6                      10
2, 4                   10
1, 2, 3                21
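The arithmetic behind such a table can be sketched in a few lines. This is a minimal illustration with a hypothetical block layout (the `file_blocks` map and the block values below are invented, not the slide's actual data); the key point is that a deduped block is reclaimed only when every file referencing it leaves the volume together.

```python
def reclaimed_space(file_blocks, migrate, block_size_gb=1):
    """Physical space freed on the source volume when the files in
    `migrate` move off it together: a deduped block is freed only if
    *every* file referencing it is in the migrated set."""
    migrate = set(migrate)
    # Map each block to the set of files referencing it.
    refs = {}
    for f, blocks in file_blocks.items():
        for b in blocks:
            refs.setdefault(b, set()).add(f)
    return sum(block_size_gb
               for owners in refs.values()
               if owners <= migrate)

# Hypothetical layout: six files sharing 1 GB blocks on one volume.
file_blocks = {
    1: {"a", "b", "c"},
    2: {"c", "d"},
    3: {"a", "e"},
    4: {"d", "f"},
    5: {"f", "g"},
    6: {"h", "i"},
}
print(reclaimed_space(file_blocks, {6}))        # -> 2 (blocks h, i were private)
print(reclaimed_space(file_blocks, {2, 4}))     # -> 1 (only d; c and f stay shared)
print(reclaimed_space(file_blocks, {1, 2, 3}))  # -> 4 (a, b, c, e)
```

Note how non-obvious the best choice is: moving files 2 and 4 frees less than their logical size suggests, because blocks c and f remain pinned by files left behind.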
Outline
What is the space management problem?
Intuition behind our solutions
Evaluation and Summary
Intuitive solutions and alternatives
[Figure: six files (1-6) connected by shared deduped blocks]
Naïve-du, a dedupe-unaware strategy: migrate the files with the most unique content. Yields low space reclamation and too many unnecessary side effects.
Intuitive solution: move shared files together; pick good "migration bins".
Side effects
Physical space bloat (PSB): the percentage increase in physical space consumption due to loss of disk sharing.
Migration utility: the amount of reclaimed space per 100 bytes of data transfer, a measure of bandwidth wastage.
It is a source-centric strategy.
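The Naïve-du baseline above can be sketched as follows, under the assumption (an illustration, not the deck's actual algorithm) that it greedily ranks files by their unique, unshared content and migrates the top of the ranking until the target is met:

```python
def naive_du(file_blocks, target):
    """Naive-du baseline sketch: counts each file's unique blocks and
    greedily migrates the files with the most unshared content until
    `target` blocks are reclaimed.  It ignores sharing entirely, which
    is why it loses to sharing-aware binning."""
    # Count how many files reference each block.
    refs = {}
    for blocks in file_blocks.values():
        for b in blocks:
            refs[b] = refs.get(b, 0) + 1
    # A file's unique content = blocks referenced by it alone.
    unique = {f: sum(1 for b in blocks if refs[b] == 1)
              for f, blocks in file_blocks.items()}
    picked, reclaimed = [], 0
    for f in sorted(unique, key=unique.get, reverse=True):
        if reclaimed >= target:
            break
        picked.append(f)
        reclaimed += unique[f]
    return picked

# Hypothetical six-file layout (invented for illustration).
file_blocks = {
    1: {"a", "b", "c"}, 2: {"c", "d"}, 3: {"a", "e"},
    4: {"d", "f"}, 5: {"f", "g"}, 6: {"h", "i"},
}
print(naive_du(file_blocks, 2))  # -> [6]: blocks h and i belong to file 6 alone
```

Because shared blocks contribute nothing to a file's "unique" score, this strategy tends to break sharing relationships when it does move shared files, causing the physical space bloat described above.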
Rangoli: Solution overview
1. Compute disk sharing relationships
– Graphical representation
2. Identify groups of highly shared files
– Good migration bins
3. Compute and report the exact metrics
– PSB and Migration Utility
Output the best migration bins (combined with any higher-level logic)
NetApp Confidential - Internal Use Only
[Figure: files partitioned into migration bins {1,2,3}, {6}, {4,5}]

Fingerprint database (FPDB):
Inode   FBN   Fingerprint
1       3     a23b1234
2       5     234c1234
Outline
What is the space management problem?
Intuition behind our solutions
Evaluation and Summary
Evaluation
Evaluation objectives: comparison against alternative strategies
Datasets from diverse workloads:
VM images
Home directories
Engineering document repositories
Migration Utility (VMDK dataset)
Higher is better: more space reclamation per unit of data migration.
[Chart: migration utility (%) vs. space reclamation (%) at 1, 5, 10, and 20% reclamation, comparing Naïve-du, MinHash, and Rangoli]
Physical space bloat (Debian dataset)
Lower is better: a smaller percentage increase in physical space consumption.
[Chart: PSB (%) vs. space reclamation (%) at 5, 10, 20, and 30% reclamation, comparing Naïve-du, MinHash, and Rangoli]
Summary
Inferences:
– Rangoli offers a scalable solution for space reclamation in deduped environments
– Better than the alternatives by up to 35x in some cases
Future work:
– Explore destination-aware strategies
– Combine space reclamation with other desired features such as load balancing and performance considerations
Acknowledgements
Our thanks to Gaurav Makkar, Kaladhar Voruganti, Kiran Srinivasan, Parag Deshmukh, and our anonymous reviewers for their many insights and valuable feedback.
This work was done as part of an Independent Research Project at the Advanced Technology Group, NetApp, Bangalore, India.
Scalability
28 minutes of end-to-end running time for the largest dataset tested
– Synthetic dataset of 4 TB with 12 million files and 85% dedupe
– Running times measured on a laptop-grade machine
Solution details
Step 1: FPDB processing
Algorithm:
– A linear scan of the fingerprint-sorted FPDB
– Output is the bipartite graph
– Time is linearly proportional to the dataset size

Inode   FBN   Fingerprint
1       3     a23b12349870
2       5     a23b12349870

[Figure: the resulting graph of file IDs (1-6), edges weighted by amount of disk sharing]
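The linear FPDB scan can be sketched as below. The row layout `(inode, fbn, fingerprint)` mirrors the table above, but the grouping logic and the sample fingerprints are illustrative assumptions, not the slide's implementation:

```python
from itertools import combinations, groupby

def build_sharing_graph(fpdb):
    """One linear pass over a fingerprint-sorted FPDB of
    (inode, fbn, fingerprint) rows.  Rows sharing a fingerprint are
    duplicates of one physical block, so each pair of distinct files
    among them gains one unit of edge weight (disk sharing)."""
    edges = {}
    for _, rows in groupby(fpdb, key=lambda r: r[2]):
        files = sorted({inode for inode, _, _ in rows})  # dedupe intra-file hits
        for a, b in combinations(files, 2):
            edges[(a, b)] = edges.get((a, b), 0) + 1
    return edges

# Illustrative rows; fingerprints are made up.
fpdb = sorted([
    (1, 3, "a23b12349870"),
    (2, 5, "a23b12349870"),   # files 1 and 2 share this block
    (1, 4, "234c12340000"),
    (3, 0, "234c12340000"),   # files 1 and 3 share this block
], key=lambda r: r[2])
print(build_sharing_graph(fpdb))
```

Since `groupby` only batches adjacent equal keys, the FPDB must already be fingerprint-sorted, which is exactly why a single linear scan suffices and the running time grows linearly with dataset size.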
Step 2: Migration binning
Algorithm:
– Seek partitions with minimal edge cuts
– Offers good, but not necessarily optimal, partitions
– Time depends on the complexity of disk sharing in the dataset, given by the number of edges
– Weighted quick union-find with path compression as the data structure for bin management

[Figure: the file-sharing graph of files 1-6 partitioned into migration bins]
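The union-find structure named above might look like the following sketch; the greedy merge order in `migration_bins` (a name invented here) is a hypothetical stand-in for the deck's minimal-edge-cut partitioning, kept only to show how bins are managed:

```python
class UnionFind:
    """Weighted quick-union with path compression (halving variant),
    as suggested on the slide for managing migration bins."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:  # weighting: attach small under large
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]

def migration_bins(n_files, edges):
    """Hypothetical greedy stand-in for minimal-edge-cut partitioning:
    merge file pairs into bins, most heavily shared pairs first."""
    uf = UnionFind(n_files)
    for (a, b), _ in sorted(edges.items(), key=lambda e: -e[1]):
        uf.union(a, b)
    bins = {}
    for f in range(n_files):
        bins.setdefault(uf.find(f), []).append(f)
    return list(bins.values())

# Files 0-4; files 0/1 and 2/3 share blocks, file 4 is standalone.
print(migration_bins(5, {(0, 1): 5, (2, 3): 3}))  # -> [[0, 1], [2, 3], [4]]
```

With both weighting and compression, each find/union is nearly constant amortized time, so the cost is dominated by iterating the edges, matching the slide's claim that running time tracks the amount of disk sharing.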
Step 3: Compute the metrics
Algorithm:
– Disk sharing within a bin contributes to savings in data migration
– Data sharing across migration bins contributes to losses
We can compute the metrics for any arbitrary bins too.
The metrics computed are actuals, not estimates.

[Figure: files 1-6 with sharing edges within and across migration bins]
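Because both savings (intra-bin sharing) and losses (cross-bin sharing) follow directly from block reference counts, per-bin metrics can be computed exactly rather than estimated. A minimal sketch with a hypothetical formulation, expressing bloat as a block count and utility as reclaimed space per 100 units transferred:

```python
def bin_metrics(file_blocks, bin_files, block_size=1):
    """Exact metrics for migrating one bin off a deduped volume
    (hypothetical formulation).  Sharing *within* the bin is a saving:
    shared blocks are transferred once.  Sharing *across* the bin
    boundary is a loss: those blocks end up on both volumes (bloat)."""
    bin_files = set(bin_files)
    refs = {}
    for f, blocks in file_blocks.items():
        for b in blocks:
            refs.setdefault(b, set()).add(f)
    moved = {b for f in bin_files for b in file_blocks[f]}
    reclaimed = sum(block_size for b in moved if refs[b] <= bin_files)
    bloat = sum(block_size for b in moved if not refs[b] <= bin_files)
    transferred = len(moved) * block_size  # dedupe preserved inside the bin
    utility = 100.0 * reclaimed / transferred if transferred else 0.0
    return reclaimed, bloat, utility

# Same hypothetical six-file layout as the earlier sketches.
file_blocks = {
    1: {"a", "b", "c"}, 2: {"c", "d"}, 3: {"a", "e"},
    4: {"d", "f"}, 5: {"f", "g"}, 6: {"h", "i"},
}
print(bin_metrics(file_blocks, {1, 2, 3}))  # -> (4, 1, 80.0)
```

In this toy run, moving bin {1, 2, 3} frees four blocks, duplicates one block (d, still referenced by file 4), and reclaims 80 units of space per 100 transferred.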
Datasets and experimental details

Dataset          Size     No. of files   Dedupe
Home Directory   74 GB    78 K           49%
Debian           261 GB   448            59%
VMDK             2.4 TB   2.4 K          62%
EngWebBurt       1.3 TB   4 M            51%
Synthetic 1      2.6 TB   8 M            77%
Synthetic 2      4 TB     12 M           85%

Real-world datasets from different workloads
Synthetic datasets for testing scalability
Physical space bloat
[Chart omitted]
Migration Utility
[Chart omitted]
Scalability

Dataset          Size     No. of files   Dedupe   Total running time (parallel mode)
Home Directory   74 GB    78 K           49%      24 sec
Debian           261 GB   448            59%      1 min
VMDK             2.4 TB   2.4 K          62%      13 min
EngWebBurt       1.3 TB   4 M            51%      9 min
Synthetic 1      2.6 TB   8 M            77%      20 min
Synthetic 2      4 TB     12 M           85%      28 min

Scales to large datasets
– ~30 minutes for a 4 TB, 12-million-file dataset with 85% dedupe