Rangoli: Space Management in Deduped Environments
P.C. Nagesh and Atish Kathpal
Advanced Technology Group, NetApp, India
Outline
What is the space management problem?
Intuition behind our solutions
Evaluation and Summary
Space management objectives
[Figure: cluster architecture depiction from OpenStack, with back-end volumes as data containers; one volume is flagged "Low Free Space"]
Ensure adequate free space on volumes
Space management problem
[Figure: logical view vs. physical view of a deduped volume, with volume metadata mapping each data block to its reference count (e.g. 6, 9, 10, 12)]
Freeing up a deduped volume is hard!
Illustrative example: how do you reclaim 50 GB of free space?
[Figure: logical view of six files (1-6) sharing deduped blocks]

Which files to move?   Space reclamation
6                      10
2, 4                   10
1, 2, 3                21
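The arithmetic behind such a table can be sketched in a few lines. This is a minimal illustration with a hypothetical block layout (the `file_blocks` map and the block values below are invented, not the slide's actual data); the key point is that a deduped block is reclaimed only when every file referencing it leaves the volume together.

```python
def reclaimed_space(file_blocks, migrate, block_size_gb=1):
    """Physical space freed on the source volume when the files in
    `migrate` move off it together: a deduped block is freed only if
    *every* file referencing it is in the migrated set."""
    migrate = set(migrate)
    # Map each block to the set of files referencing it.
    refs = {}
    for f, blocks in file_blocks.items():
        for b in blocks:
            refs.setdefault(b, set()).add(f)
    return sum(block_size_gb
               for owners in refs.values()
               if owners <= migrate)

# Hypothetical layout: six files sharing 1 GB blocks on one volume.
file_blocks = {
    1: {"a", "b", "c"},
    2: {"c", "d"},
    3: {"a", "e"},
    4: {"d", "f"},
    5: {"f", "g"},
    6: {"h", "i"},
}
print(reclaimed_space(file_blocks, {6}))        # -> 2 (blocks h, i were private)
print(reclaimed_space(file_blocks, {2, 4}))     # -> 1 (only d; c and f stay shared)
print(reclaimed_space(file_blocks, {1, 2, 3}))  # -> 4 (a, b, c, e)
```

Note how non-obvious the best choice is: moving files 2 and 4 frees less than their logical size suggests, because blocks c and f remain pinned by files left behind.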
Outline
What is the space management problem?
Intuition behind our solutions
Evaluation and Summary
Intuitive solutions and alternatives
[Figure: six files (1-6) connected by shared deduped blocks]
Naïve-du, a dedupe-unaware strategy: migrate the files with the most unique content. Yields low space reclamation and too many unnecessary side effects.
Intuitive solution: move shared files together; pick good "migration bins".
Side effects
Physical space bloat (PSB): the percentage increase in physical space consumption due to loss of disk sharing.
Migration utility: the amount of reclaimed space per 100 bytes of data transfer, a measure of bandwidth wastage.
It is a source-centric strategy.
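The Naïve-du baseline above can be sketched as follows, under the assumption (an illustration, not the deck's actual algorithm) that it greedily ranks files by their unique, unshared content and migrates the top of the ranking until the target is met:

```python
def naive_du(file_blocks, target):
    """Naive-du baseline sketch: counts each file's unique blocks and
    greedily migrates the files with the most unshared content until
    `target` blocks are reclaimed.  It ignores sharing entirely, which
    is why it loses to sharing-aware binning."""
    # Count how many files reference each block.
    refs = {}
    for blocks in file_blocks.values():
        for b in blocks:
            refs[b] = refs.get(b, 0) + 1
    # A file's unique content = blocks referenced by it alone.
    unique = {f: sum(1 for b in blocks if refs[b] == 1)
              for f, blocks in file_blocks.items()}
    picked, reclaimed = [], 0
    for f in sorted(unique, key=unique.get, reverse=True):
        if reclaimed >= target:
            break
        picked.append(f)
        reclaimed += unique[f]
    return picked

# Hypothetical six-file layout (invented for illustration).
file_blocks = {
    1: {"a", "b", "c"}, 2: {"c", "d"}, 3: {"a", "e"},
    4: {"d", "f"}, 5: {"f", "g"}, 6: {"h", "i"},
}
print(naive_du(file_blocks, 2))  # -> [6]: blocks h and i belong to file 6 alone
```

Because shared blocks contribute nothing to a file's "unique" score, this strategy tends to break sharing relationships when it does move shared files, causing the physical space bloat described above.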
Rangoli: Solution overview
1. Compute disk sharing relationships
– Graphical representation
2. Identify groups of highly shared files
– Good migration bins
3. Compute and report the exact metrics
– PSB and Migration Utility
Output the best migration bins (combined with any higher-level logic)
NetApp Confidential - Internal Use Only
[Figure: files partitioned into migration bins {1,2,3}, {6}, {4,5}]

Fingerprint database (FPDB):
Inode   FBN   Fingerprint
1       3     a23b1234
2       5     234c1234
Outline
What is the space management problem?
Intuition behind our solutions
Evaluation and Summary
Evaluation
Evaluation objectives: comparison against alternative strategies
Datasets from diverse workloads:
VM images
Home directories
Engineering document repositories
Migration Utility (VMDK dataset)
Higher is better: more space reclamation per unit of data migration.
[Chart: migration utility (%) vs. space reclamation (%) at 1, 5, 10, and 20% reclamation, comparing Naïve-du, MinHash, and Rangoli]
Physical space bloat (Debian dataset)
Lower is better: a smaller percentage increase in physical space consumption.
[Chart: PSB (%) vs. space reclamation (%) at 5, 10, 20, and 30% reclamation, comparing Naïve-du, MinHash, and Rangoli]
Summary
Inferences:
– Rangoli offers a scalable solution for space reclamation in deduped environments
– Better than the alternatives by up to 35x in some cases
Future work:
– Explore destination-aware strategies
– Combine space reclamation with other desired features such as load balancing and performance considerations
Acknowledgements
Our thanks to Gaurav Makkar, Kaladhar Voruganti, Kiran Srinivasan, Parag Deshmukh, and our anonymous reviewers for their many insights and valuable feedback.
This work was done as part of an Independent Research Project at the Advanced Technology Group, NetApp, Bangalore, India.
Scalability
28 minutes of end-to-end running time for the largest dataset tested
– Synthetic dataset of 4 TB with 12 million files and 85% dedupe
– Running times measured on a laptop-grade machine
Solution details
Step 1: FPDB processing
Algorithm:
– A linear scan of the fingerprint-sorted FPDB
– Output is the bipartite graph
– Time is linearly proportional to the dataset size

Inode   FBN   Fingerprint
1       3     a23b12349870
2       5     a23b12349870

[Figure: the resulting graph of file IDs (1-6), edges weighted by amount of disk sharing]
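The linear FPDB scan can be sketched as below. The row layout `(inode, fbn, fingerprint)` mirrors the table above, but the grouping logic and the sample fingerprints are illustrative assumptions, not the slide's implementation:

```python
from itertools import combinations, groupby

def build_sharing_graph(fpdb):
    """One linear pass over a fingerprint-sorted FPDB of
    (inode, fbn, fingerprint) rows.  Rows sharing a fingerprint are
    duplicates of one physical block, so each pair of distinct files
    among them gains one unit of edge weight (disk sharing)."""
    edges = {}
    for _, rows in groupby(fpdb, key=lambda r: r[2]):
        files = sorted({inode for inode, _, _ in rows})  # dedupe intra-file hits
        for a, b in combinations(files, 2):
            edges[(a, b)] = edges.get((a, b), 0) + 1
    return edges

# Illustrative rows; fingerprints are made up.
fpdb = sorted([
    (1, 3, "a23b12349870"),
    (2, 5, "a23b12349870"),   # files 1 and 2 share this block
    (1, 4, "234c12340000"),
    (3, 0, "234c12340000"),   # files 1 and 3 share this block
], key=lambda r: r[2])
print(build_sharing_graph(fpdb))
```

Since `groupby` only batches adjacent equal keys, the FPDB must already be fingerprint-sorted, which is exactly why a single linear scan suffices and the running time grows linearly with dataset size.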
Step 2: Migration binning
Algorithm:
– Seek partitions with minimal edge cuts
– Offers good, but not necessarily optimal, partitions
– Time depends on the complexity of disk sharing in the dataset, given by the number of edges
– Weighted quick union-find with path compression as the data structure for bin management

[Figure: the file-sharing graph of files 1-6 partitioned into migration bins]
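The union-find structure named above might look like the following sketch; the greedy merge order in `migration_bins` (a name invented here) is a hypothetical stand-in for the deck's minimal-edge-cut partitioning, kept only to show how bins are managed:

```python
class UnionFind:
    """Weighted quick-union with path compression (halving variant),
    as suggested on the slide for managing migration bins."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:  # weighting: attach small under large
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]

def migration_bins(n_files, edges):
    """Hypothetical greedy stand-in for minimal-edge-cut partitioning:
    merge file pairs into bins, most heavily shared pairs first."""
    uf = UnionFind(n_files)
    for (a, b), _ in sorted(edges.items(), key=lambda e: -e[1]):
        uf.union(a, b)
    bins = {}
    for f in range(n_files):
        bins.setdefault(uf.find(f), []).append(f)
    return list(bins.values())

# Files 0-4; files 0/1 and 2/3 share blocks, file 4 is standalone.
print(migration_bins(5, {(0, 1): 5, (2, 3): 3}))  # -> [[0, 1], [2, 3], [4]]
```

With both weighting and compression, each find/union is nearly constant amortized time, so the cost is dominated by iterating the edges, matching the slide's claim that running time tracks the amount of disk sharing.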
Step 3: Compute the metrics
Algorithm:
– Disk sharing within a bin contributes to savings in data migration
– Data sharing across migration bins contributes to losses
We can compute the metrics for any arbitrary bins too.
The metrics computed are actuals, not estimates.

[Figure: files 1-6 with sharing edges within and across migration bins]
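Because both savings (intra-bin sharing) and losses (cross-bin sharing) follow directly from block reference counts, per-bin metrics can be computed exactly rather than estimated. A minimal sketch with a hypothetical formulation, expressing bloat as a block count and utility as reclaimed space per 100 units transferred:

```python
def bin_metrics(file_blocks, bin_files, block_size=1):
    """Exact metrics for migrating one bin off a deduped volume
    (hypothetical formulation).  Sharing *within* the bin is a saving:
    shared blocks are transferred once.  Sharing *across* the bin
    boundary is a loss: those blocks end up on both volumes (bloat)."""
    bin_files = set(bin_files)
    refs = {}
    for f, blocks in file_blocks.items():
        for b in blocks:
            refs.setdefault(b, set()).add(f)
    moved = {b for f in bin_files for b in file_blocks[f]}
    reclaimed = sum(block_size for b in moved if refs[b] <= bin_files)
    bloat = sum(block_size for b in moved if not refs[b] <= bin_files)
    transferred = len(moved) * block_size  # dedupe preserved inside the bin
    utility = 100.0 * reclaimed / transferred if transferred else 0.0
    return reclaimed, bloat, utility

# Same hypothetical six-file layout as the earlier sketches.
file_blocks = {
    1: {"a", "b", "c"}, 2: {"c", "d"}, 3: {"a", "e"},
    4: {"d", "f"}, 5: {"f", "g"}, 6: {"h", "i"},
}
print(bin_metrics(file_blocks, {1, 2, 3}))  # -> (4, 1, 80.0)
```

In this toy run, moving bin {1, 2, 3} frees four blocks, duplicates one block (d, still referenced by file 4), and reclaims 80 units of space per 100 transferred.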
Datasets and experimental details

Dataset          Size     No. of files   Dedupe
Home Directory   74 GB    78 K           49%
Debian           261 GB   448            59%
VMDK             2.4 TB   2.4 K          62%
EngWebBurt       1.3 TB   4 M            51%
Synthetic 1      2.6 TB   8 M            77%
Synthetic 2      4 TB     12 M           85%

Real-world datasets from different workloads
Synthetic datasets for testing scalability
Physical space bloat
[Chart omitted]
Migration Utility
[Chart omitted]
Scalability

Dataset          Size     No. of files   Dedupe   Total running time (parallel mode)
Home Directory   74 GB    78 K           49%      24 sec
Debian           261 GB   448            59%      1 min
VMDK             2.4 TB   2.4 K          62%      13 min
EngWebBurt       1.3 TB   4 M            51%      9 min
Synthetic 1      2.6 TB   8 M            77%      20 min
Synthetic 2      4 TB     12 M           85%      28 min

Scales to large datasets
– ~30 minutes for a 4 TB, 12-million-file dataset with 85% dedupe