a study of practical deduplication · 2020. 5. 19. · the benefit of fine grained dedup is < 20%...

33
A study of practical deduplication Dutch T. Meyer University of British Columbia Microsoft Research Intern William Bolosky Microsoft Research

Upload: others

Post on 26-Jan-2021

1 views

Category:

Documents


0 download

TRANSCRIPT

  • 2011 Storage Developer Conference. © Dutch Meyer. All Rights Reserved.

    A study of practical deduplication

    Dutch T. Meyer University of British Columbia

    Microsoft Research Intern William Bolosky Microsoft Research

  • 2011 Storage Developer Conference. © Dutch Meyer. All Rights Reserved.

    Why Dutch is Not Here

  • 2011 Storage Developer Conference. © Dutch Meyer. All Rights Reserved.

    A study of practical deduplication

    Dutch T. Meyer University of British Columbia

    Microsoft Research Intern William Bolosky Microsoft Research

  • 2011 Storage Developer Conference. © Dutch Meyer. All Rights Reserved.

    Why study deduplication?

    $0.046 per GB

    9ms per seek

    $0.039 XXXX

  • 2011 Storage Developer Conference. © Dutch Meyer. All Rights Reserved.

    When do we exploit duplicates?

    It Depends. How much can you get back from deduping? How does fragmenting files affect performance? How often will you access the data?

  • 2011 Storage Developer Conference. © Dutch Meyer. All Rights Reserved.

    Outline

    Intro Methodology “There’s more here than dedup” teaser

    (intermission)

    Deduplication Background Deplication Analysis Conclusion

  • 2011 Storage Developer Conference. © Dutch Meyer. All Rights Reserved.

    Methodology

    MD5(name) Metadata MD5(data) MD5(name) Metadata MD5(data) MD5(name) Metadata MD5(data)

    Once per week for 4 weeks. ~875 file systems ~40TB ~200M Files

  • 2011 Storage Developer Conference. © Dutch Meyer. All Rights Reserved.

    There’s more here than dedup!

    We update and extend filesystem metadata findings from 2000 and 2004

    File system complexity is growing Read the paper to answer questions like: Are my files bigger now than they used to be?

  • 2011 Storage Developer Conference. © Dutch Meyer. All Rights Reserved.

    Teaser: Histogram of file size

    0%

    2%

    4%

    6%

    8%

    10%

    12%

    14%

    0 8 128 2K 32K 512K 8M 128M File Size (bytes), power-of-two bins

    2009 2004 2000

    4K Since 1981!

  • 2011 Storage Developer Conference. © Dutch Meyer. All Rights Reserved.

    There’s more here than dedup!

    How fragmented are my files?

  • 2011 Storage Developer Conference. © Dutch Meyer. All Rights Reserved.

    Teaser: Layout and Organization

    High linearity: only 4% of files fragmented in practice Most windows machines defrag weekly

    One quarter of fragmented files have at least 170 fragments

  • 2011 Storage Developer Conference. © Dutch Meyer. All Rights Reserved.

    Intermission

    Intro Methodology “There’s more here than dedup” teaser

    (intermission)

    Deduplication Background Deplication Analysis Conclusion

  • 2011 Storage Developer Conference. © Dutch Meyer. All Rights Reserved.

    Dedup Background

    foo 01101010….. ….110010101

    bar 01101010….. ….110010101

    Whole file Deduplication

  • 2011 Storage Developer Conference. © Dutch Meyer. All Rights Reserved.

    Dedup Background

    foo 01101010….. ….110010101

    bar 01101010….. ….110010101

    Fixed Chunk Deduplication

    1

    01101010…..

    01101010…..

    ….110010101

    ….1100101011

  • 2011 Storage Developer Conference. © Dutch Meyer. All Rights Reserved.

    Dedup Background

    foo 01101010….. ….110010101

    bar 01101010….. ….110010101

    Rabin Figerprinting

    1

    110101 101010 010100

    101101010…..

  • 2011 Storage Developer Conference. © Dutch Meyer. All Rights Reserved.

    The Deduplication Space

    Algorithm Parameters

    Cost Deduplication effectiveness

    Whole-file Low Lowest

    Fixed Chunk

    Chunk Size Seeks CPU Complexity

    Middle

    Rabin fingerprints

    Average Chunk Size

    Seeks More CPU More Complexity

    Highest

  • 2011 Storage Developer Conference. © Dutch Meyer. All Rights Reserved.

    What is the relative deduplication rate of the algorithms?

  • 2011 Storage Developer Conference. © Dutch Meyer. All Rights Reserved.

    Dedup by method and chunk size

    0%

    10%

    20%

    30%

    40%

    50%

    60%

    70%

    80%

    90%

    100%

    64K 32K 16K 8K

    Spac

    e D

    edup

    licat

    ed

    Chunk Size

    Whole File Fixed-Chunk Rabin

  • 2011 Storage Developer Conference. © Dutch Meyer. All Rights Reserved.

    What if I was doing full weekly backups?

  • 2011 Storage Developer Conference. © Dutch Meyer. All Rights Reserved.

    Backup dedup over 4 weeks

    0% 10% 20% 30% 40% 50% 60% 70% 80% 90%

    Whole File

    Whole File + Sparse

    8K rabin

    Deduplicated Space

  • 2011 Storage Developer Conference. © Dutch Meyer. All Rights Reserved.

    How does the number of filesystems influence deduplication?

  • 2011 Storage Developer Conference. © Dutch Meyer. All Rights Reserved.

    Dedup by filesystem count

    0%

    10%

    20%

    30%

    40%

    50%

    60%

    70%

    80%

    90%

    100%

    1 2 4 8 16 32 64 128 256 512 Whole Set

    Spac

    e D

    edup

    licat

    ed

    Deduplication Domain Size (file systems)

    Whole File 64 KB Fixed 8KB Fixed 64KB Rabin 8KB Rabin

  • 2011 Storage Developer Conference. © Dutch Meyer. All Rights Reserved.

    So what is filling up all this space?

  • 2011 Storage Developer Conference. © Dutch Meyer. All Rights Reserved.

    Bytes by containing file size

    0%

    2%

    4%

    6%

    8%

    10%

    12%

    1K 16K 256K 4M 64M 1G 16G 256G

    Perc

    enta

    ge o

    f Tot

    al B

    ytes

    Containing File Size (Bytes), Power-of-2 bins

    2000 2004 2009

  • 2011 Storage Developer Conference. © Dutch Meyer. All Rights Reserved.

    What types of files take up disk space?

  • 2011 Storage Developer Conference. © Dutch Meyer. All Rights Reserved.

    Disk consumption by file type

    dll dll ø

    pdb vhd

    dll exe

    pdb

    lib pst exe

    vhd pch

    wma

    pdb

    mp3

    lib

    exe

    lib

    cab

    pch

    chm

    pst

    cab

    cab

    mp3

    wma

    ø

    ø

    iso

    0%

    10%

    20%

    30%

    40%

    50%

    60%

    2000 2004 2009

  • 2011 Storage Developer Conference. © Dutch Meyer. All Rights Reserved.

    Disk consumption by file type

    dll dll ø

    pdb vhd

    dll exe

    pdb

    lib pst exe

    vhd pch

    wma

    pdb

    mp3

    lib

    exe

    lib

    cab

    pch

    chm

    pst

    cab

    cab

    mp3

    wma

    ø

    ø

    iso

    0%

    10%

    20%

    30%

    40%

    50%

    60%

    2000 2004 2009

  • 2011 Storage Developer Conference. © Dutch Meyer. All Rights Reserved.

    Which of these types deduplicate well?

  • 2011 Storage Developer Conference. © Dutch Meyer. All Rights Reserved.

    Whole-file duplicates

    Extension % of Duplicate

    Space Mean File

    Size (bytes) % of

    Total Space

    dll 20% 521K 10% lib 11% 1080K 7% pdb 11% 2M 7% 7% 277K 13% exe 6% 572K 4% cab 4% 4M 2% msp 3% 15M 2% msi 3% 5M 1% iso 2% 436M 2% 1% 604K

  • 2011 Storage Developer Conference. © Dutch Meyer. All Rights Reserved.

    What files make up the 20% difference between whole file dedup and sparse file, as compared

    to more aggressive deduplication?

  • 2011 Storage Developer Conference. © Dutch Meyer. All Rights Reserved.

    Where does fine granularity help?

    vhd vhd

    pch

    lib

    dll

    obj

    pdb

    pdb

    lib

    pch

    wma

    iso

    pst

    dll

    ø

    avhd

    avhd

    wma

    mo3

    wim

    0%

    10%

    20%

    30%

    40%

    50%

    60%

    70%

    8K Fixed 8K Rabin

    Perc

    enta

    ge o

    f diff

    eren

    ce v

    s.

    who

    le fi

    le +

    spa

    rse

  • 2011 Storage Developer Conference. © Dutch Meyer. All Rights Reserved.

    Last plea to read the whole paper

    ~4x more results in paper! Real world filesystem analysis is hard Eight machines months in query processing Requires careful simplifying assumptions Requires heavy optimization

  • 2011 Storage Developer Conference. © Dutch Meyer. All Rights Reserved.

    Conclusion

    The benefit of fine grained dedup is < 20% Potentially just a fraction of that.

    Fragmentation is a manageable problem Read the paper for more metadata results

    We’re releasing this dataset (when I finish the

    anonymization)

    A study of practical deduplicationWhy Dutch is Not HereA study of practical deduplicationWhy study deduplication?When do we exploit duplicates?OutlineMethodologyThere’s more here than dedup!Teaser: Histogram of file sizeThere’s more here than dedup!Teaser: Layout and OrganizationIntermissionDedup BackgroundDedup BackgroundDedup BackgroundThe Deduplication SpaceSlide Number 17Dedup by method and chunk sizeSlide Number 19Backup dedup over 4 weeksSlide Number 21Dedup by filesystem countSlide Number 23Bytes by containing file sizeSlide Number 25Disk consumption by file typeDisk consumption by file typeSlide Number 28Whole-file duplicatesSlide Number 30Where does fine granularity help?Last plea to read the whole paperConclusion