TRANSCRIPT
Low-Cost Data Deduplication for Virtual Machine Backup in
Cloud Storage
Wei Zhang, Tao Yang, Gautham Narayanasamy
University of California at Santa Barbara
Hong Tang
Alibaba Inc.
USENIX HotStorage’2013
Motivation
• Virtual machines in the cloud can use frequent backup to improve service reliability
  – Used in Alibaba's Aliyun, the largest public cloud service in China
• High storage demand & large content duplicates
  – Daily backup workload: hundreds of TB @ Aliyun
  – Number of VMs per cluster: tens of thousands
• Seek inexpensive solutions
Architecture Consideration
• An external and dedicated backup storage system
  – High network traffic for transferring undeduplicated data
  – Expensive
• A decentralized and co-hosted backup system with full deduplication
  – Lower cost & traffic
Requirements
• Non-dedicated resources: co-hosted with existing cloud services
• Resource friendly – small memory footprint and CPU usage
• Compute and back up tens of thousands of VMs within a few hours each day, during light cloud workload
Focus and Related Work
• Previous work: inline chunk-based deduplication
  – High cost for fingerprint lookup
  – Speed up fingerprint comparison with approximation (e.g. subsampling, Bloom filters, stateless routing)
• Focus of this paper
  – Not inline: shorten the overall backup time of many VM images, not of individual requests
  – Not offline: multi-stage parallel backup with small storage overhead & limited computing resources
  – Work in progress
Key Ideas
• Separation of duplicate detection and data backup
  – Different from inline deduplication
• Buffered data redistribution in parallel duplicate detection
  – Stage 1: Collect fingerprints in parallel
  – Stage 2: Detect duplicates in parallel
  – Stage 3: Perform actual VM backup in parallel
Stage 1: Deduplication request accumulation
➔ Scan dirty data blocks
➔ Exchange & accumulate dedup requests
➔ Map data from VM-based to fingerprint-based distribution
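The mapping from VM-based to fingerprint-based distribution can be sketched as hash routing: each machine scans its dirty blocks, fingerprints them, and buffers a dedup request for the machine and index partition that own that fingerprint. A minimal Python sketch, assuming SHA-1 fingerprints and illustrative `P`/`Q` values; the names and choices below are not prescribed by the paper:

```python
import hashlib

P = 4  # number of machines (illustrative)
Q = 8  # index partitions per machine (illustrative)

def fingerprint(block: bytes) -> bytes:
    """Content fingerprint of a dirty data block (SHA-1 here for illustration)."""
    return hashlib.sha1(block).digest()

def route(fp: bytes) -> tuple[int, int]:
    """Map a fingerprint to (machine, partition), redistributing requests
    from a VM-based to a fingerprint-based layout."""
    h = int.from_bytes(fp[:8], "big")
    return h % P, (h // P) % Q

# Accumulate dedup requests per destination machine; each request carries the
# fingerprint plus enough context to route the dedup summary back to the VM.
send_buffers = {m: [] for m in range(P)}
for vm_id, seg_id, block in [(7, 0, b"hello"), (7, 1, b"world"), (9, 0, b"hello")]:
    fp = fingerprint(block)
    machine, partition = route(fp)
    send_buffers[machine].append((partition, fp, vm_id, seg_id))
```

Because routing depends only on the fingerprint, identical blocks created on different VMs land in the same partition, where Stage 2 can detect the duplicate.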
Stage 2: Fingerprint comparison and summary output
• Load global index and dedup requests one partition at a time
• Compare fingerprints in parallel
• Output dedup summary, redistributed from fingerprint-based back to VM-based distribution
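Loading one partition at a time bounds memory to a single index partition plus its requests. A hedged sketch of the per-partition comparison; `compare_partition` and the tuple layout are illustrative assumptions, not the paper's code:

```python
def compare_partition(index_partition: set, requests: list) -> tuple[list, list]:
    """Return (summary, new_entries): the summary marks each request as
    duplicate or new; new fingerprints are added to the index partition."""
    summary, new_entries = [], []
    for fp, vm_id, seg_id in requests:
        if fp in index_partition:
            summary.append((vm_id, seg_id, fp, "dup"))
        else:
            index_partition.add(fp)      # first occurrence becomes the stored copy
            new_entries.append(fp)
            summary.append((vm_id, seg_id, fp, "new"))
    return summary, new_entries

partition = {b"\x01" * 20}               # pretend one fingerprint is already known
reqs = [(b"\x01" * 20, 7, 0), (b"\x02" * 20, 7, 1)]
summary, new = compare_partition(partition, reqs)
# summary entries are then routed back from fingerprint-based to VM-based layout
```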
Stage 3: Non-duplicate data backup
• Load dedup summaries
• Read dirty segments
• Output non-duplicate data blocks
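Stage 3 then needs only the dedup summary and one more pass over the dirty segments. A minimal sketch, assuming the summary entries are aligned with a segment's blocks (names are illustrative):

```python
def backup_segment(segment_blocks: list, summary: list) -> list:
    """Write only the blocks the dedup summary marks as new; for duplicates,
    the fingerprint reference alone is kept in the snapshot metadata."""
    out = []
    for block, (vm_id, seg_id, fp, status) in zip(segment_blocks, summary):
        if status == "new":
            out.append((fp, block))      # non-duplicate data goes to backup storage
        # "dup" blocks contribute no data, only a metadata reference
    return out

blocks = [b"hello", b"world"]
summary = [(7, 0, b"fp-hello", "dup"), (7, 1, b"fp-world", "new")]
written = backup_segment(blocks, summary)   # only b"world" is written out
```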
Memory Usage per Machine at Different Stages
• Stage 1: Request accumulation
  – 1 I/O buffer to read dirty segments
  – p network send and p recv buffers for p machines
  – q dedup request buffers for local disk write of q partitions
• Stage 2: Fingerprint comparison
  – Space for hosting 1 partition index and corresponding requests
  – p network send and p recv buffers, v local summary buffers for disk write
• Stage 3: Non-duplicate backup
  – An I/O buffer to read dirty segments and write non-duplicates
  – Duplicate summary within dirty segments
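As a back-of-envelope check, the Stage 1 buffer list can be turned into a memory estimate. All buffer sizes below are assumptions chosen for illustration; only the 100-machine setting and 2MB segment size come from the evaluation slides:

```python
# Illustrative Stage 1 memory estimate per machine (buffer sizes ASSUMED).
KB = 1024
P = 100                               # machines, from the evaluation setting
Q = 25                                # dedup-request partitions (assumed)

io_buffer    = 2 * 1024 * KB          # one I/O buffer, sized to a 2 MB segment
net_buffers  = 2 * P * 128 * KB       # p send + p recv buffers, assumed 128 KB each
request_bufs = Q * 256 * KB           # q request buffers, assumed 256 KB each

total_mb = (io_buffer + net_buffers + request_bufs) / (1024 * KB)
print(f"Stage 1 estimate: {total_mb} MB")   # same order as the ~35 MB reported later
```

The dominant term is the network buffers, which grow linearly with the number of machines; this is why small per-buffer sizes matter in the co-hosted setting.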
Issues with Incidental Redundancy
• Two VM blocks with the same fingerprint are created in parallel on different machines
  – Both are identified as new blocks
  – The remaining occurrences are detected as duplicates and logged
• Inconsistency is repaired periodically during index update
Snapshot Deletion
• Mark-and-sweep – a block can be deleted if its reference count is zero
• Similar to the deduplication stages
  – Scan the metadata and accumulate block reference pointers
  – Compute the reference count of each index entry, partition by partition
  – Log deletion instructions
• Periodically perform a compact operation when a partition's deletion log grows too big
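The mark-and-sweep pass above can be sketched as reference counting over snapshot metadata, one index partition at a time. An illustrative Python sketch; the names and data layout are assumptions, not the paper's implementation:

```python
from collections import Counter

def sweep(snapshot_metadata: list, index_partition: set) -> list:
    """Mark: accumulate reference counts for each fingerprint across all
    snapshots. Sweep: log deletion instructions for zero-reference entries."""
    refs = Counter()
    for snapshot in snapshot_metadata:
        for fp in snapshot:
            refs[fp] += 1
    return [fp for fp in index_partition if refs[fp] == 0]

index = {b"A", b"B", b"C"}
snapshots = [[b"A"], [b"A", b"C"]]
deletion_log = sweep(snapshots, index)   # b"B" has no references, so it is logged
```

Deletions are only logged here; actual space is reclaimed later by the periodic compact operation.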
Evaluation
• Evaluated on a cluster of dual quad-core Intel Nehalem 2.4GHz E5530 nodes with 24GB memory
• Test data from Alibaba's Aliyun cloud
  – 41TB, 10 snapshots per VM
  – Segment size: 2MB; avg. block size: 4KB
• Evaluation objectives
  1) Analyze the deduplication throughput and effectiveness for a large number of VMs
  2) Examine the impact of buffering during metadata exchange
Data Characteristics
• Each VM uses 40GB of storage space on average
• OS and user data disks: each takes ~50% of the space
• OS data
  – 7 mainstream OS releases: Debian, Ubuntu, Redhat, CentOS, Win2003 32-bit, Win2003 64-bit, and Win2008 64-bit
• User data
  – From 1323 VM users
Setting & Resource Usage per Machine
• P=100 machines, 25 VMs per machine
• Disk
  – 8 GB metadata usage
  – 10 ms local disk seek cost
  – 50 MB/second I/O per machine
  – < 16.7% of local I/O bandwidth usage
• Memory usage: ~35 MB
• CPU: single-thread execution per machine
  – 10–13% of a single core
Conclusions
• Low-cost multi-stage parallel deduplication for simultaneous backup of many VM images
  – Co-hosted with other cloud services
• Tradeoffs:
  – Not optimized for individual backup requests
  – Reads dirty data twice
• Work in progress
• Evaluation
  – Backup throughput of 100 machines: about 8.76 GB per second for 2500 VMs
  – Resource friendly to the existing cluster services