TRANSCRIPT
Low-Cost Data Deduplication for Virtual Machine Backup in
Cloud Storage
Wei Zhang, Tao Yang, Gautham Narayanasamy
University of California at Santa Barbara
Hong Tang
Alibaba Inc.
USENIX HotStorage’2013
Motivation
• Virtual machines in the cloud can use frequent backup to improve service reliability
  – Used in Alibaba's Aliyun, the largest public cloud service in China
• High storage demand & large content duplicates
  – Daily backup workload: hundreds of TB @ Aliyun
  – Number of VMs per cluster: tens of thousands
• Seek inexpensive solutions
Architecture Consideration
• An external and dedicated backup storage system
  – High network traffic for transferring undeduplicated data
  – Expensive
• A decentralized and co-hosted backup system with full deduplication
  – Lower cost & traffic
Requirements
• Non-dedicated resources: co-hosted with existing cloud services
• Resource friendly – small memory footprint and CPU usage
• Compute and back up tens of thousands of VMs within a few hours each day, during light cloud workload
Focus and Related Work
• Previous work: inline chunk-based deduplication
  – High cost for fingerprint lookup
  – Speed up fingerprint comparison with approximation (e.g. subsampling, Bloom filters, stateless routing)
• Focus of this paper
  – Not inline: shorten the overall backup time of many VM images, not of individual requests
  – Not offline: multi-stage parallel backup with small storage overhead & limited computing resources
  – Work in progress
Key Ideas
• Separation of duplicate detection and data backup
  – Different from inline deduplication
• Buffered data redistribution in parallel duplicate detection
  – Stage 1: Collect fingerprints in parallel
  – Stage 2: Detect duplicates in parallel
  – Stage 3: Perform actual VM backup in parallel
Stage 1: Deduplication request accumulation
➔ Scan dirty data blocks
➔ Exchange & accumulate dedup requests
➔ Map data from VM-based to fingerprint-based distribution
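The mapping from VM-based to fingerprint-based distribution can be sketched as hash routing: each machine scans its dirty blocks, fingerprints them, and buffers a dedup request for the machine and index partition that own that fingerprint. A minimal Python sketch, assuming SHA-1 fingerprints and illustrative `P`/`Q` values; the names and choices below are not prescribed by the paper:

```python
import hashlib

P = 4  # number of machines (illustrative)
Q = 8  # index partitions per machine (illustrative)

def fingerprint(block: bytes) -> bytes:
    """Content fingerprint of a dirty data block (SHA-1 here for illustration)."""
    return hashlib.sha1(block).digest()

def route(fp: bytes) -> tuple[int, int]:
    """Map a fingerprint to (machine, partition), redistributing requests
    from a VM-based to a fingerprint-based layout."""
    h = int.from_bytes(fp[:8], "big")
    return h % P, (h // P) % Q

# Accumulate dedup requests per destination machine; each request carries the
# fingerprint plus enough context to route the dedup summary back to the VM.
send_buffers = {m: [] for m in range(P)}
for vm_id, seg_id, block in [(7, 0, b"hello"), (7, 1, b"world"), (9, 0, b"hello")]:
    fp = fingerprint(block)
    machine, partition = route(fp)
    send_buffers[machine].append((partition, fp, vm_id, seg_id))
```

Because routing depends only on the fingerprint, identical blocks created on different VMs land in the same partition, where Stage 2 can detect the duplicate.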
Stage 2: Fingerprint comparison and summary output
• Load global index and dedup requests one partition at a time
• Compare fingerprints in parallel
• Output dedup summary, redistributed from fingerprint-based back to VM-based distribution
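Loading one partition at a time bounds memory to a single index partition plus its requests. A hedged sketch of the per-partition comparison; `compare_partition` and the tuple layout are illustrative assumptions, not the paper's code:

```python
def compare_partition(index_partition: set, requests: list) -> tuple[list, list]:
    """Return (summary, new_entries): the summary marks each request as
    duplicate or new; new fingerprints are added to the index partition."""
    summary, new_entries = [], []
    for fp, vm_id, seg_id in requests:
        if fp in index_partition:
            summary.append((vm_id, seg_id, fp, "dup"))
        else:
            index_partition.add(fp)      # first occurrence becomes the stored copy
            new_entries.append(fp)
            summary.append((vm_id, seg_id, fp, "new"))
    return summary, new_entries

partition = {b"\x01" * 20}               # pretend one fingerprint is already known
reqs = [(b"\x01" * 20, 7, 0), (b"\x02" * 20, 7, 1)]
summary, new = compare_partition(partition, reqs)
# summary entries are then routed back from fingerprint-based to VM-based layout
```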
Stage 3: Non-duplicate data backup
• Load dedup summaries
• Read dirty segments
• Output non-duplicate data blocks
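Stage 3 then needs only the dedup summary and one more pass over the dirty segments. A minimal sketch, assuming the summary entries are aligned with a segment's blocks (names are illustrative):

```python
def backup_segment(segment_blocks: list, summary: list) -> list:
    """Write only the blocks the dedup summary marks as new; for duplicates,
    the fingerprint reference alone is kept in the snapshot metadata."""
    out = []
    for block, (vm_id, seg_id, fp, status) in zip(segment_blocks, summary):
        if status == "new":
            out.append((fp, block))      # non-duplicate data goes to backup storage
        # "dup" blocks contribute no data, only a metadata reference
    return out

blocks = [b"hello", b"world"]
summary = [(7, 0, b"fp-hello", "dup"), (7, 1, b"fp-world", "new")]
written = backup_segment(blocks, summary)   # only b"world" is written out
```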
Memory Usage per Machine at Different Stages
• Stage 1: Request accumulation
  – 1 I/O buffer to read dirty segments
  – p network send and p recv buffers for p machines
  – q dedup request buffers for local disk write of q partitions
• Stage 2: Fingerprint comparison
  – Space for hosting 1 partition index and corresponding requests
  – p network send and p recv buffers, v local summary buffers for disk write
• Stage 3: Non-duplicate backup
  – An I/O buffer to read dirty segments and write non-duplicates
  – Duplicate summary within dirty segments
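As a back-of-envelope check, the Stage 1 buffer list can be turned into a memory estimate. All buffer sizes below are assumptions chosen for illustration; only the 100-machine setting and 2MB segment size come from the evaluation slides:

```python
# Illustrative Stage 1 memory estimate per machine (buffer sizes ASSUMED).
KB = 1024
P = 100                               # machines, from the evaluation setting
Q = 25                                # dedup-request partitions (assumed)

io_buffer    = 2 * 1024 * KB          # one I/O buffer, sized to a 2 MB segment
net_buffers  = 2 * P * 128 * KB       # p send + p recv buffers, assumed 128 KB each
request_bufs = Q * 256 * KB           # q request buffers, assumed 256 KB each

total_mb = (io_buffer + net_buffers + request_bufs) / (1024 * KB)
print(f"Stage 1 estimate: {total_mb} MB")   # same order as the ~35 MB reported later
```

The dominant term is the network buffers, which grow linearly with the number of machines; this is why small per-buffer sizes matter in the co-hosted setting.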
Issues with Incidental Redundancy
• Two VM blocks with the same fingerprint are created in parallel on different machines
  – Both are identified as new blocks
  – The remaining occurrences are detected as duplicates and logged
• Inconsistency is repaired periodically during index update
Snapshot Deletion
• Mark-and-sweep – a block can be deleted if its reference count is zero
• Similar to the deduplication stages
  – Scan the metadata and accumulate block reference pointers
  – Compute the reference count of each index entry, partition by partition
  – Log deletion instructions
• Periodically perform a compact operation when a partition's deletion log grows too big
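The mark-and-sweep pass above can be sketched as reference counting over snapshot metadata, one index partition at a time. An illustrative Python sketch; the names and data layout are assumptions, not the paper's implementation:

```python
from collections import Counter

def sweep(snapshot_metadata: list, index_partition: set) -> list:
    """Mark: accumulate reference counts for each fingerprint across all
    snapshots. Sweep: log deletion instructions for zero-reference entries."""
    refs = Counter()
    for snapshot in snapshot_metadata:
        for fp in snapshot:
            refs[fp] += 1
    return [fp for fp in index_partition if refs[fp] == 0]

index = {b"A", b"B", b"C"}
snapshots = [[b"A"], [b"A", b"C"]]
deletion_log = sweep(snapshots, index)   # b"B" has no references, so it is logged
```

Deletions are only logged here; actual space is reclaimed later by the periodic compact operation.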
Evaluation
• Evaluated on a cluster of dual quad-core Intel Nehalem 2.4GHz E5530 nodes with 24GB memory
• Test data from Alibaba's Aliyun cloud
  – 41TB, 10 snapshots per VM
  – Segment size: 2MB; avg. block size: 4KB
• Evaluation objectives
  1) Analyze the deduplication throughput and effectiveness for a large number of VMs
  2) Examine the impact of buffering during metadata exchange
Data Characteristics
• Each VM uses 40GB of storage space on average
• OS and user data disks: each takes ~50% of the space
• OS data
  – 7 mainstream OS releases: Debian, Ubuntu, Redhat, CentOS, Win2003 32-bit, Win2003 64-bit, and Win2008 64-bit
• User data
  – From 1323 VM users
Setting & Resource Usage per Machine
• P=100 machines, 25 VMs per machine
• Disk
  – 8 GB metadata usage
  – 10 ms local disk seek cost
  – 50 MB/second I/O per machine
  – < 16.7% of local I/O bandwidth usage
• Memory usage: ~35 MB
• CPU: single-thread execution per machine
  – 10–13% of a single core
Conclusions
• Low-cost multi-stage parallel deduplication for simultaneous backup of many VM images
  – Co-hosted with other cloud services
• Tradeoffs:
  – Not optimized for individual backup requests
  – Reads dirty data twice
• Work in progress
• Evaluation
  – Backup throughput of 100 machines: about 8.76 GB per second for 2500 VMs
  – Resource friendly to the existing cluster services