parallax - xen · 2011-02-28 · parallax storage system use snapshots as a unifying tool for...
TRANSCRIPT
Parallax
Dutch Meyer
University of British Columbia
[email protected]
The Plan
Virtual Machines and Storage
Parallax Feature Overview
Technical Design
System Evaluation
Conclusion
Parallax is a Storage Service
Observations on Naïve storage
Virtual machines can be created and destroyed easily; storage can't
VM encapsulation makes capturing whole-machine state attractive, but capturing a whole disk image is slow
Giving similar VMs similar disk images results in wasted space
Our Research Questions
How do we make volume provisioning agile enough to match VM creation?
Can we capture whole-disk state at near-continuous granularity?
How much data redundancy can we eliminate?
How much overhead to do all of this well?
Parallax storage system
Use snapshots as a unifying tool for:
  Provisioning new volumes
  Data sharing
  Low-overhead state capture and backup
Allow block-level layout optimization
Allow disconnected/degraded operation
Compatibility due to a VM-based architecture operating at the block level
Snapshots
Data Protection
Low Granularity (e.g. days)
  "What if" configuration and testing
  Backup
High Granularity (e.g. ms)
  Legal compliance
  Paranoia
Time Travel: by capturing whole-machine state at high frequency, we can revisit previous machine states
Provisioning via Gold Mastering
Use snapshots to create a copy of some reference volume, which can be further specialized
Requirements include:
  Global availability
  Efficient operation
  No hard limits on the number of volumes
Data Sharing
Commonly derived disks can share common data
Sharing is read-only, with copy-on-write (COW) when data is modified
We can further eliminate redundancy by detecting duplicate blocks and deduplicating them (current focus)
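
A minimal sketch of duplicate-block detection, assuming content hashing (SHA-256) over fixed 4 KB blocks. The slides do not say how Parallax detects duplicates, so the `BlockStore` class, its fields, and the block size are hypothetical names and values chosen for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size; not stated in the slides


class BlockStore:
    """Hypothetical store that keeps one physical copy per distinct block."""

    def __init__(self):
        self.blocks = []      # physical blocks, indexed by physical address
        self.by_digest = {}   # content digest -> physical address
        self.refcount = []    # virtual references per physical block

    def write(self, data: bytes) -> int:
        """Store a block and return its physical address, reusing duplicates."""
        digest = hashlib.sha256(data).digest()
        addr = self.by_digest.get(digest)
        if addr is None:
            # First time this content is seen: allocate a new physical block.
            addr = len(self.blocks)
            self.blocks.append(data)
            self.refcount.append(0)
            self.by_digest[digest] = addr
        self.refcount[addr] += 1
        return addr


store = BlockStore()
a = store.write(b"\x00" * BLOCK_SIZE)
b = store.write(b"\x00" * BLOCK_SIZE)   # same content: shares the first copy
c = store.write(b"\x01" * BLOCK_SIZE)   # different content: new physical block
```

Here `a` and `b` resolve to the same physical address, so two virtual blocks cost one block of storage. A shared block must still be treated as read-only, which is where copy-on-write comes in.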
Parallax Implementation
Building Virtual Disks
Locking and Synchronization
Storage Services
System Review
The Parallax engine is a user-mode tapdisk driver for block management
Provides services to any VM sharing the same physical machine
Federates across multiple physical machines to share a single volume of storage
Building Virtual Disks
Flexibility in block placement is essential to providing disk isolation
Parallax uses a radix tree to facilitate this:
  Fixed height
  Root is linked to a disk image
  Nodes are disk blocks, containing an array of pointers
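
The radix-tree bullets above can be sketched in a few lines. This is an illustration under stated assumptions, not the Parallax implementation: nodes are in-memory Python lists rather than disk blocks, the fanout (16 pointers per node) and height (3) are picked to keep the example small, and `new_node`, `indices`, `lookup`, and `cow_write` are hypothetical helper names. The write path copies only the nodes it touches, which is one common way to make a snapshot cheap: keeping the old root keeps the old disk image.

```python
FANOUT_BITS = 4            # assumed: 16 pointers per node
FANOUT = 1 << FANOUT_BITS
HEIGHT = 3                 # fixed height: three levels of index nodes


def new_node():
    """A radix node: a fixed-size array of pointers (None = unmapped)."""
    return [None] * FANOUT


def indices(vaddr):
    """Split a virtual block address into one index per level, top first."""
    return [(vaddr >> (FANOUT_BITS * level)) & (FANOUT - 1)
            for level in reversed(range(HEIGHT))]


def lookup(root, vaddr):
    """Walk from the root; return the mapped value, or None if unmapped."""
    node = root
    for i in indices(vaddr):
        if node is None:
            return None
        node = node[i]
    return node


def cow_write(root, vaddr, paddr):
    """Return a new root mapping vaddr -> paddr; the old root is untouched.

    Only the nodes on the path are copied, so both trees share the rest.
    """
    path = indices(vaddr)
    new_root = list(root)
    node = new_root
    for i in path[:-1]:
        child = node[i]
        node[i] = list(child) if child is not None else new_node()
        node = node[i]
    node[path[-1]] = paddr
    return new_root


snapshot = cow_write(new_node(), 0x5C1, 100)   # volume state at snapshot time
live = cow_write(snapshot, 0x5C1, 200)         # later write to the same block
live = cow_write(live, 0x5C2, 201)             # later write to a new block
other = cow_write(snapshot, 0x100, 7)          # write under a different subtree
```

After these writes, `lookup(snapshot, 0x5C1)` still returns 100 while `lookup(live, 0x5C1)` returns 200, and `other` shares the untouched subtree under index 5 with `snapshot` instead of copying it.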
Radix Nodes and Trees
Taking A Snapshot
IO Batching
Parallax follows the semantics of a physical disk
  Simultaneous requests may be completed in any order
  Must retain "crash consistency"
Updating radix trees can involve several IO operations
  Batching becomes essential to maintaining performance
  Ordering constraints are imposed for crash consistency
We use a dependency tracking system to issue writes in the correct order
Writes are aggressively pipelined, similar to instruction scheduling
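
One way to picture the dependency tracking described above is as a topological ordering over pending writes. The sketch below uses Python's standard `graphlib.TopologicalSorter`. It assumes the ordering constraint is that a block reaches disk before any pointer to it; the write names are made up for illustration, and the slides do not describe Parallax's actual tracker.

```python
from graphlib import TopologicalSorter

# Hypothetical writes for one block update. Each key lists the writes that
# must reach disk before it: a pointer is only written after its target.
depends_on = {
    "data block": set(),
    "leaf radix node": {"data block"},
    "interior radix node": {"leaf radix node"},
    "root": {"interior radix node"},
    "unrelated data block": set(),
}

sorter = TopologicalSorter(depends_on)
sorter.prepare()

batches = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())   # writes whose dependencies are done
    batches.append(ready)                # these can be issued concurrently
    sorter.done(*ready)                  # pretend the whole batch completed
```

Each entry in `batches` is a set of writes with no ordering constraints among them, so they can be in flight together. A crash between batches leaves at worst some unreferenced blocks, never a pointer to a block that was not written.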
Parallax Implementation
Building Virtual Disks
Locking and Synchronization
Storage Services
Federating Physical Machines
All machines share a single disk
Some synchronization is required between physical machines
Data plane is protected through long-lived, coarse-grained allocation
Control plane requires a lock manager
Lock Management
The current system has 3 contentious locks:
  Creating a virtual disk
  Claiming a virtual disk
  Requesting a new extent
In practice, these locks are acquired very infrequently
It is possible to further limit contention in our design
Parallax Implementation
Building Virtual Disks
Locking and Synchronization
Storage Services
Degraded Operation
Evaluation: Performance
  Per-Request Latency
  System Throughput
Evaluation: Snapshots
  Storage Overheads
  Snapshot Overhead
Conclusion
We can use VM-based encapsulation to extend the services normally provided in a storage stack
Despite using several potentially high-overhead techniques, Parallax achieves reasonable performance
Future Work
Working on deduping, layout optimization
Expose features to aware file systems
More storage services for VMs: caching, encryption, etc.
General release
End of Presentation
Thanks! Questions?
Extents
We wish to minimize contention for the shared disk
The simple approach is to partition the disk into large extents which can be given exclusively to individuals
We use a 2GB extent size currently
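
A rough sketch of why large extents keep contention low: a machine takes the shared lock only when it needs a new extent, then hands out blocks from that extent with no coordination. The 2 GB extent size is from the slide. The 4 KB block size, the `SharedDisk` and `Host` classes, and the use of `threading.Lock` as a stand-in for the lock manager are assumptions made for the example.

```python
import threading

EXTENT_SIZE = 2 * 1024**3      # 2 GB, from the slide
BLOCK_SIZE = 4096              # assumed block size
BLOCKS_PER_EXTENT = EXTENT_SIZE // BLOCK_SIZE


class SharedDisk:
    """Hypothetical shared disk; the lock stands in for the lock manager."""

    def __init__(self, num_extents):
        self.lock = threading.Lock()
        self.next_extent = 0
        self.num_extents = num_extents
        self.lock_acquisitions = 0

    def claim_extent(self):
        with self.lock:                      # the only contended step
            self.lock_acquisitions += 1
            if self.next_extent >= self.num_extents:
                raise RuntimeError("shared disk is full")
            extent = self.next_extent
            self.next_extent += 1
            return extent


class Host:
    """A physical machine that allocates blocks from extents it owns."""

    def __init__(self, disk):
        self.disk = disk
        self.extent = None
        self.used = 0

    def alloc_block(self):
        # Claim a new extent only when none is held or the current one is full.
        if self.extent is None or self.used == BLOCKS_PER_EXTENT:
            self.extent = self.disk.claim_extent()
            self.used = 0
        block = self.extent * BLOCKS_PER_EXTENT + self.used
        self.used += 1
        return block


disk = SharedDisk(num_extents=4)
host_a, host_b = Host(disk), Host(disk)
blocks_a = [host_a.alloc_block() for _ in range(1000)]
blocks_b = [host_b.alloc_block() for _ in range(1000)]
```

With these numbers each extent holds 524,288 blocks, so the 2,000 allocations above touch the shared lock only twice, once per host.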
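Translating: Virtual to Physical
[Figure: the virtual address 010111000001 is split into three 4-bit groups, 0101 1100 0001. Each group indexes one level of the radix tree: Root --0101--> A --1100--> B --0001--> C]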
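
The translation in this slide can be worked through as a short example. It is a sketch based on one reading of the figure: the 12-bit address is consumed four bits at a time, most significant group first, and each group selects a pointer in the current node. The node contents, the dict-based node model, and the function names `split_address` and `translate` are hypothetical.

```python
BITS_PER_LEVEL = 4   # taken from the 4-bit groups in the slide's example
LEVELS = 3


def split_address(vaddr: int):
    """Return the per-level indices of a virtual address, top level first."""
    mask = (1 << BITS_PER_LEVEL) - 1
    return [(vaddr >> (BITS_PER_LEVEL * (LEVELS - 1 - level))) & mask
            for level in range(LEVELS)]


def translate(root: dict, vaddr: int):
    """Follow one pointer per level; the final entry is the physical block."""
    node = root
    for index in split_address(vaddr):
        node = node[index]
    return node


# Hypothetical tree mirroring the slide: Root -> A -> B -> C.
# Nodes are modeled as dicts; "C" stands in for the physical block.
node_b = {0b0001: "C"}
node_a = {0b1100: node_b}
root = {0b0101: node_a}

parts = split_address(0b010111000001)
target = translate(root, 0b010111000001)
```

`parts` holds the three indices (0101, 1100, 0001 in binary) and `target` is the block labelled C. Because the tree has a fixed height, every translation costs the same number of pointer lookups.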