PETAL: DISTRIBUTED VIRTUAL DISKS

E. K. Lee and C. A. Thekkath, DEC SRC
Highlights
• The paper presents a distributed storage management system:
  – Petal consists of a collection of network-connected servers that cooperatively manage a pool of physical disks
  – Clients see Petal as highly available block-level storage partitioned into virtual disks
Introduction
• Petal is a distributed storage system that
  – Tolerates single component failures
  – Can be geographically distributed to tolerate site failures
  – Transparently reconfigures to expand in performance or capacity
  – Uniformly balances load and capacity
  – Provides fast, efficient support for backup and recovery
Petal User Interface
• Petal appears to its clients as a collection of virtual disks:
  – Block-level interface
  – Lower-level service than a DFS
  – Makes the system easier to model, design, implement, and tune
  – Can support heterogeneous clients and applications
Client view
[Figure: clients running BSD FFS, NTFS, and EXT2 FS access Petal's virtual disks over a scalable network]
Physical view
[Figure: the same clients (BSD FFS, NTFS, EXT2 FS) connect over a scalable network to a set of storage servers]
Petal Server Modules
• Liveness module
• Global state module
• Data access module
• Recovery module
• Virtual-to-physical translation module
Overall design (I)
• All state information is maintained on servers
  – Clients maintain only hints
• Liveness module ensures that all servers agree on the system's operational status
  – Uses majority consensus and periodic exchanges of "I'm alive"/"You're alive?" messages
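The heartbeat-and-majority idea above can be sketched as follows; this is a toy in-memory model, not the paper's protocol, and the class name, timeout value, and method names are all hypothetical:

```python
import time

HEARTBEAT_TIMEOUT = 5.0  # seconds; illustrative value, not from the paper

class LivenessModule:
    """Toy sketch: a server is considered alive if it has been heard from
    recently, and the system proceeds only with a majority alive."""

    def __init__(self, server_ids):
        self.server_ids = list(server_ids)
        self.last_heard = {s: 0.0 for s in self.server_ids}

    def record_heartbeat(self, sender, now=None):
        # Called when an "I'm alive" message arrives from `sender`.
        self.last_heard[sender] = time.time() if now is None else now

    def alive_servers(self, now=None):
        now = time.time() if now is None else now
        return {s for s in self.server_ids
                if now - self.last_heard[s] <= HEARTBEAT_TIMEOUT}

    def has_quorum(self, now=None):
        # Majority consensus: more than half the servers must look alive.
        return len(self.alive_servers(now)) > len(self.server_ids) // 2
```

A real implementation would also have to agree on *which* servers are down (the paper uses majority consensus for this), not just observe timeouts locally.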
Overall design (II)
• Information describing
  – the current members of the storage system and
  – the currently supported virtual disks
  is replicated across all servers
• Global state module keeps this information consistent
  – Uses Lamport's Paxos algorithm
  – Assumes fail-silent server failures
Overall design (III)
• Data access and recovery modules
  – Control how client data are distributed and stored
  – Support two schemes:
    • Simple data striping without redundancy
    • Chained declustering, which distributes mirrored data in a way that balances load in the event of a failure
Address translation (I)
• Must translate virtual addresses <virtual-disk ID, offset> into physical addresses <server ID, disk ID, offset>
• The mechanism should be fast and fault-tolerant
Address translation (II)
• Uses three replicated data structures
  – Virtual disk directory (VDir): translates a virtual disk ID into a global map ID
  – Global map (GMap): locates the server responsible for translating the given offset (block number)
  – Physical map (PMap): locates the physical disk and computes the physical offset within that disk
Virtual to physical mapping
[Figure: a client request <vdiskID, offset> arrives at Server 1; its VDir and GMap identify the responsible server (here Server 2), whose PMap yields the diskID and diskOffset on that server. Every server (0, 1, 2) holds replicas of the VDir and GMap plus its own PMap.]
Address translation (III)
• Three-step process:
  1. The VDir translates the virtual disk ID given by the client into a GMap ID
  2. The specified GMap finds the server that can translate the given offset
  3. The PMap of that server translates the GMap ID and offset to a physical disk and a disk offset
• The last two steps are almost always performed by the same server
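The three-step lookup can be sketched as plain dictionaries; this is a toy model, not the paper's data layout: the 64 KB PMap granularity is from the paper, but the names (VDIR, GMAPS, PMAPS) and the two-server striping pattern are illustrative.

```python
BLOCK = 64 * 1024  # each PMap entry maps 64 KB of physical disk space

# Step 1: virtual disk directory: virtual-disk ID -> global map ID
VDIR = {7: "gmap-A"}

# Step 2: global map: which server is responsible for a given offset.
# Here gmap-A stripes the virtual disk across servers 0 and 1, 64 KB at a time.
GMAPS = {"gmap-A": lambda offset: (offset // BLOCK) % 2}

# Step 3: per-server physical map: (GMap ID, block number) -> (disk, base offset)
PMAPS = {
    0: {("gmap-A", 0): ("disk-3", 0)},
    1: {("gmap-A", 1): ("disk-0", BLOCK)},
}

def translate(vdisk_id, offset):
    gmap_id = VDIR[vdisk_id]                      # step 1
    server = GMAPS[gmap_id](offset)               # step 2
    block = offset // BLOCK
    disk, base = PMAPS[server][(gmap_id, block)]  # step 3 (usually same server)
    return server, disk, base + offset % BLOCK
```

Because step 2 routes the request to the server whose PMap holds the entry, steps 2 and 3 normally execute on one machine, matching the slide's last bullet.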
Address translation (IV)
• There is one GMap per virtual disk
• That GMap specifies
  – The tuple of servers spanned by the virtual disk
  – The redundancy scheme used to protect the data
• GMaps are immutable
  – They cannot be modified
  – Changing a mapping requires creating a new GMap
Address translation (V)
• PMaps are similar to page tables
  – Each PMap entry maps 64 KB of physical disk space
  – The server that performs the translation will usually perform the disk I/O
• Keeping GMaps and PMaps separate minimizes the amount of global information that must be replicated
Support for backups
• Petal supports snapshots of virtual disks
• Snapshots are immutable copies of virtual disks
  – Created using copy-on-write
• The VDir maps <virtual-disk ID, epoch> into <GMap ID, epoch>
  – The epoch identifies the current version of a virtual disk and snapshots of past versions
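A minimal copy-on-write sketch of epoch-based snapshots, assuming the simple rule that writes go to the current epoch and reads search back through older epochs; the class and method names are illustrative, not the paper's:

```python
class VirtualDisk:
    """Toy copy-on-write virtual disk keyed by (block, epoch)."""

    def __init__(self):
        self.epoch = 0
        self.blocks = {}      # (block number, epoch) -> data
        self.snapshots = []   # epochs frozen by snapshots

    def write(self, block, data):
        # Writes always go to the current epoch; older epochs stay immutable.
        self.blocks[(block, self.epoch)] = data

    def snapshot(self):
        # Freeze the current epoch and start a new one; no data is copied.
        self.snapshots.append(self.epoch)
        self.epoch += 1

    def read(self, block, epoch=None):
        # Search back from the requested epoch to the most recent write.
        epoch = self.epoch if epoch is None else epoch
        for e in range(epoch, -1, -1):
            if (block, e) in self.blocks:
                return self.blocks[(block, e)]
        return None
```

Taking a snapshot is O(1) here: only the epoch counter advances, and a block is duplicated only when it is next written, which is the essence of copy-on-write.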
Incremental reconfiguration (I)
• Used to add or remove servers and disks
• Three simple steps:
  1. Create a new GMap
  2. Update the VDir entries
  3. Redistribute the data
• The challenge is to perform the reconfiguration concurrently with normal client requests
Incremental reconfiguration (II)
• To solve the problem
  – Read requests
    • Try the new GMap first
    • Switch to the old GMap if the new GMap has no appropriate translation
  – Write requests always use the new GMap
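The read/write rules above can be sketched with two plain dicts standing in for the old and new GMaps; the function names are illustrative, and real GMaps are far richer than a dict:

```python
def do_write(new_map, block, data):
    # Writes always go to the new GMap.
    new_map[block] = data

def do_read(new_map, old_map, block):
    # Try the new GMap first; fall back to the old GMap if the block
    # has not yet been written or relocated under the new layout.
    if block in new_map:
        return new_map[block]
    return old_map.get(block)

def relocate(new_map, old_map, block):
    # Background redistribution: move one block into the new layout.
    if block in old_map and block not in new_map:
        new_map[block] = old_map.pop(block)
```

Because writes never touch the old map, the old layout only shrinks, and once every block has been relocated the old GMap can be discarded.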
Incremental reconfiguration (III)
• Observe that the new GMap must be created before any data are moved
  – If the whole disk is reconfigured at once, too many read requests have to consult both GMaps
    • This seriously degrades system performance
  – Instead, make incremental changes over a fenced region of the virtual disk
Chained declustering (I)

Layout of one virtual disk across four servers. Rows 1 and 3 are primary copies; rows 2 and 4 are secondary copies of the previous server's primaries:

  Server 0   Server 1   Server 2   Server 3
  D0         D1         D2         D3
  D3         D0         D1         D2
  D4         D5         D6         D7
  D7         D4         D5         D6
Chained declustering (II)
• If one server fails, its workload is almost equally distributed among the remaining servers
• Petal uses a primary/secondary scheme for managing the copies
  – Read requests can go to either the primary or the secondary copy
  – Write requests must go first to the primary copy
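The layout in the table above follows a simple rule: with N servers, block b's primary lives on server b mod N and its secondary on server (b + 1) mod N. A toy sketch of placement and request routing under that rule (function names are illustrative):

```python
N_SERVERS = 4

def primary(block):
    # Primary copy of block b on server b mod N.
    return block % N_SERVERS

def secondary(block):
    # Secondary copy on the next server in the chain.
    return (block + 1) % N_SERVERS

def read_server(block, failed=frozenset()):
    # Reads may be served by either copy; prefer the primary.
    if primary(block) not in failed:
        return primary(block)
    if secondary(block) not in failed:
        return secondary(block)
    raise RuntimeError("both copies unavailable")

def write_servers(block):
    # Writes go to the primary first, then the secondary.
    return [primary(block), secondary(block)]
```

When server s fails, its primaries are re-read from server s + 1 and its secondaries from server s - 1, and by redirecting some reads further along the chain the extra load can be spread over all surviving servers.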
Petal prototype
• Four servers
  – Each has fourteen 4.3 GB disks
• Four clients
• Links are 155 Mb/s ATM links
• The Petal RPC interface has 24 calls
Latency of a virtual disk
Throughput of a virtual disk
Throughput is mostly limited by CPU overhead (233 MHz CPUs!)
File system performance
(Modified Andrew Benchmark)
Conclusion
• A block-level interface is simpler and more flexible than a file-system interface
• The use of distributed software solutions allows geographic distribution
• Petal's performance is acceptable except for write requests
  – A write must wait for both the primary and the secondary copy to be successfully updated
Paxos: the main idea
• Proposers propose decision values from an arbitrary input set and try to collect acceptances from a majority of the accepters
• Learners observe this ratification process and attempt to detect that ratification has occurred
• Agreement is enforced because only one proposal can get the votes of a majority of accepters
Paxos: the assumptions
• An algorithm for consensus in a message-passing system
• Assumes the existence of failure detectors that let processes give up on stalled processes after some amount of time
• Processes can act as proposers, accepters, and learners
  – A single process may combine all three roles
Paxos: the tricky part
• The tricky part is to avoid deadlocks when
  – There are more than two proposals
  – Some of the processes fail
• Paxos lets
  – Proposers make new proposals
  – Accepters release their earlier votes for losing proposals
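A minimal single-decree sketch of these ideas (synchronous, in-memory, no real failure handling; class and function names are illustrative). It shows how a higher-numbered proposal makes accepters release earlier votes, and how a value, once chosen by a majority, survives later proposals:

```python
class Accepter:
    def __init__(self):
        self.promised = -1     # highest ballot number promised
        self.accepted = None   # (ballot, value) of the last vote, if any

    def prepare(self, ballot):
        # Phase 1: promise to ignore proposals with lower ballot numbers,
        # releasing any earlier vote in favour of the newer proposal.
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted
        return False, None

    def accept(self, ballot, value):
        # Phase 2: vote, unless already promised to a higher ballot.
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False

def propose(accepters, ballot, value):
    # Phase 1: collect promises from a majority.
    promises = [a.prepare(ballot) for a in accepters]
    granted = [acc for ok, acc in promises if ok]
    if len(granted) <= len(accepters) // 2:
        return None
    # If any accepter already voted, adopt the highest-ballot value:
    # this is what preserves a previously chosen value.
    prior = [acc for acc in granted if acc is not None]
    if prior:
        value = max(prior)[1]
    # Phase 2: ask the accepters to vote for the value.
    votes = sum(a.accept(ballot, value) for a in accepters)
    return value if votes > len(accepters) // 2 else None
```

Because only one proposal can gather a majority of votes per ballot, and later proposers must adopt any value a majority may already have chosen, agreement holds even with competing proposers.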