CS 443 Advanced OS
Fabián E. Bustamante, Spring 2005
Venti : A new approach to archival storage
Sean Quinlan & Sean Dorward, Bell Labs, Lucent Technologies
FAST 2002
Presented by: Fabián E. Bustamante
2
Archival Storage
– Abundant storage – storing data for a long time, even forever
– Archival system with a write-once policy – changes how we see storage
– Prevalent form of archival storage: tape backup
  • Central service for a number of clients
  • Restoring data is tedious and error prone (requires a sys admin or privileged software)
  • Infrequent, so problems may go undetected for a long time
  • Tradeoff between performance of backup & restore (incremental backups are more efficient to generate, but not self-contained)
– Snapshots (Plan 9, Andrew FS, …)
  • A consistent read-only view of the FS at some point in the past
  • Maintains file system permissions
  • Can be accessed with standard tools – ls, cat, cp, grep, diff
  • Avoids the tradeoff between full vs. incremental backup: looks like a full backup, but the implementation resembles an incremental backup – blocks are shared
  • Only a small number of versions are kept
3
Venti
GOAL: “To provide a write-once archival repository that can be shared by multiple client machines and applications”
– Block-level network storage system
  • Actually a backend store for client apps
  • A block-level interface places few restrictions on data structures/format
– Blocks addressed by the hash of their contents
  • Uses the SHA-1 algorithm
  • SHA-1 output is a 160-bit (20-byte) fingerprint of the data block
– Write-once policy
  • Once written, a block cannot be deleted
– Multiple writes of the same data are coalesced
  • Data sharing – increases effective storage capacity
  • Makes the write operation idempotent
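The core idea – blocks addressed by the SHA-1 hash of their contents, with duplicate writes coalesced – can be sketched in a few lines. This is a toy in-memory model (the class name and dict-based storage are illustrative; the real Venti server uses an append-only disk log plus a separate index):

```python
import hashlib

class VentiLikeStore:
    """Toy in-memory sketch of a content-addressed, write-once block store."""

    def __init__(self):
        self.blocks = {}  # fingerprint -> data

    def write(self, data: bytes) -> bytes:
        # The block's address IS the SHA-1 hash of its contents, so writing
        # the same data twice is idempotent: duplicates coalesce into one copy.
        fp = hashlib.sha1(data).digest()  # 160-bit (20-byte) fingerprint
        self.blocks.setdefault(fp, data)  # never overwrite: write-once
        return fp

    def read(self, fp: bytes) -> bytes:
        data = self.blocks[fp]
        # Integrity checking comes for free: recompute the fingerprint on
        # retrieval and compare it against the requested address.
        assert hashlib.sha1(data).digest() == fp
        return data

store = VentiLikeStore()
fp1 = store.write(b"hello archival world")
fp2 = store.write(b"hello archival world")   # coalesced with the first write
assert fp1 == fp2 and len(store.blocks) == 1
assert store.read(fp1) == b"hello archival world"
```

Note how the hash-based address doubles as a universal namespace: any client that knows a fingerprint can fetch (and verify) the block.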
4
Venti
– Multiple clients can share a Venti server
  • The hash function gives a universal namespace
– Inherent integrity checking of data
  • Fingerprint recomputed on retrieval
– Caching is simplified (blocks are immutable)
– Uses magnetic disks as the storage technology
  • Access time comparable to non-archival data
– Hashing function – SHA-1 is more than enough for now
– An exabyte of storage (10^18 bytes) in 8 KB blocks gives n ≈ 10^14 blocks; with SHA-1 (b = 160 bits), the probability of any collision is p < 10^-20:

    p ≈ [n(n−1)/2] × 2^−b

  where n(n−1)/2 is the number of pairs of blocks and 2^−b is the probability that a given pair collides.
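Plugging the slide's numbers into the birthday bound p ≈ n(n−1)/2 × 2^−b is a quick sanity check (the ~10^14 block count follows from an exabyte divided into 8 KB blocks):

```python
# An exabyte (10^18 bytes) in 8 KB blocks is roughly 10^14 blocks.
n = 10**18 // (8 * 1024)     # ~1.2e14 blocks
b = 160                       # SHA-1 fingerprint width in bits

pairs = n * (n - 1) // 2      # number of pairs of blocks
p = pairs / 2**b              # union bound on the collision probability
assert p < 1e-20              # matches the slide's claim
```

So even at exabyte scale the chance of two distinct blocks sharing a fingerprint is far below the error rates of the hardware itself.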
5
Applications - general
– Venti as a building block for constructing storage applications
– Data is divided into blocks
– The app needs a block's fingerprint to retrieve it
– One approach: fingerprints packed into pointer blocks, themselves written to the server
– Repeated recursively to obtain a single fingerprint – the root of a tree
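The recursive pointer-block construction can be sketched as follows. This is an illustrative simplification (real Venti blocks also carry a type field so readers can distinguish pointer blocks from data blocks; that is omitted here), using a plain dict in place of the server:

```python
import hashlib

BLOCK = 8192            # data block size (8 KB, as in the slides)
FP = 20                 # a SHA-1 fingerprint is 20 bytes
FANOUT = BLOCK // FP    # fingerprints that fit in one pointer block

def write_block(store: dict, data: bytes) -> bytes:
    fp = hashlib.sha1(data).digest()
    store[fp] = data
    return fp

def write_tree(store: dict, data: bytes) -> bytes:
    """Store `data` and return the root fingerprint of its pointer tree."""
    # Leaf level: fingerprints of the raw data blocks.
    fps = [write_block(store, data[i:i + BLOCK])
           for i in range(0, len(data), BLOCK)] or [write_block(store, b"")]
    # Pack fingerprints into pointer blocks, recursively, until one remains.
    while len(fps) > 1:
        fps = [write_block(store, b"".join(fps[i:i + FANOUT]))
               for i in range(0, len(fps), FANOUT)]
    return fps[0]      # single root fingerprint names the whole tree
```

Writing the same data again reproduces the same tree, so unchanged subtrees are shared automatically between versions.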
6
Applications - general
– The tree cannot be modified, but unchanged sections of the tree can be reused
– More complex data structures – mixing data & fingerprints in a block
– E.g., a structure for storing a file system – 3 types of blocks:
  • Directory – file metadata + root fingerprint
  • Pointer
  • Data
7
Example app: Vac
– Similar to zip & tar – stores a collection of files & directories as a single object
– A tree of blocks is formed for the selected files
– The vac archive file is 45 bytes long
  • 20 bytes for the root fingerprint
  • 25 bytes of fixed header string
  • any amount of data “compresses” down to 45 bytes
– Unvac – restores files from an archive
– Duplicate copies of a file are coalesced on the server
  • Multiple users vac the same data – only 1 copy is stored
  • vac of changed data stores only the modified blocks
8
Example app: Physical level backup
– Copy the raw disk blocks to Venti
  • No need to walk the file hierarchy
  • Higher throughput
– Duplicate blocks are coalesced
– Free blocks – null fingerprints in the pointer tree
– User sees a full backup of the device
– Storage space advantages of incremental backup are retained
– Random access possible
  • Directly mounting a backup file system image (with OS support)
  • Enables lazy (on-demand) restore
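A minimal sketch of the physical-level idea: iterate over raw device blocks, coalesce duplicates via fingerprints, and record free blocks as null pointers so nothing is stored for them. All names here (`backup_device`, `NULL_FP`, the callback) are illustrative, not Venti's actual API:

```python
import hashlib

NULL_FP = b"\x00" * 20   # hypothetical marker for a free (unallocated) block

def backup_device(store: dict, device_blocks, is_free):
    """Back up raw disk blocks; `is_free(i)` flags unallocated blocks."""
    tree = []
    for i, blk in enumerate(device_blocks):
        if is_free(i):
            tree.append(NULL_FP)            # nothing stored for free blocks
        else:
            fp = hashlib.sha1(blk).digest()
            store[fp] = blk                 # duplicates across backups coalesce
            tree.append(fp)
    return tree

def restore_block(store: dict, tree, i, blank=b"\x00" * 512):
    """Lazy (on-demand) restore: fetch one block only when it is accessed."""
    fp = tree[i]
    return blank if fp == NULL_FP else store[fp]
```

Because unchanged disk blocks hash to the same fingerprints, a second full backup of a mostly unchanged device costs roughly the space of an incremental one.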
9
Example app - Plan 9 FS
Previously:
– Magnetic disk + write-once optical jukebox
– Disk acts as a cache for faster access & accumulates changes between snapshots
– Cache smaller than the active file system
– Cache misses are painful
Plan 9 FS on top of Venti:
– Venti as the storage device (instead of the jukebox)
– Equalizes access to active & archival storage
– Cache can also be smaller
10
Implementation
– Append-only log of data blocks (on a RAID array)
– Separate index maps fingerprints to log locations
  • Can be regenerated from the data log

[Diagram: clients reach the Venti server over the network; inside the server, a block cache and an index cache sit in front of the index and the data log.]
11
Venti’s data log
– Log divided into arenas for ease of maintenance
– Each arena holds an append-only section of data blocks (variable size)
– Per-block header information supports integrity checking & data recovery
– Data compressed (LZ77):
  • encoding – whether, and with which algorithm, the block was compressed
  • esize – size after compression
– Per-arena directory for fast index rebuild & error recovery
– Data log and directory grow in opposite directions
– When an arena fills, it is marked as sealed and its fingerprint is computed
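The per-block log record described above might look like the sketch below. The field layout is hypothetical (the real arena format differs), and zlib stands in for the paper's LZ77-family compressor; the `encoding` and `esize` fields are the ones named on the slide:

```python
import struct
import zlib  # LZ77-family compressor, stand-in for Venti's

# Hypothetical record layout (not the real on-disk format):
#   20-byte fingerprint | 1-byte encoding | 2-byte size | 2-byte esize | payload
ENC_NONE, ENC_ZLIB = 0, 1

def pack_entry(fp: bytes, data: bytes) -> bytes:
    comp = zlib.compress(data)
    if len(comp) < len(data):
        enc, payload = ENC_ZLIB, comp
    else:
        enc, payload = ENC_NONE, data   # store raw if compression doesn't help
    # `esize` records the size after compression, as on the slide.
    return struct.pack(">20sBHH", fp, enc, len(data), len(payload)) + payload

def unpack_entry(entry: bytes):
    fp, enc, size, esize = struct.unpack(">20sBHH", entry[:25])
    payload = entry[25:25 + esize]
    data = zlib.decompress(payload) if enc == ENC_ZLIB else payload
    assert len(data) == size            # integrity check against the header
    return fp, data
```

Keeping the fingerprint and sizes in every record is what makes the index rebuildable from the log alone: a recovery pass can rescan the arenas and re-derive every index entry.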
12
Venti’s index
– The fingerprint space is vastly larger than the number of blocks on a server
– Index implemented as a disk-resident hash table
– A hashing function maps fingerprints to index buckets
– Going through the index is the main performance penalty; mitigations:
  • Caching (block and index caches)
  • Index striped across multiple disks – a fingerprint's location in the index is random, and the index is too big to fit in memory
  • Write buffering (to increase the number of concurrent accesses)
– Block cache hit – both the index lookup & the data log read are bypassed
– Index cache hit – the index lookup is bypassed
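The lookup path above can be sketched as follows. Bucket selection exploits the fact that SHA-1 output is already uniformly distributed, so a slice of the fingerprint itself picks the bucket; the function names and bucket count are illustrative:

```python
NBUCKETS = 1 << 20   # number of fixed-size index buckets (illustrative)

def bucket_of(fingerprint: bytes) -> int:
    # SHA-1 fingerprints are uniformly random, so no extra hashing is
    # needed: a prefix of the fingerprint maps it to a bucket.
    return int.from_bytes(fingerprint[:8], "big") % NBUCKETS

def lookup(fp, block_cache, index_cache, index_buckets, read_log):
    if fp in block_cache:                 # block-cache hit: skip index AND log
        return block_cache[fp]
    if fp in index_cache:                 # index-cache hit: skip the disk index
        addr = index_cache[fp]
    else:                                 # main penalty: a disk index access
        addr = index_buckets[bucket_of(fp)][fp]
        index_cache[fp] = addr
    data = read_log(addr)                 # one read from the data log
    block_cache[fp] = data
    return data
```

The two caches short-circuit progressively more work, which is why the micro-benchmarks below report separate "index cache" and "block cache" rows.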
13
Performance: computing environments
Testbed
– Dedicated dual 550 MHz Pentium III, 2 GB memory
– Access through 100 Mbps Ethernet
– Data log stored on a 500 GB MaxTronic IDE RAID 5 array
– Index stored on 8 Seagate Cheetah 18XL 9 GB SCSI drives
Micro-benchmark results (MB/s):

               Sequential reads   Random reads   Virgin writes   Duplicate writes
  Uncached            0.9             0.4             3.7              5.6
  Index cache         4.2             0.7              -               6.2
  Block cache         6.8              -               -               6.5
  Raw RAID           14.8             1.0            12.4             12.4
14
Storage usage – file servers used
– Two Plan 9 file servers, bootes and emelie, spanning 1990 to 2001
– 522 user accounts, 50–100 active at a time
– Numerous development projects hosted
– Several large data sets: astronomical data, satellite imagery, multimedia files
– Next figures show:
  • Size of the active file system
  • Space consumed on the jukebox
  • Space when using Venti
  • Ratio – using Venti reduces the cost of retaining snapshots (so the ratio shrinks)
15
Storage usage
16
Storage saving
When stored on Venti, the size of the jukebox data is reduced by three factors:
– Elimination of duplicate blocks
– Elimination of block fragmentation
– Compression of block contents

  % reduction in storage       bootes    emelie
  Elimination of duplicates     27.8%     31.3%
  Elimination of fragments      10.2%     25.4%
  Data compression              33.8%     54.1%
  Total reduction               59.7%     76.5%
17
Reliability & recovery
Tools for integrity checking & error recovery:
– Verifying the structure of an arena
– Checking that there is an index entry for every block in the data log, and vice versa
– Rebuilding the index from the data log
– Copying an arena to removable media
Data log on a RAID 5 disk array – protection against single-drive failures
If that's not enough, for extra protection:
– Off-site mirrors
– Sealed arenas copied to removable media
18
Future Work
Load balancing
– Distribute Venti across multiple machines
– Replicate the server
– Use a proxy server to hide the distribution from clients
Better access control
– Currently just authentication to the server
– A single root fingerprint grants access to an entire file tree
Use of variable-sized blocks, as in LBFS
19
Conclusion
Addressing blocks by the SHA-1 hash of their contents – a good fit for archival storage
Magnetic disks as the storage technology
– Large capacity at low price – online storage
– Random access
– Performance comparable to non-archival data
Write-once model
– Reduces accidental or malicious data loss
– Simplifies administration
– Simplifies caching
– Allows sharing of data