Data Deduplication: Venti and Its Improvements


Umair Amjad, 12-5044, [email protected]
Department of Computer Science, National University of Computer and Emerging Sciences, Pakistan

Abstract
The entire world is adopting digital technologies, moving from legacy approaches to digital ones, and data is the primary asset available in digital form everywhere. To store this massive amount of data, the storage methodology should be efficient as well as intelligent enough to detect redundant data before saving it. Data deduplication techniques are widely used by storage servers to eliminate the possibility of storing multiple copies of the same data. Deduplication identifies duplicate data portions before they are written to the storage system and also removes duplication within data already stored, yielding significant cost savings. This paper is about data deduplication: taking Venti as the base case, it discusses the system in detail and identifies areas of improvement in Venti that are addressed by other papers.

Keywords
Data deduplication; data storage; hash index; Venti; archival data

1. Introduction
The world is producing a large amount of digital data that is growing rapidly. According to one study, the information added to the digital universe each year is growing by 57% annually. This whopping growth of information places a considerable load on storage systems. Thirty-five percent of this information is generated by enterprises and must therefore be retained for regulatory compliance and legal reasons, so it is critical to back up the data regularly to a disaster recovery site for availability and integrity. Rapidly growing data raises many challenges for existing storage systems. One observation is that a significant fraction of information contains duplicates, due to reasons such as backups, copies, and version updates. Thus, deduplication techniques have been invented to avoid storing redundant information.

A number of trends have motivated the creation of deduplication solutions. Archival systems such as Venti have identified significant information redundancy within and across machines due to updated versions and commonly installed applications and libraries. In addition to storage overhead, duplicate file content can also have other negative effects on the system. As files are accessed, they are cached in memory and in the hard disk cache; duplicate content can consume cache space that could otherwise hold additional unique content. Deduplication solves these issues by locating identical content and handling it appropriately: instead of storing the same file content multiple times, a new file can simply reference the identical content already stored in the system. The use of deduplication results in more efficient use of both the memory cache and storage capacity. This paper takes Venti as the base case for data deduplication and identifies its missing areas; solutions for these areas are then proposed with reference to other research papers.

2. Background
In storage archives, a large quantity of data is redundant or only slightly changed relative to other chunks of data. The term data deduplication refers to techniques that save only a single instance of replicated data and provide links to that instance in place of storing further copies. Many techniques exist for eliminating redundancy from stored data, and at present data deduplication has gained popularity in the research community.
Data deduplication is a specialized data compression technique for eliminating redundant data, typically to improve
storage utilization. In the deduplication process, redundant data is discarded rather than stored. With the evolution of backup services from tape to disk, data deduplication has become a key element of the backup process. It ensures that only one copy of a given piece of data is kept in the data center; every user who wants to access that data is linked to the single stored instance. Data deduplication therefore helps decrease the size of the data center. In other words, deduplication means that the number of replicas of data that would usually be duplicated (for example, in the cloud) is controlled and managed to shrink the physical storage space required.

The basic steps for deduplication are:
1. Files are divided into small segments.
2. New and existing segments are checked for similarity by comparing fingerprints created by a hashing algorithm.
3. Metadata structures are updated.
4. Segments are compressed.
5. Duplicate data is deleted and a data integrity check is performed.

2.1 Types of Data Deduplication
There are two major categories of data deduplication on which all research is based.

1. Offline data deduplication (target based): In offline deduplication, data is first written to the storage disk and the deduplication process takes place at a later time, on the target data storage center. The client is unmodified and unaware of any deduplication. This approach improves storage utilization and no one needs to wait for hash-based calculations, but it does not save bandwidth.

2. Online data deduplication (source based): In online deduplication, replicated data is removed before being written to the storage disk; deduplication is performed on the data at the source before it is transferred. A deduplication-aware backup agent is installed on the client, which backs up only unique data. The result is increased bandwidth and storage efficiency, but this places extra computational load on the backup client. Replicas are replaced by pointers, and the actual replicated data is never sent over the network.

Once the timing of data deduplication has been decided, a number of existing techniques can be applied. The most widely used deduplication approaches are file-level hashing and block-level hashing.

1. File-level hashing: In file-level hashing, the whole file is passed to a hashing function, usually a cryptographic hash such as MD5 or SHA-1. The cryptographic hash is used to find entirely duplicate files. This approach is fast, with low computation and low additional metadata overhead, and it works very well for complete system backups, where fully duplicate files are common. However, the large granularity of duplicate matching prevents it from matching two files that differ by only a single byte or bit of data.

2. Block-level hashing: Here the file is broken into a number of smaller sections before deduplication. The number of sections depends on the approach being used. The two most common types of block-level hashing are fixed-size chunking and variable-length chunking. In fixed-size chunking, a file is divided into a number of fixed-size pieces called chunks; in variable-length chunking, a file is broken into chunks of variable length. Each section is passed to a cryptographic hash function (usually MD5 or SHA-1) to obtain a chunk identifier, which is used to locate replicated data. A sketch of this approach is given below.
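To make block-level deduplication with fixed-size chunking concrete, the following Python sketch stores each unique chunk once, keyed by its SHA-1 fingerprint, and records a per-file list of fingerprints. It is an illustration rather than the implementation of any surveyed system; the 8 KB chunk size and the in-memory dictionary standing in for the chunk store are assumptions.

    # Minimal sketch of block-level deduplication with fixed-size chunking.
    # Assumptions for illustration: 8 KB chunks, an in-memory dict as the chunk store.
    import hashlib

    CHUNK_SIZE = 8192          # fixed-size chunks
    store = {}                 # fingerprint -> chunk data (stands in for the storage backend)

    def dedup_write(path):
        """Store a file as an ordered list of chunk fingerprints, writing only new chunks."""
        recipe = []            # per-file metadata: the sequence of fingerprints
        with open(path, "rb") as f:
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    break
                fp = hashlib.sha1(chunk).hexdigest()   # chunk identifier
                if fp not in store:                    # duplicate chunks are not stored again
                    store[fp] = chunk
                recipe.append(fp)
        return recipe

    def dedup_read(recipe):
        """Reassemble a file from its fingerprint list."""
        return b"".join(store[fp] for fp in recipe)

The same structure applies to both categories above: run it on the target after the data arrives (offline), or on the client before transmission so that only chunks with unseen fingerprints are sent (online).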
With file-level deduplication, any change inside a file causes the entire file to be stored again. PowerPoint and similar documents may need only a simple content change, such as updating a page to show a new report or new dates, yet this leads to re-storing the entire document. Block-level deduplication stores only one version of the document plus the changed portions between versions. File-level technology generally achieves less than a 5:1 compression ratio, whereas block-level techniques can compress the stored data by 20:1 or even 50:1.

2.2 Methodologies of Deduplication
At present, research on deduplication focuses on two aspects. One is to remove as much duplicate data as possible and thereby reduce the storage capacity requirement; the other is efficiency in the resources required to do so. Most traditional backup systems use file-level deduplication. However, deduplication technology can exploit inter-file and intra-file redundancy to eliminate duplicate or similar data at block or byte granularity. Some of the available architectures follow source deduplication, but with this approach the user faces delay when sending data to the backup store. The remaining architectures support a target deduplication strategy that provides single-system deduplication: at the target side, a single system (server) handles all user requests to store data and maintains the hash index for the disks attached to it.

Venti is a network storage system. It identifies block contents by their hash values so that it reduces the storage space occupied by the data. Venti manages blocks for large storage applications and enforces a write-once policy to avoid conflicting updates to data. The system emerged in the early stages of network storage, so it is not well suited to handling vast amounts of data, and it is not scalable.

3. Venti as a base case
The key idea behind Venti is to identify data blocks by a hash of their contents, called a fingerprint in the Venti paper. The fingerprint is the source of all of Venti's notable benefits. Because blocks are addressed by the fingerprint of their contents, a block cannot be modified without changing its address (write-once behavior). Writes are idempotent, since multiple writes of the same data can be coalesced and do not require additional storage. Multiple clients can share data blocks on a Venti server without cooperating or coordinating. Inherent integrity checking is ensured, since both the client and the server can compute the fingerprint of the data when a block is retrieved and compare it to the requested fingerprint. Features such as replication, caching, and load balancing are also facilitated: because the contents of a particular block are immutable, the problem of data coherency is greatly reduced.

The main challenge of the work, on the other hand, is also brought about by hashing. The design of Venti requires a hash function that can generate a unique fingerprint for every data block that a client may want to store. Venti employs a cryptographic hash function, SHA-1, for which it is computationally infeasible to find two distinct inputs that hash to the same value. (At the time the Venti paper was written, there were no known SHA-1 collisions.) As to the choice of storage technology, the authors make a convincing argument for magnetic disks by comparing the prices and performance of disks and optical storage systems.
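To make the content-addressed interface concrete, the following sketch shows a write that returns the SHA-1 fingerprint of a block and a read that verifies the retrieved block against the requested fingerprint. This is an illustrative approximation only, not Venti's actual code, data layout, or network protocol; the in-memory dictionary stands in for the append-only data log and index.

    # Illustrative sketch of a Venti-style content-addressed block store.
    import hashlib

    class BlockStore:
        def __init__(self):
            self.blocks = {}                   # fingerprint -> block contents

        def write(self, data: bytes) -> str:
            """Store a block and return its fingerprint (SHA-1 of the contents)."""
            fp = hashlib.sha1(data).hexdigest()
            # Writes are idempotent: identical data always maps to the same
            # fingerprint, so duplicates coalesce into one stored block.
            self.blocks.setdefault(fp, data)
            return fp

        def read(self, fp: str) -> bytes:
            """Retrieve a block and check its integrity against the fingerprint."""
            data = self.blocks[fp]
            if hashlib.sha1(data).hexdigest() != fp:
                raise IOError("block corrupted: fingerprint mismatch")
            return data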
Each block is prefixed by a header that describes its contents. The primary purpose of the header is to provide integrity checking during normal operation and to assist in data recovery. The header includes a magic number, the fingerprint and size of the block, the time when the block was first written, and the identity of the user who wrote it. The header also includes a user-supplied type identifier. Note that only one copy of a given block is stored in the log, so the user and time fields correspond to the first time the block was stored on the server. The encoding field in the block header indicates whether the data was compressed and, if so, the algorithm used. The e-size field gives the size of the data after compression, enabling the location of the next block in the arena to be determined.

In addition to a log of data blocks, an arena includes a header, a directory, and a trailer. The header identifies the arena. The directory contains a copy of the block header and offset for every block in the arena; by replicating the headers of all the blocks in one relatively small part of the arena, the server can rapidly check or rebuild the system's global block index, and the directory also facilitates error recovery if part of the arena is destroyed or corrupted. The trailer summarizes the current state of the arena itself, including the number of blocks and the size of the log. Within the arena, the data log and the directory start at opposite ends and grow towards each other. When the arena is filled, it is marked as sealed, and a fingerprint is computed for the contents of the entire arena. Sealed arenas are never modified.

The basic operation of Venti is to store and retrieve blocks based on their fingerprints. A fingerprint is 160 bits long, and the number of possible fingerprints far exceeds the number of blocks stored on a server. This disparity makes it impractical to map a fingerprint directly to a location on a storage device; instead, an index is used to locate a block within the log. The index is implemented as a disk-resident hash table, divided into fixed-size buckets, each of which is stored as a single disk block. Each bucket contains the index map for a small section of the fingerprint space. A hash function maps fingerprints to index buckets in a roughly uniform manner, and the bucket is then examined using binary search. This structure is simple and efficient, requiring one disk access to locate a block in almost all cases (a rough sketch of this scheme is given at the end of this section).

Three applications, Vac, physical backup, and use with the Plan 9 file system, are demonstrated to show the effectiveness of Venti. In addition to the Venti prototype, a collection of tools for integrity checking and error recovery was built. The authors also give preliminary performance results for read and write operations with the prototype. By using disks, they show an access time for archival data that is comparable to non-archival data. However, they also identify the main problem: uncached sequential read performance is particularly poor, because sequential reads still require random reads of the index. They point out one possible solution: read-ahead.
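As a rough illustration of the index lookup described above, the sketch below maps a fingerprint to a fixed-size bucket and binary-searches within it. The bucket count, the entry layout, and the in-memory lists standing in for disk blocks are assumptions for the sketch, not Venti's on-disk format.

    # Illustrative sketch of a disk-resident hash-table index: fixed-size
    # buckets addressed by a hash of the fingerprint, binary search inside.
    import bisect

    NBUCKETS = 1 << 16                        # assumed number of fixed-size buckets

    def bucket_of(fingerprint: bytes) -> int:
        """Map a 160-bit fingerprint to a bucket in a roughly uniform manner."""
        return int.from_bytes(fingerprint[:4], "big") % NBUCKETS

    # Each bucket is a sorted list of (fingerprint, log_offset) pairs; on disk,
    # each bucket would occupy a single disk block.
    buckets = [[] for _ in range(NBUCKETS)]

    def index_insert(fingerprint: bytes, log_offset: int) -> None:
        bisect.insort(buckets[bucket_of(fingerprint)], (fingerprint, log_offset))

    def index_lookup(fingerprint: bytes):
        """One bucket read plus a binary search yields the block's offset in the log."""
        b = buckets[bucket_of(fingerprint)]
        i = bisect.bisect_left(b, (fingerprint, 0))
        if i < len(b) and b[i][0] == fingerprint:
            return b[i][1]
        return None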
4. Improvements in Venti
Three aspects of Venti are identified as requiring improvement.

4.1 Hashing Collisions
'A Comparison Study of Deduplication Implementations with Small-Scale Workloads' addresses Venti's exposure to hash collisions. The design of Venti requires a hash function that generates a unique fingerprint for every data block that a client may want to store. For a server of a given capacity, the likelihood that two different blocks will have the same hash value (a collision) can be estimated. Although the probability of two blocks producing identical keys is extremely low, to be safe the small-scale-workloads study uses the two hash algorithms SHA-256 and MD5 simultaneously; each hash function maps into its own hash table.

4.2 Fixed-size Chunking
'A Low-bandwidth Network File System' (LBFS) addresses this problem by considering only non-overlapping chunks of files and avoiding sensitivity to shifting file offsets by setting chunk boundaries based on file contents rather than on position within the file. Insertions and deletions therefore only affect the surrounding chunks. To divide a file into chunks, LBFS examines every (overlapping) 48-byte region of the file and, with probability 2^-13 over each region's contents, considers it to be the end of a data chunk. LBFS selects these boundary regions, called breakpoints, using Rabin fingerprints. A figure in the LBFS paper shows how LBFS might divide up a file and what happens to chunk boundaries after a series of edits: (1) the original file, divided into variable-length chunks with breakpoints determined by a hash of each 48-byte region; (2) the effect of inserting some text into the file. The text is inserted in chunk c4, producing a new, larger chunk c8; all other chunks remain the same, so one need only send c8 to transfer the new file to a recipient that already has the old version. A sketch of content-defined chunking in this style is given at the end of this section.

4.3 Better Access Control
'A Low-bandwidth Network File System' uses an RPC library that supports authenticating and encrypting traffic between a client and server. The entire LBFS protocol, RPC headers and all, is passed through gzip compression, tagged with a message authentication code, and then encrypted. At mount time, the client and server negotiate a session key, the server authenticates itself to the user, and the user authenticates herself to the server, all using public-key cryptography. The client and server communicate over TCP using Sun RPC.

'POTSHARDS: Secure Long-Term Archival Storage Without Encryption' uses secret splitting and approximate pointers as a way to move security from encryption to authentication and to avoid reliance on encryption algorithms that may be compromised at some point in the future. Unlike encryption, secret splitting provides information-theoretic security. In addition, each user maintains a separate, recoverable index over her data, so a compromised index does not affect other users and a lost index is not equivalent to data deletion. More importantly, in the event that a user loses her index, both the index and the data itself can be securely reconstructed from the user's shares stored across multiple archives.
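The following sketch illustrates content-defined chunking in the style described in Section 4.2. It is not LBFS's implementation: a simple polynomial rolling hash over a 48-byte window stands in for LBFS's Rabin fingerprints, and the breakpoint test and chunk-size limits are illustrative choices for the sketch.

    # Illustrative sketch of content-defined (variable-length) chunking.
    # Assumptions: polynomial rolling hash instead of Rabin fingerprints,
    # 13-bit breakpoint mask (~8 KB average chunks), 2 KB / 64 KB size limits.
    WINDOW = 48
    MASK = (1 << 13) - 1
    MIN_CHUNK, MAX_CHUNK = 2048, 65536
    BASE, MOD = 257, (1 << 61) - 1

    def chunk_boundaries(data: bytes):
        """Yield (start, end) offsets of content-defined chunks."""
        pw = pow(BASE, WINDOW, MOD)          # weight of the byte leaving the window
        start, h = 0, 0
        for i, b in enumerate(data):
            h = (h * BASE + b) % MOD         # add the incoming byte
            if i >= WINDOW:
                h = (h - data[i - WINDOW] * pw) % MOD   # drop the outgoing byte
            size = i - start + 1
            at_breakpoint = size >= MIN_CHUNK and (h & MASK) == MASK
            if at_breakpoint or size >= MAX_CHUNK:
                yield (start, i + 1)
                start = i + 1
        if start < len(data):
            yield (start, len(data))

Because boundaries depend only on the local window contents, an insertion or deletion changes at most the chunks it touches; each resulting chunk can then be fingerprinted and deduplicated exactly as in the fixed-size sketch of Section 2.1.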
5. Conclusion
Archival data is growing exponentially, so there is a great need for systems that eliminate data duplication as effectively as possible. This paper has elaborated on Venti in depth and on its areas for improvement; three major issues with Venti were discussed, but there may be cases in which the proposed solutions fail. For hashing, the failure case occurs when SHA-256 and MD5 both produce duplicate keys for the same pair of blocks. Similarly, content-based chunking is computationally expensive, so further improvement could reduce its cost. Finally, Venti has not been evaluated in a distributed environment, which makes that an ideal candidate for future work.

6. References
[1] A. Upadhyay, P. R. Balihalli, S. Ivaturi and S. Rao, "Deduplication and Compression Techniques in Cloud Design," IEEE, 2012.
[2] B. Zhu (Data Domain, Inc.), "Avoiding the Disk Bottleneck in the Data Domain Deduplication File System," in Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST), 2008.
[3] P. Kulkarni, J. LaVoie, F. Douglis and J. Tracey, "Redundancy Elimination Within Large Collections of Files," in Proceedings of the USENIX 2004 Annual Technical Conference, 2004.
[4] D. Russell, "Data De-duplication Will Be Even Bigger in 2010," Gartner, 8 February 2010.
[5] M. W. Storer, K. M. Greenan, D. D. E. Long and E. L. Miller, "Secure Data Deduplication," in Proceedings of the 2008 ACM Workshop on Storage Security and Survivability, October 2008.
[6] "Fujitsu's Storage Systems and Related Technologies Supporting Cloud Computing," 2010. [Online]. Available: http://www.fujitsu.com/global/
[7] S. Quinlan and S. Dorward, "Venti: A New Approach to Archival Data Storage," in Proceedings of the 1st USENIX Conference on File and Storage Technologies (FAST), Monterey, CA: USENIX Association, 2002, pp. 89-101.
[8] D. Bhagwat, K. Eshghi, D. D. E. Long and M. Lillibridge, "Extreme Binning: Scalable, Parallel Deduplication for Chunk-based File Backup," in 2009 IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), 2009, pp. 237-245.
[9] J. Black, "Compare-by-hash: A Reasoned Analysis," in Proceedings of the 2006 USENIX Annual Technical Conference, 2006, pp. 85-90.
[10] D. Borthakur, "The Hadoop Distributed File System: Architecture and Design," 2007. URL: hadoop.apache.org/hdfs/docs/current/hdfs_design.pdf, accessed October 2011.