
TSINGHUA SCIENCE AND TECHNOLOGY ISSN 1007-0214 07/49 pp39-44 Volume 12, Number S1, July 2007

TSHOVER: A Novel Coding Scheme for Tolerating Triple Disk Failures in RAID/DRAID*

NA Baoyu (那宝玉)**, ZHANG Yusen (张毓森), LIU Lili (刘丽丽), LIU Peng (刘鹏)†

Institute of Command Automation, PLAUST, Nanjing 210007, China; † Research Center of Military Grid, PLAUST, Nanjing 210007, China

Abstract: This paper presents a novel method, called TSHOVER, for tolerating up to three disk failures in RAID/DRAID architectures or other reliable storage systems. TSHOVER is a two-dimensional code that combines a horizontal code and a vertical code using only simple exclusive-OR (XOR) computations. The paper introduces the concept of step ascending used in encoding and shows how it provides fault tolerance. TSHOVER offers better data recovery ability for disk network storage systems in which the number of disks changes relatively dynamically. Compared with the RS and STAR codes, TSHOVER has better encoding performance. When a data strip is updated, only 6 XOR operations are needed. Both experimental results and theoretical analyses show that TSHOVER has better performance and higher efficiency than the other algorithms.

Key words: redundant array of independent disk (RAID); DRAID; MDS codes; fault tolerance

Introduction

With the explosive growth of information and data, enterprise data have become more and more important, and lost or damaged data can cause great economic loss. How to guarantee the reliability of enterprise data storage and recover lost data rapidly has become a hot research topic. Gibson et al.[1-3] studied the principle of disk drive failure. Based on collected experimental data and an analysis of the disk failure model, they argue that a negative exponential distribution fits disk drive failures well. The research shows that under natural disasters (earthquake, fire) or war, it is not uncommon for multiple disk drives to fail simultaneously, which results in system paralysis. It is therefore critical to adopt disk fault-tolerance mechanisms that allow rapid recovery and repair of classified data and that support agencies and units for which data are as important as their lives.

The concept of the redundant array of independent disks (RAID)[1-3] was proposed for reliable data storage. RAID is a disk redundancy technology introduced by David Patterson at the University of California, Berkeley in 1988. RAID achieves relatively high read/write performance through two characteristics: data on multiple disks can be read simultaneously, which improves disk bandwidth, and all disks can seek in parallel, which reduces seek time. Besides improving performance, RAID provides a certain degree of fault tolerance. The mainstream RAID5 and RAID6 can safely recover data after one or two disk failures, respectively.

Received: 2007-02-01

* Supported by the National Natural Science Foundation of China (No. 60403043)

** To whom correspondence should be addressed. E-mail: [email protected]; Tel: 86-13813823713

With ever higher requirements on data storage reliability, many coding algorithms have been proposed to recover data under multi-disk failures. Blaum et al. introduced the EVENODD code[4] in 1995. It is the pioneering coding mechanism for tolerating 2 disk failures, and its parity data are computed using only exclusive-OR operations. Huang and Xu introduced the STAR code[5] in 2005, which extends EVENODD. As an MDS code, it tolerates 3 disk failures. Both EVENODD and STAR are nearly time-optimal. Xu and Bruck introduced X-Code in Ref. [6]. It tolerates 2 disk failures and, as an MDS code, is time-optimal. The WEAVER code[7] and the HoVer code[8], introduced by Hafner in 2005, provide high fault tolerance and are highly regarded for their unique coding style. As a true MDS code, the RS code[9-11] ensures data recovery under the failure of any m disks out of n+m disks in total. Because it is easy to implement, it is widely used in both hardware and software. Plank and Xu introduced the Cauchy RS code[12]. It improves on the RS code by carrying out finite-field multiplication and division using only exclusive-OR operations, which greatly improves time performance compared with the RS code.

This paper presents a novel method called TSHOVER for tolerating up to three disk failures in RAID/DRAID architectures or other reliable storage systems. As an MDS code, TSHOVER is time-optimal and, according to our comparison, outperforms the RS code and the STAR code.

1 Basic Concepts

Unless otherwise noted, the terms used in this paper are defined as follows.

Element a fundamental unit of data or parity; this is the building block of the erasure code. In coding theory, this is the data that is assigned to a bit within a symbol. For XOR-based codes, this is typically a set of sequential sectors and is the maximally sized data unit of an XOR formula.

Strip a unit of storage consisting of all contiguous elements (data, parity or both) from the same disk and stripe. In coding theory, this is associated with a code symbol. It is sometimes called a stripe unit. The set of strips in a code instance forms a stripe. Typically, the strips are all of the same size (contain the same number of elements).

Stripe a complete (connected) set of data and parity elements that are dependently related by parity computation relations. In coding theory, this is a code word; we use “code instance” synonymously.

Stack Assembly of several stripes. Those stripes contain the same number of strips.

Array A collection of disks on which one or more instances of a RAID erasure code are implemented.

Horizontal code An erasure code in which each strip contains either data elements or parity elements, never both.

Vertical code An erasure code in which a (typical) strip contains both data elements and parity elements.

Step The stride from one data strip to the next. We use S to denote it. Figure 1 shows different kinds of steps.

Fig. 1 Sketch of step

2 Encoding/Decoding Procedures of TSHOVER

The TSHOVER code is designed to tolerate triple disk failures. This novel scheme improves on codes such as RAID4, RAID5, and EVENODD, whose fault tolerance is limited to one or two disks. Only XOR operations are used for encoding and decoding, which gives the TSHOVER code good time performance.

2.1 Encoding

The data used in the encoding scheme are defined as follows:

D(i, j): the strip in the i-th row and j-th column of the original data block, where i = 0, 1, ..., r−1 and j = 0, 1, ..., n−1;

U(k) and V(m): the vertical parity data of the k-th and m-th disks, respectively, where k, m = 0, 1, ..., n−1;

H(g): the horizontal parity data, where g = 0, 1, ..., r−1.

For example, let n=7 and r=4. The rules of the TSHOVER encoding scheme are described in Fig. 2. As Fig. 2 shows, each data strip in the array is labeled with three numbers: the first two identify the corresponding vertical parity strips (U and V, in that order), and the last identifies the corresponding horizontal parity strip.

2:5:1   3:6:1   4:0:1   5:1:1   6:2:1   0:3:1   1:4:1   H(0)
3:4:2   4:5:2   5:6:2   6:0:2   0:1:2   1:2:2   2:3:2   H(1)
4:3:3   5:4:3   6:5:3   0:6:3   1:0:3   2:1:3   3:2:3   H(2)
5:2:4   6:3:4   0:4:4   1:5:4   2:6:4   3:0:4   4:1:4   H(3)
U(0)    U(1)    U(2)    U(3)    U(4)    U(5)    U(6)
V(0)    V(1)    V(2)    V(3)    V(4)    V(5)    V(6)

Fig. 2 Framework of layout for TSHOVER encoding

The mathematical expressions of the TSHOVER code are:

$$U(i) = \bigoplus_{k=0}^{r-1} D\bigl(r-k,\ \operatorname{mod}_n(i+k+2)\bigr) \quad (\text{a diagonal of step} = 1) \tag{1}$$

$$V(j) = \bigoplus_{k=0}^{r-1} D\bigl(r-k,\ \operatorname{mod}_n(j-k-2)\bigr) \quad (\text{a diagonal of step} = -1) \tag{2}$$

$$H(g) = \bigoplus_{k=0}^{n-1} D(g, k) \quad (\text{horizontal code, step} = 1) \tag{3}$$

From formulae (1), (2), and (3), it is clear that each parity strip involves at most one data strip from any other disk and no data strip from its own disk. Therefore, all parity data are independent of each other.
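The encoding rule can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes 0-based rows and columns and reads the parity indices directly from the labels of Fig. 2, so that D(i, j) contributes to U((j+i+2) mod n), V((j−i−2) mod n), and H(i). The function names `tshover_encode` and `fig2_labels` and the use of Python integers as strips are choices made for this sketch.

```python
def tshover_encode(data, n, r):
    """Compute (U, V, H) for an r x n block of integer strips.

    Assumption of this sketch: indices are 0-based and follow Fig. 2,
    i.e. data[i][j] feeds U[(j+i+2) % n], V[(j-i-2) % n] and H[i].
    """
    U = [0] * n
    V = [0] * n
    H = [0] * r
    for i in range(r):
        for j in range(n):
            U[(j + i + 2) % n] ^= data[i][j]   # first vertical parity
            V[(j - i - 2) % n] ^= data[i][j]   # second vertical parity
            H[i] ^= data[i][j]                 # horizontal parity of row i
    return U, V, H


def fig2_labels(n, r):
    """Rebuild the 'u:v:h' label of every data strip as printed in Fig. 2."""
    return [[f"{(j + i + 2) % n}:{(j - i - 2) % n}:{i + 1}" for j in range(n)]
            for i in range(r)]


if __name__ == "__main__":
    n, r = 7, 4
    for row in fig2_labels(n, r):
        print("  ".join(row))
    # Give every strip a distinct bit so parity membership is visible.
    data = [[1 << (i * n + j) for j in range(n)] for i in range(r)]
    U, V, H = tshover_encode(data, n, r)
    print(U[1] == (data[3][3] ^ data[2][4] ^ data[1][5] ^ data[0][6]))
```

With n=7 and r=4 the printed labels match the four data rows of Fig. 2, and no strip ever contributes to a vertical parity strip stored on its own disk, because the column offset i+2 is never a multiple of n when r≤n−3.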

Theorem 1 The coding scheme shown in Fig. 2 produces an (r+2)×(n+1) matrix. The matrix can restore any damaged data on 3 disks if and only if r≤n−3.

Proof Formulae (1) and (2) produce 2n vertical parity data altogether, and formula (3) produces r horizontal parity data. Since the parity data produced by the different codes are independent of each other, Eqs. (1)-(3) form a system of 2n+r mutually independent coding equations, so the rank of the system is z=2n+r. When three disks are lost, the number of strips lost is 3(r+2). The lost data can be recovered by Gaussian elimination only if z≥3(r+2). So 2n+r≥3(r+2), that is, r≤n−3. This completes the proof.
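As a quick numerical check of the counting step, take the layout of Fig. 2 with n=7 and r=4 (the instance is chosen here only for illustration):

$$2n + r = 2\cdot 7 + 4 = 18, \qquad 3(r+2) = 3\cdot 6 = 18,$$

so the bound 2n+r≥3(r+2) holds with equality, and r=4=n−3 is the largest number of data rows that the inequality admits for seven data disks.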

Only exclusive-OR computation is used in the coding computation. All parity data are mutually independent, so only three parity strips need to be updated when an original data strip is updated, which gives the TSHOVER code optimal encoding and update performance. The parity data produced by the vertical code are stored on different disks, which gives the TSHOVER code better concurrent storage performance and makes it especially suitable for distributed storage systems.
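The three-parity update can be illustrated with a short Python sketch. It assumes the 0-based Fig. 2 indexing, in which D(i, j) belongs to U((j+i+2) mod n), V((j−i−2) mod n), and H(i); the names `tshover_encode` and `update_strip` are hypothetical helpers introduced only for this example. Each affected parity strip is patched with two XORs (remove the old value, add the new one), which gives the six operations quoted in the text.

```python
def tshover_encode(data, n, r):
    """Reference encoder (0-based Fig. 2 indexing, an assumption of this sketch)."""
    U, V, H = [0] * n, [0] * n, [0] * r
    for i in range(r):
        for j in range(n):
            U[(j + i + 2) % n] ^= data[i][j]
            V[(j - i - 2) % n] ^= data[i][j]
            H[i] ^= data[i][j]
    return U, V, H


def update_strip(data, U, V, H, i, j, new_value):
    """Overwrite data[i][j] and patch its three parity strips in place.

    Returns the number of XOR operations performed.
    """
    n = len(U)
    old_value = data[i][j]
    xor_count = 0
    targets = ((U, (j + i + 2) % n), (V, (j - i - 2) % n), (H, i))
    for parity, index in targets:
        parity[index] ^= old_value   # cancel the old contribution
        parity[index] ^= new_value   # add the new contribution
        xor_count += 2
    data[i][j] = new_value
    return xor_count


if __name__ == "__main__":
    n, r = 7, 4
    data = [[(17 * i + 31 * j + 5) % 251 for j in range(n)] for i in range(r)]
    U, V, H = tshover_encode(data, n, r)
    print(update_strip(data, U, V, H, 2, 5, 0xAB))     # XORs used for one update
    print((U, V, H) == tshover_encode(data, n, r))     # parity still consistent
```

No other strip has to be read during the update, which is why the cost does not depend on n or r.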

2.2 Decoding

This section discusses how to recover data when disks in the storage system have failed. For the (r+2)×(n+1) matrix, if one disk has failed, say disk Dk (0≤k≤n−1), the data on disk Dk can be recovered as:

$$D(i,k) = D(i,0) \oplus D(i,1) \oplus \cdots \oplus D(i,k-1) \oplus D(i,k+1) \oplus \cdots \oplus D(i,n-1) \oplus H(i), \quad 0 \le i \le r-1$$

Formulae (1) and (2) can then be used to recover the vertical parity data on disk Dk. When k=n, the data can be recovered with formula (3), which is equivalent to a horizontal encoding.
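The single-failure case can be sketched as follows. This is a hedged illustration under the same assumptions as Fig. 2 (0-based indices, D(i, j) in U((j+i+2) mod n), V((j−i−2) mod n), and H(i)); `recover_data_disk` and `recover_parity_disk` are hypothetical names, and lost strips are represented by `None`.

```python
def tshover_encode(data, n, r):
    """Reference encoder (0-based Fig. 2 indexing, an assumption of this sketch)."""
    U, V, H = [0] * n, [0] * n, [0] * r
    for i in range(r):
        for j in range(n):
            U[(j + i + 2) % n] ^= data[i][j]
            V[(j - i - 2) % n] ^= data[i][j]
            H[i] ^= data[i][j]
    return U, V, H


def recover_data_disk(data, U, V, H, k):
    """Rebuild data disk k (its data strips, U[k] and V[k]) in place."""
    r, n = len(data), len(U)
    # Data strips: XOR the surviving strips of each row with H(i).
    for i in range(r):
        value = H[i]
        for j in range(n):
            if j != k:
                value ^= data[i][j]
        data[i][k] = value
    # Vertical parity on disk k uses no data from disk k, so re-encode it.
    U[k], V[k] = 0, 0
    for i in range(r):
        U[k] ^= data[i][(k - i - 2) % n]
        V[k] ^= data[i][(k + i + 2) % n]


def recover_parity_disk(data):
    """Rebuild the horizontal parity disk (the case k = n) by re-encoding."""
    H = []
    for row in data:
        value = 0
        for strip in row:
            value ^= strip
        H.append(value)
    return H


if __name__ == "__main__":
    n, r, k = 7, 4, 3
    data = [[(17 * i + 31 * j + 5) % 251 for j in range(n)] for i in range(r)]
    U, V, H = tshover_encode(data, n, r)
    damaged = [row[:] for row in data]
    for row in damaged:
        row[k] = None                      # disk k is lost
    dU, dV = U[:], V[:]
    dU[k] = dV[k] = None
    recover_data_disk(damaged, dU, dV, H, k)
    print(damaged == data and dU == U and dV == V)
```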

If two disks have failed, say disks Di and Dj, there are two cases: 1) 0≤i≤n−1, j=n. Since each vertical parity strip involves only one data strip on each data disk and no strip on its own disk, either formula (1) or formula (2) can be used to recover the data on disk Di. Formula (3) can then be used to recover the data on disk Dj. 2) 0≤i, j≤n−1. This case is similar to data recovery in X-Code after a double disk failure and is not described further.

The focus of this section is the case in which three disks have failed. There are two subcases: 1) the horizontal parity disk has not failed; 2) the horizontal parity disk has failed. The former is harder to decode; the latter is a special situation and is much simpler. We discuss the two cases in turn.

2.2.1 Horizontal parity data disk has not failed

Since our coding scheme is cyclic, we may assume without loss of generality that the 0th, i-th, and j-th disks have failed. Take the vertical parity strips U(1) and V(1) on the first non-failed disk (i.e., the second disk). The corresponding encoding equations of U(1) and V(1) are:

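$$U(1) = D(r,3) \oplus D(r-1,4) \oplus \cdots \oplus D(r-i+3,i) \oplus \cdots \oplus D(r-j+3,j) \oplus \cdots \oplus D(2,r+1) \oplus D(1,r+2)$$

$$V(1) = D(1,3) \oplus D(2,4) \oplus \cdots \oplus D(i-2,i) \oplus \cdots \oplus D(j-2,j) \oplus \cdots \oplus D(r-1,r+1) \oplus D(r,r+2).$$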

Moving the lost strips to the left-hand side and the surviving strips to the right-hand side gives two equations that describe the lost data strips:

$$D(r-i+3,i) \oplus D(r-j+3,j) = D(r,3) \oplus D(r-1,4) \oplus \cdots \oplus D(2,r+1) \oplus D(1,r+2) \oplus U(1) \tag{4}$$

$$D(i-2,i) \oplus D(j-2,j) = D(1,3) \oplus D(2,4) \oplus \cdots \oplus D(r-1,r+1) \oplus D(r,r+2) \oplus V(1) \tag{5}$$



In Eqs. (4) and (5), the unknown data are on the left-hand side, and the right-hand side is a constant computed from known data. We search the non-lost disks sequentially to find the equations of the corresponding lost strips. Altogether 2(n−3) equations are found up to the n-th disk. Adding the r equations of the horizontal code gives a system M made up of 2(n−3)+r=2n+r−6 equations:

$$
\begin{cases}
D(r-i+3,i) \oplus D(r-j+3,j) = D(r,3) \oplus D(r-1,4) \oplus \cdots \oplus D(2,r+1) \oplus D(1,r+2) \oplus U(1),\\
D(i-2,i) \oplus D(j-2,j) = D(1,3) \oplus D(2,4) \oplus \cdots \oplus D(r-1,r+1) \oplus D(r,r+2) \oplus V(1),\\
D(r-i+4,i) \oplus D(r-j+4,j) = D(r,4) \oplus D(r-1,5) \oplus \cdots \oplus D(2,r+2) \oplus D(1,r+3) \oplus U(2),\\
D(i-3,i) \oplus D(j-3,j) = D(1,4) \oplus D(2,5) \oplus \cdots \oplus D(r-1,r+2) \oplus D(r,r+3) \oplus V(2),\\
\qquad\vdots\\
D(r-i+1,i) \oplus D(r-j+1,j) = D(r,1) \oplus D(r-1,2) \oplus \cdots \oplus D(2,r-1) \oplus D(1,r) \oplus U(n),\\
D(i-1,i) \oplus D(j-1,j) = D(1,2) \oplus D(2,3) \oplus \cdots \oplus D(r-1,r) \oplus D(r,r+1) \oplus V(n)
\end{cases}
$$

In total, 3r original data strips are lost. Since n≥r+3, we have 2n+r−6≥2(r+3)+r−6=3r, so the original data strips can be restored from these equations.

We say that the original data corresponding to a pair of parity strips U(k) and V(k) form a “cross”. The decoding procedure is shown in Fig. 3.

Example 1: Assume n=11 and r=8, and consider the data recovery procedure when the first, third, and sixth disks are lost.

The disk array under the conditions of Example 1 is shown in Fig. 4.

Figure 4 marks the vertical parity strips U(1) and V(1) of the first non-lost disk and the corresponding original data. The strips in U(1) = D(3,3) ⊕ D(2,4) ⊕ D(1,5) ⊕ D(0,6) are all marked by “♠” signs. Similarly, the strips in V(1) = D(0,3) ⊕ D(1,4) ⊕ D(2,5) ⊕ D(3,6) are all marked by “♥” signs. Each equation contains only one lost data strip, so D(1,5) and D(2,5) can be recovered with the equations:

$$
\begin{cases}
D(1,5) = D(3,3) \oplus D(2,4) \oplus D(0,6) \oplus U(1)\\
D(2,5) = D(0,3) \oplus D(1,4) \oplus D(3,6) \oplus V(1)
\end{cases}
$$

Fig. 3 Processing chart of TSHOVER code decoding

Fig. 4 The first, third, and sixth disks failed

Build the equations for disks 3, 4, and 6 (the parity strips U(3), V(3), U(4), V(4), U(6), and V(6)), respectively:

$$
\begin{cases}
D(3,5) \oplus D(2,6) \oplus D(1,0) \oplus D(0,1) = U(3),\\
D(0,5) \oplus D(1,6) \oplus D(2,0) \oplus D(3,1) = V(3),\\
D(3,6) \oplus D(2,0) \oplus D(1,1) \oplus D(0,2) = U(4),\\
D(0,6) \oplus D(1,0) \oplus D(2,1) \oplus D(3,2) = V(4),\\
D(3,1) \oplus D(2,2) \oplus D(1,3) \oplus D(0,4) = U(6),\\
D(0,1) \oplus D(1,2) \oplus D(2,3) \oplus D(3,4) = V(6)
\end{cases}
$$

The equations of U(6) and V(6) each contain only one lost strip, D(2,2) and D(1,2) respectively, so these strips can be recovered directly. Since D(1,5), D(2,5), D(2,2), and D(1,2) have now been recovered, D(1,0) and D(2,0) can be recovered through the horizontal parity strips H(1) and H(2). Each of U(3), V(3), U(4), and V(4) then corresponds to only one lost strip, so D(0,2), D(0,5), D(3,2), and D(3,5) can be recovered through these equations. The last two strips, D(0,0) and D(3,0), can be recovered with the horizontal parity strips H(0) and H(3). Thus, all data have been recovered.

2.2.2 Horizontal parity disk has failed

If the horizontal parity disk has failed, the other two failed disks are among the first n disks. This case is equivalent to the second case of the double disk failure discussed above, so the data can be recovered.
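The recovery order traced in Example 1 (Section 2.2.1) can be reproduced with a simple peeling loop: repeatedly pick a surviving parity equation that contains exactly one unknown strip and solve it. The sketch below is an illustration only. It assumes n=7 and r=4, the dimensions of Fig. 2, with which the strip indices in the equations of Example 1 are consistent; it uses 0-based disk numbers 0, 2, and 5 for the failed disks; and it does not claim that peeling alone succeeds for every failure pattern (the paper's general argument relies on Gaussian elimination). The names `tshover_encode` and `peel_decode` are hypothetical.

```python
def tshover_encode(data, n, r):
    """Reference encoder (0-based Fig. 2 indexing, an assumption of this sketch)."""
    U, V, H = [0] * n, [0] * n, [0] * r
    for i in range(r):
        for j in range(n):
            U[(j + i + 2) % n] ^= data[i][j]
            V[(j - i - 2) % n] ^= data[i][j]
            H[i] ^= data[i][j]
    return U, V, H


def peel_decode(data, U, V, H, failed):
    """Fill in lost data strips (None entries) in place.

    Only parity strips on surviving disks are used; the horizontal parity
    disk is assumed intact. Returns the list of (strip, parity used).
    """
    r, n = len(data), len(U)
    equations = []
    for k in range(n):
        if k in failed:
            continue                      # U(k), V(k) were lost with disk k
        equations.append((f"U({k})", U[k], [(i, (k - i - 2) % n) for i in range(r)]))
        equations.append((f"V({k})", V[k], [(i, (k + i + 2) % n) for i in range(r)]))
    for i in range(r):
        equations.append((f"H({i})", H[i], [(i, j) for j in range(n)]))

    order = []
    progress = True
    while progress:
        progress = False
        for name, parity, cells in equations:
            unknown = [(i, j) for i, j in cells if data[i][j] is None]
            if len(unknown) != 1:
                continue                  # nothing to solve, or not solvable yet
            value = parity
            for i, j in cells:
                if data[i][j] is not None:
                    value ^= data[i][j]
            i, j = unknown[0]
            data[i][j] = value
            order.append(((i, j), name))
            progress = True
    return order


if __name__ == "__main__":
    n, r, failed = 7, 4, {0, 2, 5}
    data = [[(17 * i + 31 * j + 5) % 251 for j in range(n)] for i in range(r)]
    U, V, H = tshover_encode(data, n, r)
    damaged = [[None if j in failed else data[i][j] for j in range(n)] for i in range(r)]
    for strip, parity in peel_decode(damaged, U, V, H, failed):
        print(strip, "recovered via", parity)
    print(damaged == data)
```

For this failure pattern the loop first solves D(1,5) from U(1) and D(2,5) from V(1), as in the example, and ends with D(0,0) and D(3,0) from the horizontal parity. The lost vertical parity strips on the failed disks are not rebuilt here; once all data strips are known they can be re-encoded with formulae (1) and (2).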

3 Performance and Efficiency Analysis

To verify the encoding performance of TSHOVER, the EVENODD, STAR, and RS algorithms are chosen for comparison. Assume there are n disks and the size of each strip is 1 byte. Figure 5 shows the encoding performance of EVENODD, STAR, RS, and TSHOVER.

Fig. 5 EVENODD, RS, TSHOVER, and STAR encoding performance

As shown in Fig. 5, the number of XOR operations needed by every coding algorithm increases with n. Because the RS code requires finite-field conversions, its number of XOR operations increases sharply with the number of disks, much faster than for the other algorithms, so it has the lowest performance. EVENODD needs only two parity disks and therefore the fewest XOR operations. STAR needs to compute the compensation factors S1 and S2. For the second and third parity disks, each parity strip must incorporate the corresponding compensation factor, so STAR needs more XOR operations than TSHOVER and EVENODD. For the TSHOVER code with r=n−3, the number of XOR operations needed to compute the horizontal parity strips is 8(n−3)(n−1), and the number needed to compute U and V is 2×8(n−4)n. Therefore, altogether 8×(n−3)×(n−1)+2×8×(n−4)×n=8(3n²−12n+3) XOR operations are needed.

The data strip update performance of each algorithm is shown in Table 1.

Table 1 EVENODD, RS, TSHOVER, and STAR update performance

Algorithm   Number of XOR operations/bit                     Fault tolerance
RS          6                                                3
EVENODD     Not part of the calculation of S: 4              2
            Part of the calculation of S: 2p+2
STAR        Not part of the calculation of S1 and S2: 6      3
            Part of the calculation of S1 or S2: 2p+4
            Part of the calculation of S1 and S2: 4p+2
TSHOVER     6                                                3

Table 1 shows the number of XOR operations needed to implement a single-bit update. For the RS and TSHOVER codes, three parity strips must be updated when a data strip is updated, so altogether 6 XOR operations are needed. There are two cases for the EVENODD algorithm: (1) If the updated data are not part of the compensation factor S, only the one horizontal parity strip and the one diagonal parity strip that cross it are updated, so altogether 4 XOR operations are needed. (2) If the updated data are part of the compensation factor S, then S and all parity strips that were computed with S need to be updated, for a total of 4+2(p−1)=2p+2 operations. There are three cases for the STAR algorithm: (1) If the updated data strip is not part of the compensation factors S1 and S2, only one horizontal parity strip and the two diagonal parity strips that cross it need to be updated, so altogether 6 XOR operations are needed. (2) If the updated data strip is part of the compensation factor S1 or S2, the corresponding horizontal parity strip, the corresponding compensation factor, and all parity strips that were computed with this compensation factor need to be updated, for a total of 2p+4 operations. (3) If the updated data strip is part of both compensation factors S1 and S2, the corresponding horizontal parity strip, S1, S2, and all parity strips that were computed with S1 and S2 need to be updated, for a total of 4p+2 operations.
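The cost figures used in this section can be tabulated with a few helper functions. The sketch below only restates the formulas given in the text and in Table 1; the function names, the case labels ("outside", "one", "both"), and the dictionary layout are assumptions made for this illustration, and p is the parameter that appears in the EVENODD and STAR rows of Table 1.

```python
def tshover_encoding_xors(n, bits_per_strip=8):
    """XOR count for encoding one TSHOVER stripe with r = n - 3 rows.

    Horizontal parity: r strips, each the XOR of n data strips (n - 1 XORs).
    Vertical parity:   2n strips (U and V), each the XOR of r data strips
                       (r - 1 = n - 4 XORs).
    """
    r = n - 3
    horizontal = bits_per_strip * r * (n - 1)
    vertical = 2 * bits_per_strip * n * (r - 1)
    return horizontal + vertical


# Update cost per bit as listed in Table 1. Case labels are hypothetical:
#   "outside": updated strip is in no compensation factor
#   "one":     strip is part of S (EVENODD) or of S1 or S2 (STAR)
#   "both":    strip is part of S1 and S2 (STAR only)
UPDATE_COST = {
    "RS": {"outside": lambda p: 6},
    "TSHOVER": {"outside": lambda p: 6},
    "EVENODD": {"outside": lambda p: 4, "one": lambda p: 2 * p + 2},
    "STAR": {
        "outside": lambda p: 6,
        "one": lambda p: 2 * p + 4,
        "both": lambda p: 4 * p + 2,
    },
}


def update_xors(code, case="outside", p=None):
    """Look up the number of XOR operations per updated bit."""
    return UPDATE_COST[code][case](p)


if __name__ == "__main__":
    for n in (7, 11, 15):
        print(n, tshover_encoding_xors(n), 8 * (3 * n * n - 12 * n + 3))
    print(update_xors("TSHOVER"), update_xors("STAR", "both", p=5))
```

The first helper counts the horizontal and vertical contributions separately, so its agreement with the closed form 8(3n²−12n+3) is a check on the expansion rather than a restatement of it.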



4 Conclusions

This paper has presented the TSHOVER code for tolerating up to three disk failures. The algorithm works well for disk recovery in RAID or DRAID architectures. Its storage efficiency and performance are better than those of the RS, EVENODD, and STAR codes. The concept of “step increasing” is introduced, which can help to achieve highly fault-tolerant disk arrays or better data recovery in network storage. This paper uses only a single step increase to tolerate triple disk failures. The algorithm can be easily extended to provide higher fault tolerance through further step increasing.

References

[1] Patterson D A, Gibson G A, Katz R H. A case for redundant arrays of inexpensive disks (RAID). In: Proc. of International Conference on Management of Data (SIGMOD). Chicago, IL: ACM Press, 1988: 109-116.

[2] Chen P M, Lee E K, Gibson G A. RAID: high-performance, reliable secondary storage. ACM Computing Surveys, 1994, 26(2): 145-185.

[3] Gibson G A. Redundant Disk Arrays: Reliable, Parallel Secondary Storage. Cambridge: The MIT Press, 1992.

[4] Blaum M, Brady J, Bruck J. EVENODD: An efficient scheme for tolerating double disk failures in RAID architectures. IEEE Transactions on Computers, 1995, 44(2): 192-202.

[5] Huang Cheng, Xu Lihao. STAR: An efficient coding scheme for correcting triple storage node failures. In: FAST-2005: 4th Usenix Conference on File and Storage Technologies, December, 2005.

[6] Xu Lihao, Bruck J. X-Code: MDS array codes with optimal encoding. IEEE Transactions on Information Theory, 1999, 45(1): 272-276.

[7] Hafner J L. WEAVER codes: Highly fault tolerant erasure codes for storage systems. In: FAST-2005: 4th Usenix Conference on File and Storage Technologies, December, 2005.

[8] Hafner J L. HoVer: Erasure codes for disk arrays. Research Report RJ10352 (A0507-015). IBM Research Division, July, 2005.

[9] Wicker S B, Bhargava V K. Reed-Solomon Codes and Their Applications. New York: IEEE Press, 1994.

[10] Plank J S. A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like systems. Software: Practice & Experience, 1997, 27(9): 995-1012.

[11] Plank J S, Ding Ying. Note: Correction to the 1997 tutorial on Reed-Solomon coding. Software: Practice & Experience, 2005, 35(2): 189-194.

[12] Plank J S, Xu Lihao. Optimizing Cauchy Reed-Solomon codes for fault-tolerant network storage applications. In: The 5th IEEE International Symposium on Network Computing and Applications (IEEE NCA06). Cambridge, MA, July, 2006.
