on the effect of a configuration choice on the performance of a mirrored storage system

14
J. Parallel Distrib. Comput. 65 (2005) 382 – 395 www.elsevier.com/locate/jpdc On the effect of a configuration choice on the performance of a mirrored storage system Eitan Bachmat a , , Tao Kai Lam b a Department of Computer Science, Ben-Gurion U., Beer Sheva 84105, Israel b EMC Corporation, 228 South St., Hopkinton, MA 01748, USA Received 27 December 1999; received in revised form 24 August 2004 Available online 5 January 2005 Abstract In this paper, we study the effect of various configuration choices on the performance of a mirrored disk array.We introduce a large class of semistructured configurations which provide good seek minimization and bandwidth. We study their properties with respect to both online and offline mirroring algorithms. We also prove a theorem concerning the load balancing capabilities of random or semistructured configurations with several data copies. We show that almost the entire bandwidth of the storage system can be exploited once we increase the number of copies. In an appendix we explain a common error which appears in much of the literature concerning this subject. © 2004 Elsevier Inc. All rights reserved. Keywords: Mirrored storage systems; Performance analysis; Data layout 1. Introduction A Mirrored storage system is a storage system which con- tains more than one copy of each data element, different copies residing on different storage devices within the sys- tem. A configuration is a choice of data layout on the storage devices. In this paper, we study the effects of a configura- tion choice on the performance behavior of a mirrored stor- age system. The performance of mirrored storage systems has been the subject of many papers. Papers which study seek minimization in mirrored systems [12,21], usually as- sume that the system consists of pairs of identical disks and the two data copies reside at identical locations on the two disks. Such a configuration is known as physical mirroring or disk shadowing. Other configurations which arrange the data copies at different locations or use more disks have not been explicitly explored in the context of seek mini- mization. In other studies which explore the bandwidth and Corresponding author. E-mail addresses: [email protected] (E. Bachmat), [email protected] (T.K. Lam). 0743-7315/$ - see front matter © 2004 Elsevier Inc. All rights reserved. doi:10.1016/j.jpdc.2004.11.003 load balancing capabilities of mirrored systems [15,31,32], it is usually assumed that the configuration is random, mean- ing that the location of the two data copies is arbitrary and independent of each other, again other configurations have not been explicitly considered. As will be shown physical mirroring is indeed a good, but not optimal, configuration for seek minimization, on the other hand it is essentially the worst configuration at providing high bandwidth. Con- versely, random configurations turn out to be optimal at pro- viding high bandwidth but are essentially the worst config- urations when it comes to seek minimization. In this paper, we introduce a large new class of config- urations which we term semistructured configurations.A generic configuration in this class is a hybrid of the physi- cally mirrored configuration and a random configuration. We will show that these configurations inherit the good proper- ties of both the physically mirrored and random configura- tions. They are good at both seeking minimization and load balancing. We will also consider some other configurations which have been suggested in the past such as interleaved declustering [33] and group rotate [15], as well as mirrored systems with more then two data copies. The latter have

Upload: eitan-bachmat

Post on 26-Jun-2016

212 views

Category:

Documents


0 download

TRANSCRIPT

J. Parallel Distrib. Comput. 65 (2005) 382 – 395www.elsevier.com/locate/jpdc

On the effect of a configuration choice on the performance of a mirroredstorage system

Eitan Bachmata,∗, Tao Kai Lamb

aDepartment of Computer Science, Ben-Gurion U., Beer Sheva 84105, IsraelbEMC Corporation, 228 South St., Hopkinton, MA 01748, USA

Received 27 December 1999; received in revised form 24 August 2004Available online 5 January 2005

Abstract

In this paper, we study the effect of various configuration choices on the performance of a mirrored disk array. We introduce a large classof semistructuredconfigurations which provide good seek minimization and bandwidth. We study their properties with respect to bothonline and offline mirroring algorithms. We also prove a theorem concerning the load balancing capabilities of random or semistructuredconfigurations with several data copies. We show that almost the entire bandwidth of the storage system can be exploited once we increasethe number of copies. In an appendix we explain a common error which appears in much of the literature concerning this subject.© 2004 Elsevier Inc. All rights reserved.

Keywords:Mirrored storage systems; Performance analysis; Data layout

1. Introduction

A Mirrored storage system is a storage system which con-tains more than one copy of each data element, differentcopies residing on different storage devices within the sys-tem. Aconfigurationis a choice of data layout on the storagedevices. In this paper, we study the effects of a configura-tion choice on the performance behavior of a mirrored stor-age system. The performance of mirrored storage systemshas been the subject of many papers. Papers which studyseek minimization in mirrored systems[12,21], usually as-sume that the system consists of pairs of identical disks andthe two data copies reside at identical locations on the twodisks. Such a configuration is known asphysical mirroringor disk shadowing. Other configurations which arrange thedata copies at different locations or use more disks havenot been explicitly explored in the context of seek mini-mization. In other studies which explore the bandwidth and

∗ Corresponding author.E-mail addresses:[email protected](E. Bachmat),

[email protected](T.K. Lam).

0743-7315/$ - see front matter © 2004 Elsevier Inc. All rights reserved.doi:10.1016/j.jpdc.2004.11.003

load balancing capabilities of mirrored systems[15,31,32],it is usually assumed that the configuration israndom, mean-ing that the location of the two data copies is arbitrary andindependent of each other, again other configurations havenot been explicitly considered. As will be shown physicalmirroring is indeed a good, but not optimal, configurationfor seek minimization, on the other hand it is essentiallythe worst configuration at providing high bandwidth. Con-versely, random configurations turn out to be optimal at pro-viding high bandwidth but are essentially the worst config-urations when it comes to seek minimization.

In this paper, we introduce a large new class of config-urations which we termsemistructured configurations. Ageneric configuration in this class is a hybrid of the physi-cally mirrored configuration and a random configuration. Wewill show that these configurations inherit the good proper-ties of both the physically mirrored and random configura-tions. They are good at both seeking minimization and loadbalancing. We will also consider some other configurationswhich have been suggested in the past such as interleaveddeclustering[33] and group rotate[15], as well as mirroredsystems with more then two data copies. The latter have

E. Bachmat, T.K. Lam / J. Parallel Distrib. Comput. 65 (2005) 382–395 383

become increasingly feasible in recent years due to lowerstorage prices.

1.1. Organization of the paper and results

The paper is organized as follows:In Section 2, we define some of the standard configura-

tions which have been introduced in the literature and in-troduce the new class of semistructured configurations. Wealso consider briefly the seek function and user access pat-tern which we shall be using.

In Section 3 we consider seek minimization as the tar-get function. There are two types of algorithms whichcan be used to attain the goal of seek minimization, on-line algorithms and offline algorithms. We consider avery standard online algorithm,the nearest server algo-rithm. We present simulation evidence that the optimalconfiguration is obtained by exchanging the locations ofthe data in the outer and inner radial halves of the disk.We also show that semistructured configurations clearlyoutperform random configurations which seem to be theleast desirable configurations from a seek minimizationperspective.

We then consider offline algorithms. In this setting weshow that semistructured configurations are precisely theclass of configurations which allow optimal seek minimiza-tion. This result which is proved for any number of datacopies is in fact the one which motivated the definition ofthis class of configurations and the entire paper. We alsocompare the performance of online and offline algorithmsand show that they do not differ by much. We end the sec-tion with an explanation of why it is interesting to considergeneric elements in the class of semistructured configura-tions and not just some well chosen examples which displaygood performance.

In Section 4 we consider load balancing or bandwidth asthe target function. Following the line of research initiatedin [31,32], we introduce a new metric which measures theload balancing capabilities of a configuration. We show thatrandom and semistructured configurations display optimalload balancing capabilities with respect to this metric. An-other configuration, group rotate, which was introduced in[15] is also shown to be optimal. On the other hand phys-ical mirroring is shown to be very bad at load balancing.We then show that as the number of data copies becomeslarge the entire potential bandwidth of the storage systemcan be exploited concurrently and efficiently provided thatthe configuration is random or semistructured. Experimentalevidence suggests that 3 copies are in fact enough.

We conclude from all our results that semistructured con-figurations strike a good balance between the properties ofphysical mirroring and random configurations.

In an appendix to the paper we explain a common errorwhich appears in many of the previous works on onlinealgorithms for mirroring.

1.2. Related work

The seek minimization properties of physical mirroringwith 2 disks with respect to the nearest server algorithm havebeen studied extensively in[12,21]. As we noted there aremany other papers which deal with this and related subjectswhich contain a basic error[5–7,9,10,26,25,28,29]. Amongthose whose results remain qualitatively correct despite theerror one should note[25] which considers synchronouswrites as well as reads and[26] which provides a compar-ison of online and offline algorithms all assuming physicalmirroring.

Random configurations in the context of load balancingare explored in[32,31,1]. The techniques which we use inour discussion of load balancing are similar to those whichare presented in these references.

Load balancing and reliability properties of mirroredconfigurations have also been studied extensively in thespecific context of multimedia applications, mostly usingexperimental methods[3,8,14,18,19,24,27]. Bandwidth isthe metric of choice in these applications and seek mini-mization is not considered.

Chen and Towsley[15], experimentally study the prob-lem of choosing a configuration for a mirrored system. Theirresults are summarized in Fig. 11 in loc. sit. one of themotivations of the current paper is to provide a theoreticalframework which will help explain some of the phenomenonobserved in[15]. In addition our results can be used to an-alyze some of the configurations they have not studied.

2. Configurations for mirrored systems

In this section we introduce the configurations which willbe considered in the paper. Along with some well-knownconfigurations we introduce the notion of a semistructuredconfiguration. We will also briefly discuss the seek functionand user access pattern which will be used in our experi-ments and results.

2.1. Configurations

Let n, h, d be integers such thatnh is divisible byd. LetL = nh/d. A configurationwith parameters(N, h, d), con-sists of anh by nmatrixM whose entries are integers in therange 0, . . . , L − 1 subject to the following constraints:

(A) Each number 0∈ {1, . . . , L} appears exactlyd timesin the configuration matrix.

(B) The entries in any given column are distinct.It will be convenient to index the rows and columns of theconfiguration matrix by 0, . . . , h − 1 and 0, . . . , n − 1.

The configuration should be interpreted as follows. Thesystem is comprised ofn disks, indexed 0, . . . , n − 1.The system also hasL files (data units) which are indexed0, . . . , L− 1. and there ared copies of each file on differentdisks.

384 E. Bachmat, T.K. Lam / J. Parallel Distrib. Comput. 65 (2005) 382–395

Each disk is divided intoh ring-shaped radial zones, whichcan each hold a single file. We index these radial zones fromthe outer most to the inner most by indexes 0, . . . , h − 1.Each column of the configuration matrix describes a singledisk. Thekth entry in the column is the index of the filewhose data resides in thekth radial zone of the disk.

To summarize, each column states which files reside ona given disk and in what order, starting from the outsidecircumference. This order is compatible with the standardindexing of physical blocks in disk drives.

We introduce some of the configurations which we willstudy in this paper. Whend is not explicitly noted we assumethatd = 2.

(1) Physical mirroring: (Folklore)We define theh × d matrix P = Ph,d by the formula

Ph,di,j = i.In this configuration alld disks contain identical data at

identical locations.(2) Interleaved declustering: (introduced in[33])Let h be even. We define theh × (h + 1) matrix I = Ih

by the formulaIhij = i + hj/2 for i < h/2 and

Ihij = Ihi−h/2,j−((i+1−h/2)modh)+1 for i�h/2.For example, whenh = 4 we obtain the configuration

0 2 4 6 81 3 5 7 98 0 2 4 67 9 1 3 5

We note that each pair of disks share a single file in com-mon in this configuration.

(3) Group rotate: (introduced in[15])We define theh × 2h matrix G = Gh by the formula

Ghi,j = j + ih for j < h,

Ghi,j = Gh

i,j−h−i modh for j�h.Whenh = 3 we obtain

0 3 6 0 3 61 4 7 7 1 42 5 8 5 8 2

(4) Chained declustering: (introduced in[22])Let h be even. We define theh × n matrix C = Ch,n by

Ch,ni,j = i + jh/2 for i < h/2,

Ch,ni,j = C

h,ni−h/2,j+1 modn for i�h/2.

Whenh = 4 andn = 4 we get0 2 4 61 3 5 72 4 6 03 5 7 1

2.1.1. Semistructured configurationsWe introduce a large family of configurations which we

call Semistructured. We assume strictly for the purpose ofsimplicity thatd dividesh.

Definition. A configurationM is said to be semistructured ifthere exists a vectorQ = (q0, . . . , qN−1) with 0�qi �h −

h/d such that the setDQ = ∪Ni=1

{Mi,qi ,Mi,qi+1, . . . ,

Mi,qi+(h/d)−1}

equals the set{0, . . . , L − 1}.

If M is semistructured with respect toQ we let

BQ = ∪Ni=1 {(i, qi), . . . , (i, qi + (h/d) − 1)} .

By definition the set of locationsBQ contains a single copyof each file.

We provide some examples of the newly definedsemistructured configurations.

(A) Ph,2 is semistructured with respect to two symmetri-cally relatedQ vectors,Q1 = (1, h/2 + 1), Q2 = (h/2 +1,1). More generallyPh,d is semistructured.

(B) Ih andCh,n are both seen to be semistructured withthe choiceQ = 0.

(C) Gh is semistructured with the choiceQi = 0 for i =0, . . . , h − 1 andQi = h/2 otherwise.

The configuration7 5 72 9 91 6 25 0 86 3 03 8 5is an example of a “generic” semistructured configurationwith a uniqueQ vector(2,3,0).

while, the configuration0 01 22 13 3provides a very simple example of a nonsemistructured con-figuration.

2.2. Seek function and user access pattern

In order to study seek minimization we need to specifythe amount of time it takes the disk to seek between files. Aseek function has the general formf (|i − j |) wherei andj refer to the radial location indices of the files on the disk.f is obviously an increasing function since it longer seeksmore time. In our simulations, we use the linear seek func-tion modelf (|i − j |) = |i−j |

(h−1) . This model is by far themost popular model in analytic studies of storage systemperformance[4,12,13,21,34]and others. Using other non-linear functions will only change the numerical values ofthe computed average seek, but not the relative performanceof different configurations. The normalization factor1

h−1 al-lows us to compare systems with different values ofh sincea full disk seek from the outer most file to the inner mostfile always takes 1 unit of time.

Since our goal is to study the effect of a configurationchoice we will use the simplest and most popular modelof a user’s access pattern. We assume that the user sendsrequests independently of each other to uniformly distributed

E. Bachmat, T.K. Lam / J. Parallel Distrib. Comput. 65 (2005) 382–395 385

random locations in the system. This is a homogeneous ac-tivity model, and only reads profit from a mirroring algo-rithm, therefore we assume a read-only access pattern.

3. Seek minimization

In this section, we consider the effect of a configurationchoice on seek minimization. There are two types of seekminimization algorithms which may be employed. The firsttype is the family ofonline algorithms. These algorithmsdecide which data copy will be used to service a readI/O

request based on the current position of the disk heads inthe system. The prototypical example of an online algorithmis theNearest serveralgorithm, NS for short, which at anygiven moment chooses the head closest to a data copy, toservice the request. Such algorithms may be very efficientbut they require knowledge of the head location of severaldisks in the system (those containing copies of the requireddata). Such information may be difficult to obtain in realtime especially in distributed systems. This problem leads tothe consideration of a second type of mirroring algorithmswhich areoffline. In such algorithms, one chooses for eachcopy 0� i�d − 1 of each file 0�j�L − 1 a probabilitypi,j that a read request from filej will be serviced by theith copy of the file. When a read request to filej enters thesystem it is serviced by theith copy of the file with prob-ability pi,j regardless of the current head positions in thesystem. The choice of probabilities can change periodicallywhen new data on the user access pattern is received. Suchalgorithms are very easy to implement in distributed sys-tems and require minimal control, so they can even be im-plemented outside the system. Online algorithms by theirreal time nature are very difficult to implement using exter-nal controls. In the next two subsections we consider bothtypes of algorithms and how the choice of a configurationaffects their performance. We note that both types have beencommercially implemented in disk arrays.

3.1. Online algorithms

We describe the average seek of the NS algorithm withrespect to a configurationM. We assume thatd = 2. Withthe exception of the work of Calderbank et al.[12,13], allstudies of online mirroring algorithms have been incorrect.This is explained in the appendix.

Recalling that the user access pattern we employ is anindependent request model, the nearest server algorithm im-plemented on a system with configurationM is describedby a Markov process, which we denoteNSM . The states ofNSM are given by the file positions of all disk heads. Suchstates correspond to vectors of lengthn with values in therange 0, . . . , h−1. The system is in stateS = (s0, . . . , sh−1)

if the head of theith disk is currently located at filesi onthe disk, where the files on the disk are indexed from theoutermost to the innermost radially.

We refer the reader to[17] for the basic theory of Markovchains, including the notions of periodicity, ergodicity, sta-tionary distributions and the like.

It is easy to show that for any configurationM, NSM hasno periodicities. For some configurations such asPh,d it isnot ergodic. We will comment on this phenomenon shortlybut for now assume thatM is a configuration for whichNSMis ergodic. In this case there will be a unique stationaryprobability distribution�M on the finite set of all possiblestatesS.

For l = 0, . . . , L − 1 andk = 0, . . . , d − 1 let il,k, jl,kbe the locations of the different copies of thelth file, that is,Mil,k,jl,k = l for all 0�k�d − 1. Let S = (s0, . . . , sn−1)

be the state of the system at some given time and let thenew request be to filel. Since we assumed that the accessprobability is uniform, the probability thatl will be chosenis 1

L. Let r(S, l) = Mink |sil,k − jl,k| be the seek distance,

given the current stateS, that the head chosen by the nearestserver algorithm will cover to service the request to filel.The average seek of the NS algorithm with configurationMwill be

E(M) =∑S

∑l

1

L�M(S)r(S, l).

Since the number of states,hN is usually very large it isusually very difficult to compute�M and hence this formulais not used for computations. Instead we resort to simula-tions ofNSM in which we record the average observed seekdistance.

We still must address the issue of ergodicity. In noner-godic Markov processes there are several basic stationarydistributions�1, . . . ,�f , corresponding to the different re-current sets of states. In such cases we getf different for-mulas for the average seek, one for each stationary distribu-tion. In terms of the performance of the system this meansthat the average seek may depend (probabilistically) on theinitial state of the system. The Markov chains arising fromvarious possible configurations are not necessarily ergodic.

For example ifh�d thenNSPh,d hasd! ergodic com-ponents, however they all turn out to be equivalent and toyield the same average seek distance hence the average seekis well defined and independent of the initial state of thesystem.

The following facts regarding ergodicity of the Markovchain can be proved:(1) It is easy to produce examples withd = 4 of config-urationsM which do not have a well defined average seekdistance.(2) All explicit configurations appearing in this paper dohave a well defined average seek distance.(3) It can be shown that a randomly chosen configurationhas a well defined average seek with very high probability(ash tends to infinity).(4) We conjecture that any configuration withd = 2 has awell defined average seek. This conjecture can be proved forsystems with 2 or 3 disks.

386 E. Bachmat, T.K. Lam / J. Parallel Distrib. Comput. 65 (2005) 382–395

Given the evidence in items 2–4 this issue has no effecton the results reported in this paper.

3.1.1. Results for 2 disk systemsWe consider systems containing only two disks which

contain the same files but not necessarily in the same order.Larger systems can be built from 2 disk systems by repeatingthe configuration, as in large shadowed systems in whichdisks are paired and physically mirrored. Such systems arecommercially common.

Since we assume a uniform access pattern, permuting thenames of the entries in a configurationM does not change thebehavior of the system, hence we may assume thatMi,0 = i.We denote by� the permutation that satisfiesMi,1 = �(i),hence the choice of a configuration amounts to choosing apermutation. If the chosen permutation is the identity weobtain physically mirrored disks.

As an example consider the caseh = 6. In this case thereare 720 possible permutations. Inverting the order of appear-ance of file copies on the second disk has no effect on theresults since it is a distance preserving operation, thus weneed to consider only 360 permutations. We simulated all360 permutations, the permutation that produces the lowestaverage seek was� = (3,4,5,0,1,2) which givesE(M) =0.151. Note that� corresponds to the chain declustered con-figurationC6,2. C8,2 was also the best among configurationswith h = 8. The analogue for largerh isCh,2, for h = 20 theaverage seek was 0.146. The identity forh = 20 produced0.162. We conjecture thatCh,2 is optimal for all values ofh.

We note that the result forC20,2 is lower then the aver-age seek for an optimal online mirroring algorithm (not NS)which uses the physically mirrored configuration. The aver-age seek for the optimal algorithm which was computed in[5] is 0.159.

We compared physical mirroring with generic semistruc-tured configurations and completely random configurations.The generic semistructured configurations were drawn usinga randomQ vector, writing the numbers 0, . . . , L−1 in con-secutive order inBQ and then randomly placing 0, . . . , L−1at the other positions. Completely random configurationswere drawn by choosing a completely random permutation.We chose for our initial experimentsh = 30. We chose 1000random configurations. For each configuration we simulatedNS until the value of the average seek seemed to settle ona given value up to a 0.005 shift. The following list sum-marizes the results. The left column represents a range forthe average seek and the right column counts the number ofrandom configurations whose average seek fell within thatrange.0.175–0.180 30.180–0.185 370.185–0.190 1450.190–0.195 3170.195–0.200 3500.200–0.205 1400.205–0.210 8

We then chose 1000 generic semistructured configurations.The corresponding table looks as follows:0.160–0.165 1140.165–0.170 7930.170–0.175 93Physical mirroring yielded an average seek of 0.161. Thecomputation for physical mirroring was initially performedin [12]. We cannot use directly the value of 0.1625 givenin [12] since that is the value for very largeh and sinceour NS procedure differs slightly from that of loc.sit.The average seek in physical mirroring was incorrectlycomputed in [9] and other papers to be in the range0.2–0.21. The reason for the error in[9] is explained inthe appendix.

Our main observation is that each semistructured con-figuration yielded a lower average seek than each randomconfiguration. We also observes that physical mirroring andCh,2 are among the best semistructured configurations.

Repeating this experiment withh = 100. The range forthe average seek of random configurations was 0.188–0.205.The range for generic semistructured configurations was0.166–0.173.

We now try to qualitatively explain the data. We claimthat ash tends towards infinity the average seek distancefor almost all configurations will tend to a single limit,hence the performance of all randomly chosen configura-tions will tend to converge. In fact ash tends to infinitythe Markov chain associated with a random configura-tion will converge to a unique Markov chainMrand. Thestates ofMrand are points(x, y) in the unit square. Thestate transition inMrand is as follows. A point(x′, y′) isuniformly chosen in the unit square. If|x − x′|� |y − y′|we move from(x, y) to (x′, y). Otherwise we transitionto (x, y′). The transition cost isMIN(|x − x′|, |y − y′|).The points x′, y′ are the continuous limit for the pairi

h−1,�(i)h−1, wherei is chosen uniformly and� is a random

permutation.Mrand is closely related to the CNN problem which was

studied in [23] in the context of competitive algorithms.Simulations indicate that the average seek ofMrand is about0.193, in line with our results on the finite approximationrandom configurations.

For generic semistructured configurations the limitingprocesses are different, since there is a relation betweeniand �(i). More precisely ifi is contained inBQ then it’scopy onD1 is not in BQ and vice versa. If we consider afamily of generic semistructured configurations, withh −→∞ anda = q0/h, b = q1/h fixed, the limiting process willbeMa,b whose transition rule is the following. Given a state(x, y) choosex′ uniformly. If x′ ∈ [a, a + 1/2] choosey′uniformly outside the range[b, b + 1/2], otherwise choosey′ uniformly within [b, b + 1/2]. Proceed as inMrand. Wedid not study the dependency of the average seek ofMa,b

on a andb, but our previously stated empirical results sug-gested that the dependence is not too great and is within therather narrow range of 0.166–0.173.

E. Bachmat, T.K. Lam / J. Parallel Distrib. Comput. 65 (2005) 382–395 387

3.1.2. Results for systems with more disksWe conducted some experiments on systems with more

than two disks. The results convey the same spirit as those ofthe two disk experiments, however the average seek numberstend to be higher. The configurationI8 which has 9 diskshad an average seek of 0.175 andG8, with 16 disks, yieldeda result of 0.193.

We also performed empirical tests to measure the averageseek of configurations on 10 disks withh = 30. Randomconfigurations were in the range 0.197–0.214, with the ma-jority around 0.205. As in the case of two disks we have alimit for the average seek of a random configuration whichdepends only on the number of disks in the configuration.

Generic semistructured configurations again performedbetter in the range 0.170–0.177 with the majority around0.174. These numbers show that for both 2 and 10 diskssemistructured configurations lie in terms of average seekabout a third of the way between physical mirroring and ran-dom configurations. We do not expect the numbers to varymaterially beyond 10 disks.

Finally, we note that there are other seek minimizationalgorithms for which it would be interesting to repeat suchexperiments. The prime example being thedouble coveralgorithm or DC for short, see[11] for a full description. In aworst case setting (competitive analysis), it is far superior toNS, however in the distributional setting which we consider,it is not as good. On a physically mirrored configuration ityielded an average seek of 0.18…

3.2. Offline algorithms

Offline mirroring algorithms are algorithms which do notrequire knowledge of the system’s current state to make achoice of the copy which will service a request. We do allowthe choice to be probabilistic. This leads to the followingdefinition.

Definition. An offline mirroring algorithmconsists of thechoice of a matrixW, called theweight matrix, of sizen×h.W has the following two properties:(1) wi,j �0.(2) Let (il,1, jl,1), . . . , (il,d , jl,d ) be the locations of filel,that is,Mil,k,jl,k = l for all 1�k�d, then

∑dk=1 Wil,k,jl,k =

1, for all l.

GivenW, a read request for data from filel will be directedtowards thekth copy of the file, which resides on diskil,k inpositionjl,k, with probabilityWil,k,jl,k . Condition (2) ensuresthat the sum of the probabilities is 1.

If M is semistructured with respect to a vectorQ, we candefine a particular weight matrix by the ruleWi,j = 1 if andonly if (i, j) ∈ BQ. The offline mirroring algorithm whichthis weight matrix describes will be called apartition al-gorithm. When the semistructured configuration is physicalmirroring then this algorithm coincides with the classicalpartition algorithm.

Since we assume that the activity of all files is equalwe may normalize it to be 1. Whenh is fixed, all partitionalgorithms have the same average seek which is

EParth = 2d2

h2(h − 1)

(h/d)−1∑k=1

((h/d) − k)k

regardless of the specific semistructured configuration used.It is easy to verify that ash tends to infinity the average seektends to 1/(3d).

Given a configurationM and an offline mirroring algo-rithm W it is easy to show[34] that the average seek overall disks in the system is

E(M,W)=n∑

i=1

EDi

= 2n∑

i=1

m>j

Wi,mWi,j f (m − j)

/

aDi.

Here Di refers to theith disk, EDiis the contribution

of the ith disk, f refers to the seek function andaDi=∑h

j=1 Wi,j /L is the activity of theith disk. LetS(M,W) =E(M,W)/(2L) and correspondinglySDi

= EDi/(2L). For

convenience, we will useS(M,W) to compare configu-rations and offline algorithms instead of the average seekE(M,W), since they are proportional the comparison willnot be affected. Letf be a seek function. All we will requireis that f is increasing. Givenf the seek time between theith and jth files on the disk isf (|i − j |). In the previoussection we used the functionf (x) = x/(h − 1).

The following result shows that semistructured configu-rations equipped with the partition algorithm are optimalamong all pairs consisting of a configuration and an offlinemirroring algorithm.

Theorem 1. Fix h. Given a configuration M with parameterh and an offline algorithm W thenE(M,W)�E(Part)h,equality holds if and only if M is semistructured and W is apartition algorithm with respect to a Q vector.

Proof. We consider the restriction of the configurationMand the weight matrixW to a single diskDk in the system.The weight matrix restricted to the files on the particulardisk can be described by a weight vector�w = (w1, . . . , wh)

wherewi = Wk,i . The contribution of the disk to the seekestimateS(M,W) will be denoted byS( �w), to emphasizethe dependence on the vectorw. We have

S( �w) =∑

i<j wiwj∑i wi

f (j − i).

We need the following simple lemma:

Lemma 2. For 0 < x�h considerV (x), the set of all vec-tors �w with entries0�wj �1 and satisfying

∑i wi = x.

388 E. Bachmat, T.K. Lam / J. Parallel Distrib. Comput. 65 (2005) 382–395

A weight vector�w that minimizes S among all weight vectorsin V (x) must satisfy the following conditions:There exists an index q, 1�q�h − [x] such that

(1) wj = 1 for q�j�q + [x] − 1(2) Either wq−1 or wq+[x] equalsx − [x], and hence all

other weights are zero.

That is, �w = (0, . . . ,0, x − [x],1,1, . . . ,1,0,0, . . . ,0) or(0, . . . ,0,1,1, . . . ,1, x − [x],0,0, . . . ,0).

Proof. Let �w be any weight vector inV (x) that does notsatisfy the conditions in the lemma. Letr be the largest indexfor which the corresponding entrywr is nonzero. Then thereexistsq, 0 < q < r such thatwq < 1 andwq+1 = wq+2 =· · · = wr−1 = 1. Lety = min(1−wq,wr) and denote by�vthe following perturbed weight vector:

vi =wq + y if i = q,

wr − y if i = r,

wi otherwise.

One now easily verifies the following formula:

S( �w) − S(�v)

= y

x

q−1∑

j=1

wj (f (r − j) − f (q − j))

+f (r − q)min(wq,1 − wr)

�0.

Equality holds ifwj = 0 for all j < q and eitherwr =1 or wq = 0. Hence the weight vector which attains theminimum must have the required form.�

Using Lemma2, we obtain easily that the minimum seekfor vectors inV (x) is

S(x) =[x]∑k=1

(1 − k

x

)f (k).

Since the weight vector�w of partition restricted to a singledisk is of the form given in the lemma we see that thecontribution of a single disk to the seek estimate of partitionis S(h/d), h/d being the weight of each disk in partition.

We wish to study the dependence ofS(x) on x.Let g(k) = f (k) − f (k − 1) then

S(x)=[x]∑k=1

(1 − k

x

) k∑j=1

g(j)

=[x]∑j=1

[x]∑

k=j

(1 − k

x

) g(j).

we wish to find the minimum of∑

i S(ai) over all sequencesa1, a2, . . . , an, whereai are nonnegative real numbers such

that∑

i ai = L. Let Hj(x) = ∑[x]k=j (1− k/x). It is easy to

verify the following properties:(1) Hj(x) is an increasing function;(2) Hj(x) = 0 if x < j ;(3) Hj(x) = [x] − j + 1 − [x] − j + 1)([x] + j)/(2x) if

x�j ;

The following lemma deals with the restriction of the systemto two disks.

Lemma 3. Let a < b be two nonnegative numbers andconsider the sumHj(a) + Hj(b).

(1) If a, b are both not integers, then there existsa′, b′ suchthat

Hj(a) + Hj(b)�Hj(a′) + Hj(b

′)

a′ or b′ is an integer anda + b = a′ + b′.(2) If a, b are both integers, then

Hj(a) + Hj(b)�Hj(a + 1) + Hj(b − 1).

Proof.(1) Case1: a < b < j

Hj (a) + Hj(b)

=

Hj(0) + Hj(a + b) if a + b < 1;Hj([a] + 1)

+Hj(a + b − [a] − 1) otherwise.

Case2: a < j < b

Hj (a) + Hj(b)

> Hj ([a] + 1) + Hj(b + a − ([a] + 1))

sinceHj(a) = Hj([a] + 1) = 0 andHj(x) is an in-creasing function.Case3: j < a < b. Let x1 = min(a − [a], [b] + 1− b)

andx2 = min([a] + 1 − a, b − [b]). We consider thefunctionF(x) = Hj(a+x)+Hj(b−x) in the interval[x1, x2]. The second derivative ofF is

− ([a] − j + 1)([a] + j)

(a + x)3 − ([b] − j + 1)([b] + j)

(b − x)3

which is non-positive in the interval[x1, x2]. Hence,the functionF(x) attains its minimum whenx = x1 orx = x2. In both cases, eithera+x or b−x is an integer.

(2) Supposea�b are integers. The expressionHj(a) +Hj(b)−Hj(a + 1)−Hj(b− 1) is always nonnegativewhena < j . Whenj > a,

Hj(a) + Hj(b) − Hj(a + 1) − Hj(b − 1)

= (b − a − 1)(a + b)

2a(a + 1)b(b − 1).

This shows thatHj(a) + Hj(b) is a minimum whena = b or a = b − 1. �

E. Bachmat, T.K. Lam / J. Parallel Distrib. Comput. 65 (2005) 382–395 389

To complete the proof of the theorem, consider a pairM,W consisting of an offline algorithm with weightsW ona configurationM with n disks. Let�0, . . . ,�n−1 be the sumof weights on each of then physical disks with respect toW.We have

∑i �i = L. Fix j and consider the sum

∑i Hj (�i ).

By repeated application of Lemma3(1) to pairs of�i ’s whichare not integers, we can find�i,j such that all�i,j are in-tegers,

∑i �i,j = L and

∑i Hj (�i,j )�

∑i Hj (�i ). Now,

apply Lemma3(2) repeatedly, to the smallest and largest of�i,j ’s. Each step lowers

∑i �2

i,j , thus, after a finite numberof steps we obtain�i,j , integers, such that

∑i �i = L and for

anyi, l, |�i,j −�l,j |�1, thus�i,j = L/N = h/d for all i. ByLemma 3(2) we also have

∑i Hj (h/d)�

∑i Hj (�i ) for

all j and consequentlynS(h/d) = ∑i S(h/d)�

∑i S(�i ).

By Lemma2 we also haveS(�i )�Si(M,W). On the otherhandnS(h/d) is the seek achieved by the partition algo-rithm. From Lemmas3 and2 it is also easy to see that equal-ity holds only if M is semistructured andW is a partitionalgorithm. �

The theorem shows the advantages of a semistructuredconfiguration over a random configuration. Homogeneousworkloads in which the activity of all files is equal are com-mercially common. One example is given by Teradata[33],a database application which uses hash functions to achievecomplete homogeneity. In a random configuration the ran-dom workload will most likely be divided evenly across theentire disk while in a semistructured configuration all theactivity will be handled by a more localized radial zone, say,the outer radial half of the disk. The performance compari-son is thus the same as that of a uniformly random workloadacross an entire disk, compared with a uniformly randomaccess workload restricted to half the disk.

Remark. We restricted ourselves to the case in whichddivides h simply in order to simplify the definition of asemistructured configuration and the theorem. Both the def-inition and the theorem can be extended to the general case.Note that by the second lemma ifd does not divideh the op-timal seek minimization solution is not load balanced. Forexample, for a pair of physically mirrored disks withh =3, the offline algorithm with weight vector(1,1,0) has anaverage seek of 1/6 while the load balanced weight vector(1,1/2,0) yields an average seek of 2/9.

We also note that the result in the theorem can be gen-eralized to include synchronous writes. The inclusion ofwrites places some restrictions on theQ vector of optimalsemistructured configurations, it must be symmetric with re-spect to the radial center of the disk.

3.3. Comparison of online and offline seek minimization

It is interesting to compare the seek minimization quali-ties of offline and online algorithms. Forh = 4 we have thefollowing simple claim which shows that surprisingly of-fline algorithms are at least as good if not better then onlinealgorithms, at least when the access pattern is uniform.

Proposition 4. If h = 4 and the access pattern is uniformthen partition with respect to a semistructured configurationis optimal among all pairs consisting of a choice of a config-uration and a mirroring algorithm, either online or offline.

Proof. Since partition only uses locations inBQ to servicerequests all heads in the system point to different files, sinceno two copies of the same file are inBQ. Sinceh = 4 andall activity levels are equal we see that the cumulative accessprobability of all the files to which the heads are pointing atany given moment is 1/2. We see that with probability 1/2a read request will result in a head movement. For partitionthis head movement is always minimal, to a neighboring file.For any other algorithm the probability of head movementin any given state is at least 1/2 and the head movement is atleast one file, hence, no algorithm can have smaller averageseek than partition. �

Remark. We note that the normalized average seek for par-tition is approximately 1/6 whenh = 4. For NS with phys-ical mirroring as the configuration, a simple Markov chaincalculation gives an average seek value of approximately0.188 whenh = 4.

Whenh > 4 this result is no longer true as was shown inour computations of the average seek forh = 20. This hasalready been observed in[12]. As h becomes large, phys-ically mirrored disks andCh

2 yield better results than par-tition. We note however that both these configurations arefor two disks and their generalization to many disks will beshown in the next section to suffer from poor load balanc-ing. Partition on the other hand can be generalized to anynumber of disks with semistructured configurations whichprovide good load balancing.

The main advantage of online algorithms over offline al-gorithms lies in the former’s better ability to adjust to rapidlychanging workloads. On the other hand offline algorithmsare much easier to implement since they do not require realtime communication between different disks and processors.

3.4. Generic semistructured configurations and virtualstorage

We saw that semistructured configurations are good atseek minimization in the presence of both online and offlinealgorithms. We will show in the next section that some par-ticular semistructured configurations likeI orG also providevery good load balancing. Naturally one might wonder as towhy we should bother with generic semistructured configu-rations as well.

Virtual storage disk arrays in which data is constantly be-ing moved about the system are very common commercially.Such systems employ a garbage collection mechanism whichcontinuously provides available disk spaces by eliminatingobsolete data in the system. For example, one very populargarbage collection mechanism which optimizes write per-

390 E. Bachmat, T.K. Lam / J. Parallel Distrib. Comput. 65 (2005) 382–395

formance is known as LSF[30]. LSF and similar garbagecollection mechanisms change the configuration, so even ifa system begins with anI or G configuration it will not re-main in that state.

It is easy to insure that the garbage collector maintainsonly semistructured configurations by employing indepen-dent garbage collection inBQ and it’s complement. SinceBQ and it’s complement consists of contiguous disk regionsthis will not hamper the work of LFS, for which the notionof contiguous disk space is important.

Our results and those of the next section show that theresulting generic semistructured configurations all sharegood performance which is superior to that of random con-figurations, hence system performance will be maintainedthroughout the lifetime of the system despite the constantconfiguration changes.

4. Load balancing

We turn our attention to a second performance metric,bandwidth. Since high bandwidth is achieved by exploitingthe parallelism inherent in disk arrays, we equate high band-width and load balancing with the ability of the system tohandle concurrently requests to many files. This ability doesnot depend on the internal order of the files in the disk. Wewill discuss the casesd = 2 and> 2 separately.

4.1. Load balancing with d=2

A multi graph is a graph in which more than one edgebetween vertices is allowed.

Definition. Given a configurationM, the configurationgraph (multi graph) GM is a graph whose vertices corre-spond to the disks of the system and whose edges corre-spond to files. A file connects the two disks on which it’scopies reside.

Examples. (1) The configuration graph ofIh is the com-plete graph onh + 1 vertices since every two disks share asingle file in common.

(2) The configuration graph ofGh is the complete bipartitegraph on(h, h), i.e. the vertices of the configuration graphare divided into two disjoint sets of sizeh. There are noedges between members of the same set but there is an edgebetween any two members which are not in the same set.

(3) The configuration graph ofChN consists of a cycle of

lengthN, with each edge multipliedh/2 times.(4) The configuration graph ofPh consists of two vertices

joined byh edges.If a configurationM has parametersN, h thenN is obvi-

ously the number of vertices ofGM andhwill be the degreeof the vertices in the graph. For any graphG we denote it’svertex set byV (G) and it’s edge set byE(G).

The load balancing capabilities of random graphs havebeen studied in[31,32] assuming the complete graph as theconfiguration graph and forh fixed in [1]. Following [32]we consider the following.

Let G be a multi graph. Aschedule fis a mappingf :E(G) −→ V (G) such thatf (e) is incident toe for alledgese ∈ E(G). The load of a schedulef, L(f ), is definedas Maxv∈V (G) |f−1(v)|. The load ofG, L(G) is definedto be Minf L(f ), where the minimum is taken over allschedules ofG. To explain the terminology, consider thestate of the system at some given time instance. Assumethat there are outstanding read requests to a subsetA of thefiles in the system. We may form the subgraphGA of GM

which has the same vertex set but whose edges correspondonly to files inA. A schedulef of GA is simply a rule whichdecides which copy of each file inAwill service the requestto that file. The load is the maximal number of file requeststhat any disk in the system will have to support.

For example the system will be able to concurrently sup-port requests to a subset of filesA if GA has a schedule withload 1, or in other wordsL(GA) = 1.

If � is a set of vertices of a graphG, we let e(�) de-note the number of edges ofG both of whose endpointsare in �. The unavoidable loadof �, L(�) is defined tobe e(�)/|�| rounded up to the next integer value. We letL∗(G) = Max� L(�) where the maximum is taken over allsubsets ofV (G). It is proved in[32] thatL(G) = L∗(G).As a special case this result means thatL(G) = 1 if andonly if all the connected components ofG are either treesor contain a single cycle (unicyclic).

As a first application of these ideas we have the following:

Proposition 5. If h is even then, every configuration M canbe converted into a semistructured configurationM ′ by per-muting the internal order of files in the various disks. Inparticular M andM ′ share the same configuration graph.

Proof. To construct a semistructured configuration via in-ternal order permutations in the disks we need to find foreach diski a subsetTi of h/2 files from the disk such thatthe Ti are disjoint. Then the unionT = ∪Ti will consistof all files. We then shuffle the order of appearance of filesin each disk so that the files inTi appear contiguously andthus obtain a semistructured configuration. To prove the ex-istence of suchTi we note that in terms of the configurationgraph the choice ofTi corresponds to a schedule ofGM withload h/2. Let � be any set of vertices inGM . Since thereare h edges incident to any vertex in the graph and sincethe edges which are counted ine(�) have both endpointsin � we havee(�)� |�|(h/2), the proposition now followsimmediately from the equationL(GM) = L∗(GM). �

The proposition essentially states that generic semistruc-tured configurations and random configurations have similarload balancing capabilities.

E. Bachmat, T.K. Lam / J. Parallel Distrib. Comput. 65 (2005) 382–395 391

To measure the load balancing capabilities of families ofconfigurations we introduce, following[1], the notion ofthe load 1 bandwidthof a family of configurations. LetGbe a graph. Recall that in terms of the configuration graphoutstanding read requests correspond to edges. Our modelfor picking a random set of read requests, or equivalently arandom edge subsetAwill be the standard random subgraphmodel in which each edge ofG is chosen independently withsome fixed probabilityp. We letG(p) denote the probabilityspace on the subgraphs ofG induced by this model.

Consider a familyG = (Gn) of regular graphs, withdegreeshn = h(Gn). Consider the random variable�n,p :Gn(p) → {0,1} whose value onA is 1 iff the subgraph ofG induced by the edge setA is of load 1.

We define thecycle thresholdof the family G as

CT (G) = Sup

{1

2hnp | lim inf

n→∞ E(�n,p) = 1

}.

Note that12hnp|V (Gn)| is the expected number of edges in

a subgraph inGn(p), and thereforeCT (G) represents theasymptotic ratio of edges to vertices for the maximump forwhich the subgraphGA in G(p) will be of load 1 with prob-ability approaching 1. Equivalently, the load 1 bandwidthdescribes the maximal portion of the system’s disks whichcan be concurrently active serving request without queuingrequests. It is our measure of the load balancing capabilitiesof families of configurations.

It is shown in [1] that for any family of configurationswith hn tending to infinity, we haveLB(G)�1/2, whichmeans that asymptotically no mirrored system withn diskscan concurrently support more thann/2 randomly chosenread requests with high probability. This is still much betterthan theO(

√n) requests that a nonmirrored system can

support. We will consider a family of configurations to begood at load balancing ifLB(G) = 1/2.

4.1.1. ExamplesWe examine the load balancing capabilities of the pre-

viously introduced families of configurations. We note thatthe case ofh fixed is treated in detail in[1] where explicitoptimal families of configurations arising from number the-ory are presented. Thus, we will consider families in whichh tends to infinity.(1) Let Gn be the configuration graph of a system withndisks composed ofn/2 pairs of physically mirrored disks.For a subsetA of the files the load is 1 only if at most 2 ofthe files inA reside on a given pair of physically mirroreddrives. It can be easily shown that this criterion leads to theconclusion that a physically mirrored system withn diskscan concurrently support no more thanO(n2/3) requestsleading toLB(G) = 0.(2) LetGn = G(Ch

n), for someh. GA will have load 1 onlyif at most 3 active files reside on any given disk. This leadsto the conclusion that an system interleaved declustering

system withn disks can support onlyO(n3/4) random readI/O concurrently and henceLB(G) = 0.(3) LetGh = G(Ih), thenGn is the complete graph onh+1vertices. By the results of[16] (see also[2, Chapter 10]) onthe phase transition for random graphs we haveLM(G) =1/2.(4) Let Gh = Gh be the family of group rotate configura-tions. By arguments which are very similar to the case ofcomplete graphs we haveLB(G) = 1/2. For the sake ofcompleteness we provide a sketch of the proof. Assumep =c/h with c < 1. Let Bh,h be the complete bipartite graphwhich as noted above is the configuration graph ofGh. LetD andE be the two sets of vertices withh elements each,that is vertices are numbered 1, . . . ,2h and each vertex inthe rangeD = {1, . . . , h} is connected to each vertex in therangeE = {h + 1, . . . ,2h}. We first compute the averagenumber of cycles in a random subgraphGA of Bh,h(p).

In a bipartite graph all cycles have even length. To specifya cycle of length 2k we choose an orderedk tuplex1, . . . , xkfrom D and an orderedk tupley1, . . . , yk from E and con-sider the cycle(x1, y1, x2, y2, . . . , xk, yk), this representa-tion is unique up to a cyclic permutation on both thexs andys and orientation reversal, hence the number of cycles inBh,h is (h(h − 1) . . . (h − k + 1))2/2k < h2k/2k.

The probability of existence of the specified cycle inGA

is p2k = c2k/h2k since it has 2k edges, so the expectednumber of cycles of size 2k in H is bounded byc2k/2kand hence ifc < 1 the total expected number of cycles isdominated by a converging geometric series and hence isbounded independently ofh.

In our situation bad components ofGA with load greaterthen 1 are those containing at least two cycles. The compo-nent must therefore include either two cycles with a com-mon vertex or edge or a pair of disjoint cycles connectedby a path. We will show that the probability of such ele-ments emerging inside a component ofGA tends to zero ash tends to infinity. We concentrate on the case of two cycleswith a common vertex or edge, the other case being nearlyidentical. We can view a pair of cycles with a common edgeas consisting of a cycle and two points, not necessarily dis-tinct, on the cycle with a path consisting ofl vertices notin the circle, also connecting them. The description is notunique. But that will not affect our calculations. For a cycleof size 2k there are 4k2 choices for pairs of vertices. Givena pair (z1, z2) and assumingz1 is in D and z2 in E (theother cases are again very similar) we choose a path withl = 2j (l must be even) vertices by specifying ordered setsx1, . . . , xj in D andy1, . . . , yj in E, the path is then givenby z1, y1, x1, y2, x2, . . . , . . . , yj , xj , zj and hasl+1 edges.The number of paths is then(h(h − 1) . . . (h − j + 1))2 <

h2j = hl and the path probability iscl+1/hl+1 hence theexpected number of paths of lengthl between pairs of pointson a cycle of length 2k is bounded by(2k)c2k+l/h. If c < 1then summing over allks andls we obtain a bound ofa/hfor some constanta, hence the expected number of com-ponents with at least two cycles tends to zero. This shows

392 E. Bachmat, T.K. Lam / J. Parallel Distrib. Comput. 65 (2005) 382–395

thatLB(G) > (1/2)h(c/h) = (1/2)c for all c < 1 whichcompletes the argument.(5) Families of random and semistructured configurations.It is easy to show using methods similar to those of theprevious item that families of random configuration graphswith h tending to infinity have load 1 balance equal to 1/2and by proposition5 the same holds for families of genericsemistructured configurations.

To summarize, the familiesIh,Gh, families of randomconfigurations and generic semistructured configurations areall asymptotically optimal load balancers whilePh andCh

n

configurations are not.

4.2. Load balancing ford > 2

When d > 2 the configuration graph is replaced by asimilarly definedconfiguration hyper graph. An alternativedescription is via a bipartite graphGwith vertex set(D, F ),with the setD havingn vertices representing the disks in thesystem andF havingnh/d vertices representing the files. Avertexi ∈ D is connected via an edge to an elementj ∈ F iffile j has a copy on diski. At a given time instance we assumethat we have a subsetA of files for which there are readrequests and consider accordingly the subgraphGA of G. Inthis setting all requests inA can be supported concurrently ifand only if the graphGA has aset of distinct representatives,SDR for short, which is a matching between all elements ofF and elements ofD. For a subsetB of A let N(B) denotethe subset ofD consisting of the neighbors ofB. By Hall’smatching theorem, see[20], an SDR exists iff for all subsetsB we have|B|� |N(B)|.

One application is the following proposition:

Proposition 6. If d = h then for any given configuration Mthe set of all files can be supported.

Proof. If d = h then the graphG is a bipartited regulargraph and hence by Hall’s Theorem[20], there exists a per-fect matching forG. �

We may define the load 1 bandwidth ofM exactly as be-fore. Our main result states that for random configurations,as the number of copiesd grows the load 1 bandwidth ap-proaches 1, which means that the system can exploit it’sentire potential bandwidth.

We use the following notation. Letn be an integer,c someconstant satisfyingc�1 and definek = cn. We will assumethatk is an integer.

Theorem 7. (1) If c < e− 2d−1 then

limn→∞ Pr(SDR exists) = 1.

(2) If c > 1 − e−dc then

limn→∞ Pr(SDR exists) = 0.

Proof. We shall first prove statement 1. By Hall’s basicmatching theorem[20], we can find a matching of all ele-ments ofA with elements ofD if for any subsetI ⊂ M

we have|N(I)|� |I |. Therefore the probability that an SDRmatching will not be found is bounded by

Pr(no SDR matching)

�∑I⊂A

P r(|N(I)| < |I |)

=k∑

m=2

(k

m

)Pr(|I | = m and|N(I)| < |I |).

If |I | = m we can think ofN(I) as the image of a mapf : {1, . . . , dm} −→ D. We have

Pr(no SDR matching)

�k∑

m=2

(k

m

) |{f : |image f |�m − 1}ndm

=k∑

m=2

(k

m

)m−1∑l=1

|{f : |image f | = l}|ndm

�k∑

m=2

(k

m

)m−1∑l=1

(n

l

) (l

n

)dm

. (1)

We first show that forc < e− 2d−1

limn→∞

k∑m=2

(k

m

)(n

m − 1

) (m − 1

n

)dm

−→ 0.

Let us compare successive terms in the sum. Denote byem the summand corresponding to the indexm. One easilyverifies thate2 < 1/n. We have

em+1

em= [cn]

m + 1

n − m + 1

m

(m

m − 1

)dm (mn

)d.

Sincee2 tends to zero and the successive ratios in the casesof interest to us will be bounded by some fixed constantwe may assume thatm is large hence we may replace witharbitrary accuracy the expressionem+1

emby

hm = cn − m

m

n − m

med

(mn

)d.

We then have

hm <cn2

m2 ed(mn

)d = edc(mn

)d−2.

We shall denote the right-hand side byjm consider now

rm = e2

m−1∏l=2

jl.

E. Bachmat, T.K. Lam / J. Parallel Distrib. Comput. 65 (2005) 382–395 393

Let us examinercn.

rcn ≡cn−1∏l=2

jl

≡ edcnccn(cn!)d−2/ncn(d−2)

≡ edcnccn(cn)cn(d−2)/(ncn(d−2)ecn(d−2))

= (e2cd−1)cn,

where≡ stands for equality up to a rational function (quo-tient of polynomials) ofn. Thus, if c < e−2/(d−1) thenrcnis exponentially small. From the definition one can easilyverify that we can find a constantb such that forn >> 0jl < 1/2 for all l < bn, therefore the sum

∑bnm=2 em is dom-

inated by a convergent geometric series whose first terme2tends to zero. We note that the sequencejl is increasing andhence the sequencerbn, . . . , rcn is either decreasing (jcn <

1) or decreases and then increases. In either case we get thatfor all l satisfyingbn < l < cn we haverl < max(rbn, rcn).Since both are exponentially small the sum of terms tends

to zero. We now show that under the conditionc < e− 2d−1

we have

(1 + �)(m−1)−l

(n

l

) (l

n

)dm

<

(n

m − 1

) (m − 1

n

)dm

for l < m − 1 and some� > 0. Let gl be the left-hand sideof the last inequality. Consider successive quotientsti =gi+1/gi for l < i < m. Then

ti ≡ edm/i(n − i)/(i + 1) > ed1 − e− 2

d−1

e− 2d−2

,

where ≡ denotes equality up to an arbitrarily close to 1constant. It is easy to verify for smalld that the right-handside is greater then one, while asymptotically

1 − e− 2d−2

e− 2d−1

≡ 2

d − 1

henceti ≡ 2ed/(d − 1) > 1 which establishes the claim.Thus, the sum

m−1∑l=1

(n

l

) (l

n

)dm

is dominated by an increasing geometric series whose topelement isem hence we only have to consider the sum oftheem and we are done proving statement 1.

We now sketch the proof of statement 2. We note that theprobability asn tends to infinity that an element inD will bein N(A) is 1− e−dc and hence the expected size ofN(A)

is (1 − e−dc)n which should be more thencn for an SDRmatching to exist. We also have to compute the variance ofN(A) to show that with probability tending to 1 the size ofN(A) is close to the average. The size ofN(A) is the sum ofthe random variablesXi , whereXi takes the value 1 ifi is in

the image and zero otherwise. Since the covariance of pairsXi,Xj is easily shown to be negative the variance of thesum is smaller then in the case of independent variables byFeller [17, Theorem 9.5.2]. An application of Chebyshev’sinequality ends the argument.�

The lower bounde−2/(d−1) is certainly not best possible.A simple inspection of inequality (1) shows that we canimprove the bound to exp(−2/(d−1))

1−(d+1)exp(−d)using essentially the

same argument and noting that one can consider only mapsf which are onto and in fact each element has an inverseimage of size at least 2. This is still far from the correctvalue which seems to be much closer to the upper bound.Experiments show that already ford = 3 the thresholdc isabout 0.91 so almost all disks are concurrently usable.

5. Conclusions

We have introduced a large class of configurations,the semistructured configurations. We proved thatconfigurations in this class are precisely those which allowoptimal offline seek minimization. We showed that genericsemistructured configurations and certain explicit fami-lies within this class such as group rotate and interleaveddeclustering allow good seek minimization, both offlineand online, and also allow good load balancing. Genericsemistructured configurations can also be easily maintainedin the context of virtual storage systems. These results sug-gest that mirrored storage systems should be configuredusing these configurations.

We also showed that by considering more than two datacopies the storage array can use all it’s bandwidth at maxi-mum efficiency, experiments suggest that 3 data copies aresufficient.

The present work considers both bandwidth and seekminimization which is related to response time. Optimizingbandwidth and optimizing response time are not the same.In fact, in many cases optimization procedures like stripingwhich improve bandwidth also increase response time. Thispaper can be viewed as part of a larger study of the relationbetween bandwidth and response time optimization with thegoal of improving both. We hope to return to this theme infuture work.

Acknowledgments

We would like to heartily thank D. Berend and D. Arnonfor very helpful conversations. We also thank Oren Weiss,Oded Glinanski, Alon Ozer, Nir Levin and the entire storagesystems class of spring 99 at BGU for providing the actualsimulation data used throughout the paper.

The authors would like to thank Jim Gray for the encour-agement he provided us in writing this appendix. Despitethe mathematical error[9] is an inspiring paper.

394 E. Bachmat, T.K. Lam / J. Parallel Distrib. Comput. 65 (2005) 382–395

Appendix

We consider the problem of calculating the average seekof NSM , the nearest server algorithm with respect to a con-figurationM. Our main observation is that the stationarydistribution ofNSM is not uniform even though the accesspattern is.

Consider for example a physically mirrored system withtwo disks andh = 3. It is easy to verify that the probabilityof the states(0,1) and(1,2) in the stationary distribution is1/4 while the probability of(0,2) is 1/2. The average seekcan then be calculated to be 1/6. This distribution thoughis not typical and is strongly influenced by the fact thath isvery small.

For a physically mirrored systems withd = 2 andlarge h the stationary probability and hence the aver-age seek have been computed in[12]. In particular thestationary distribution is plotted in that paper. It is ob-vious from the graph that the stationary distribution ishighly skewed though the distribution of the requestsis uniform.

This may seem at first counter intuitive but may be easilyverified either via simulations or simply by calculating smallexamples. The heads tend to hover around the center of thedisk more frequently then around the edges.

In [9] a computation of the average seek was made as-suming that the stationary distribution is uniform. The com-putation yielded an average seek of 1/(2d + 1). This resultis unfortunately incorrect. In the case ofd = 2, the correctresult as computed in[12] is about 0.161, not 0.2.

Looking back at our computations of the average seekof a random configuration we may notice that the value1/(2d + 1) provides a good approximation of the averageseek for these configurations. The reason is that the Markovchains associated with such random configurations tend tohave a much more uniform stationary distribution and hencethe calculation of[9] nearly applies.

Since the appearance of[9], the computation forphysically mirrored systems, has also been carriedout (sometimes independently) with the same error in[5–7,9,10,26,25,28,29]. The error is fatal only to a minorityof the papers, but some revision of some of the results isneeded in all of them. In these papers the error has been“generalized” to settings which include writes, non linearseek functions and others. The error has been validatedexperimentally in at least 3 papers. In addition other papersuse the computation to compare NS with partition. Sincepartition has an average seek of 1/3d they conclude that par-tition is better. Following simulations, the correct averageseek for physically mirrored systems with NS andd�10 isalways a little less then 1/3d, hence slightly better than par-tition. We conjecture that this is the case for alld but that NSand partition have asymptotically (asd becomes large) thesame performance.References: To date there appears to be no correct analysis,

analytic or experimental, of average seek time in physically

mirrored storage systems with more than two disks whichemploy the NS algorithm.

Anyone attempting to tackle this problem should read (andnot just cite)[12,13] first.

References

[1] N. Alon, E. Bachmat, Regular graphs whose subgraphs tend to beacyclic, Proceedings of Eurocomb, 2003.

[2] N. Alon, J.H. Spencer, P. Erdos, The Probabilistic Method, Wiley,New York, 1992.

[3] S.V. Anastasiades, K.C. Sevcik, M. Stumm, Maximizing throughputin replicated disk striping of variable bit-rate streams, UsenixTechnology Conference, 2002.

[4] O.I. Aven, E.G. Coffman, Y.A. Kogan, Stochastic Analysis of StorageSystems, D.Reidel, 1987.

[5] A. Bestavros, IDA-based redundant arrays of inexpensive disks, in:Proceedings of the IEEE Conference on Parallel and DistributedInformation Systems, 1991.

[6] A. Bestavros D. Chen, W. Wong, The reliability and performanceof parallel disks, Bell Labs Technical Report, 45312-891206-01TM,1989.

[7] R. Barve, E. Shriver, P.B. Gibbons, B.K. Hillyer, Y. Matias, J.S.Vitter, Modeling and optimizing throughput of multiple disks on abus, in: Proceedings of SIGMETRICS, 1999.

[8] M. Buddhikot, G. Parkular, Distributed data layout, scheduling andplayout control in a large scale multimedia storage server, TechnicalReport WUCS-94-33 Department of CS, Washington University, St.Louis, MO, 1994.

[9] D. Bitton, J. Gray, Disk shadowing, in: Proceedings of the 14thVLDB Conference, Los Angeles, August 1988, pp. 331–338.

[10] D. Bitton, Arm scheduling in shadowed disks, in: Proceedings of theIEEE Spring Compcon, February 1989, pp. 132–136.

[11] A. Borodin, R. El-Yaniv, Online Computation and CompetitiveAnalysis, Cambridge University Press, Cambridge, 1998.

[12] A.R. Calderbank, E.G. Coffman, L. Flatto, Sequencing problems intwo-server systems, Math. Oper. Res. 10 (4) (November 1985) 585–598.

[13] A.R. Calderbank, E.G. Coffman, L. Flatto, Optimum separation in adisk system with two read/write heads, J. ACM 31 (1984) 826–838.

[14] M.S. Chen, H.I. Hsiao, C.S. Li, P.S. Yu, Using rotational mirroreddeclustering for replica placement in a disk array based video server,in: Proceedings of the Third International Conference on Multimedia,1995, pp. 121–130.

[15] S. Chen, D. Towsley, A performance evaluation of RAID architecture,IEEE Trans. Comput. 45 (10) (October 1996) 1116–1130.

[16] P. Erdos, A. Renyi, On the evolution of random graphs, Magyar Tud.Akad. Mat. Kut. Int. Kozl 5 (1960) 17–61.

[17] W. Feller, Introduction to Probability Theory and Applications, vol.1, Wiley, New York, 1950.

[18] J. Gafsi, E.W. Biersack, Modeling and performance comparison ofreliability strategies for distributed video servers, IEEE Trans. ParallelDistrib. Systems 11 (4) (2000) 412–430.

[19] L. Golubchik, J.C. Lui, R.R. Muntz, Chained declustering: loadbalancing and robustness to skew and failures, in: Proceedings ofthe Second International Workshop on Research Issues in DataEngineering, 1992, pp. 88–95.

[20] P. Hall, On representatives of subsets, J. London Math. Soc. 10(1935) 26–30.

[21] M. Hofri, Should the two headed disk be greedy?-Yes, it should,Inform. Proc. Lett. 16 (1983) 83–85.

[22] H. Hsiao, D. Dewitt, Chained declustering: a new availability strategyfor multiprocessor database machines, in: Proceedings of the SixthInternational Data Engineering Conference, Los Angeles, February1990.

E. Bachmat, T.K. Lam / J. Parallel Distrib. Comput. 65 (2005) 382–395 395

[23] E. Koutsoupias, D.S. Taylor, The CNN problem and other k-servervariants, in: Proceedings of STACS 2000, 2000, pp. 581–592.

[24] T. Kwon, S. Lee, Data placement for continuous media in multimediaDBMS, in: Proceedings of the International Workshop on MultimediaDatabase Management Systems, 1995.

[25] R.W. Lo, N.S. Matloff, A probabilistic limit on the virtual size ofreplicated disk systems, IEEE Trans. Knowledge Data Eng. 4 (1)(1992) 99–102.

[26] M. Manzur-Murshed, M. Kaykobad, Seek distances in two headeddisk systems, Inform. Process. Lett. 57 (1) (1996) 205–209.

[27] A. Mourad, Doubly striped disk mirroring: reliable storage for videoservers, Multimedia, Tools Appl. 2 (1996) 253–272.

[28] Y. Manolopoulos, A. Vakali, Seek distances in disks with twoindependent heads per surface, Inform. Process. Lett. 35 (1991)37–42.

[29] Y. Manolopoulos, A. Vakali, Exact analysis of expected seeks inmirrored disks, Inform. Process. Lett. 61 (1997) 325–329.

[30] M. Rosenblum, J. Ousterhout, The design and implementation ofa log structured file system, ACM Trans. Comput. Systems 10 (1)(1992) 26–52.

[31] P. Sanders, Reconciling simplicity and realism in parallel disk models,in: Proceedings of SODA 2001, 2001, pp. 67–76.

[32] P. Sanders, S. Egner, J. Korst, Fast concurrent access to paralleldisks, in: Proceedings of SODA 2000, 2000, pp. 849–858.

[33] Teradata, DBC/1012 database computer system manual release 2.0,Doc. No. c10-0001-02, Teradata Corporation, November 1985.

[34] C.K. Wong, Algorithmic Studies in Mass Storage Systems, ComputerScience Press, 1983.