

Performance Evaluation 64 (2007) 664–689
www.elsevier.com/locate/peva

Queueing models of RAID systems with maxima of waiting times

Peter Harrison a, Soraya Zertal b,∗

a Imperial College London, South Kensington Campus, London SW7 2AZ, UK
b PRiSM, Université de Versailles, 45, Av. des Etats-Unis, 78000 Versailles, France

Received 27 February 2006
Available online 19 December 2006

Abstract

A queueing model is developed that approximates the effect of synchronizations at parallel service completion instants. Exact results are first obtained for the maxima of independent exponential random variables with arbitrary parameters, and this is followed by a corresponding approximation for general random variables, which reduces to the exact result in the exponential case. This approximation is then used in a queueing model of RAID (Redundant Array of Independent Disks) systems, in which accesses to multiple disks occur concurrently and complete only when every disk involved has completed. We consider the two most common RAID variants, RAID0-1 and RAID5, as well as a multi-RAID system in which they coexist. This can be used to model adaptive multi-level RAID systems in which the RAID level appropriate to an application is selected dynamically. The random variables whose maximum has to be computed in these applications are disk response times, which are modelled by the waiting times in M/G/1 queues. To compute the mean value of their maximum requires the second moment of queueing time and we obtain this in terms of the third moment of disk service time, itself a function of seek time, rotational latency and block transfer time. Sub-models for these quantities are investigated and calibrated individually in detail. Validation against a hardware simulator shows good agreement at all traffic intensity levels, including the threshold for practical operation above which performance deteriorates sharply.

© 2006 Elsevier B.V. All rights reserved.

Keywords: Mean max response times; Fork join processes; Multi RAID levels; IO requests; Storage systems

1. Introduction

Traditional, e.g. product-form, queueing networks cannot model synchronizations at parallel service completion instants. We approximate this effect in a queueing model of RAID (Redundant Array of Independent Disks) systems, derived by considering the explicit flow of control in the physical architecture. The contention in each parallel phase of processing is represented using an approach based on the M/G/1 queue. The synchronization time is then the maximum of a collection of M/G/1 queue sojourn times (also called waiting times or response times). We assume these sojourn times to be independent; initially, exponential random variables, as in an M/M/1 queue, and then general.

∗ Corresponding address: University of Versailles, Department of Computing, 45 Avenue des Etats-Unis, 78000 Versailles, France. Tel.: +33 1 39 25 43 41; fax: +33 1 39 25 40 57.

E-mail addresses: [email protected] (P. Harrison), [email protected] (S. Zertal).

0166-5316/$ - see front matter © 2006 Elsevier B.V. All rights reserved.
doi:10.1016/j.peva.2006.11.002

Based on initial work in [12], Section 2 derives an exact recurrence formula for the Laplace transform of the probability density function of the maximum of a set of independent exponential random variables, from which the mean and higher moments follow. In the special case that all the constituent exponential distributions are identical, the well-known result for the mean value of the maximum in terms of harmonic numbers follows immediately. The recurrence is then generalized to approximate the mean of the maximum of independent, generally distributed random variables. This simplifies to the previous exact result when the constituent distributions are exponential but in general requires their second moments. The accuracy of the approximation is assessed by comparison with simulation results obtained for Erlang and Pareto constituent distributions, which typify the cases of small and large variances respectively.

RAID storage systems and existing analytical models are briefly reviewed in Section 3 and the results of Section 2 are then used in our new multi-level RAID performance model in Section 4. This model assumes Poisson external requests but allows general disk seek, latency and transfer times. We determine the higher moments of the queueing time in the M/G/1 queue by differentiating its Laplace–Stieltjes transform at the origin. The second moment is then given in terms of the third moment of the service time, which is obtained in turn from the assumed distributions of seek time, rotational latency and block transfer time. Detailed studies of their principles of operation show that RAID levels 0–1 and 5 produce quite different demands on the disks in the array for each type of input-output access. This difference is amplified in the corresponding queueing times; it is seen in both the explicit simulation of the physical systems' operation and in the calculation of mean and variance of queueing time in the analytical model.

The accuracy of the model is assessed in Section 5 by comparing the analytical predictions with a simulation of the actual system at the operational level. The quantitative results are presented as graphs of mean system response time against traffic intensity, showing generally good agreement and hence providing justification for our approach. The validity of the assumption of Poisson arrivals was tested numerically by comparing with simulation models with non-Poisson input; the simulation output (mean response time) shows little change. This is consistent with the commonly observed robustness of the Poisson assumption for external arrivals. In addition, in Section 6 we further investigate possible causes of inaccuracy in the model's approximations. We isolate two possible sources, apart from the precision of the mean–max algorithm assessed in Section 2.4: (a) the representation of the delay at a single disk as the response time in an M/G/1 queue; and (b) the effect of assuming such response times are independent when arrivals actually occur simultaneously. The paper concludes in Section 7 with a summary of the present contribution, open questions and suggestions for further research.

2. Maximum of random variables

Suppose a task forks into a number of subtasks that are processed in parallel independently. The task's completion instant is that of the last subtask to complete processing, whereupon the subtasks combine (join) to re-form the original task. The fork-join time of the task, i.e. the time elapsed between the fork instant and the join instant, is therefore the maximum of the subtasks' processing times. In a Markovian environment, we derive the following:

Proposition 1. The maximum of $n$ independent, negative exponential random variables, with parameters $\alpha = (\alpha_1, \ldots, \alpha_n)$, has probability density function $f_n(\alpha, t)$ with Laplace transform $L_n(\alpha, s)$ given by the recurrence (for $s \ge 0$):

\[
\Bigl(s + \sum_{j=1}^{m} \alpha_j\Bigr) L_m(\alpha, s) = \sum_{j=1}^{m} \alpha_j L_{m-1}(\alpha_{\setminus j}, s) \tag{1}
\]

for $1 \le m \le n$, where $\alpha_{\setminus j} = (\alpha_1, \ldots, \alpha_{j-1}, \alpha_{j+1}, \ldots, \alpha_m)$ and $L_0(\varepsilon, s) = 1$, where $\varepsilon$ is the null vector of zero components.

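Proof.

\[
\begin{aligned}
L_m(\alpha, s) &= \int_0^\infty e^{-st} f_m(\alpha, t)\,dt \\
&= -\int_0^\infty e^{-st}\,\bar{F}'_m(\alpha, t)\,dt \\
&= 1 - s\int_0^\infty e^{-st}\,\bar{F}_m(\alpha, t)\,dt
\end{aligned} \tag{2}
\]

where $\bar{F}_m(\alpha, t) = 1 - F_m(\alpha, t)$ is the complementary distribution function of the maximum and the prime denotes differentiation with respect to $t$, so that $f_m(\alpha, t) \equiv F'_m(\alpha, t) = -\bar{F}'_m(\alpha, t)$. Now,

\[
F_m(\alpha, t) = \prod_{i=1}^{m} (1 - e^{-\alpha_i t})
\]

and so

\[
\begin{aligned}
\bar{F}'_m(\alpha, t) &= -\sum_{j=1}^{m} \alpha_j \Bigl[\prod_{i \ne j} (1 - e^{-\alpha_i t}) - \prod_{i=1}^{m} (1 - e^{-\alpha_i t})\Bigr] \\
&= \sum_{j=1}^{m} \alpha_j \bigl[\bar{F}_{m-1}(\alpha_{\setminus j}, t) - \bar{F}_m(\alpha, t)\bigr] \\
&= \sum_{j=1}^{m} \alpha_j \bigl[F_m(\alpha, t) - F_{m-1}(\alpha_{\setminus j}, t)\bigr].
\end{aligned}
\]

Thus, by Eq. (2),

\[
s L_m(\alpha, s) = \sum_{j=1}^{m} \alpha_j \bigl[L_{m-1}(\alpha_{\setminus j}, s) - L_m(\alpha, s)\bigr]
\]

and the result follows. □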

To start the recurrence, note that the maximum of zero non-negative random variables ($m = 0$) is zero with probability 1 and so its density function has Laplace transform which is the constant 1.

Notice that the following simple probabilistic argument proves this proposition. The maximum of the random variables is the sum of the minimum, i.e. the time up to the instant that the first exponential duration ends, and the maximum of the remaining $m - 1$ random variables. These two times are independent by the memoryless property of the exponential distribution and, moreover, the remaining times have the same exponential distribution as the corresponding full times, for the same reason. Finally, the $i$th random variable is the least with probability $\alpha_i/(\alpha_1 + \cdots + \alpha_m)$.
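As a worked illustration of this argument (not part of the original derivation), take $n = 2$. The minimum is exponential with rate $\alpha_1 + \alpha_2$, and the remaining variable is the one that did not finish first:

```latex
\begin{align*}
L_1(\alpha_j, s) &= \frac{\alpha_j}{s + \alpha_j}, \\
(s + \alpha_1 + \alpha_2)\, L_2(\alpha, s)
  &= \alpha_1 \frac{\alpha_2}{s + \alpha_2} + \alpha_2 \frac{\alpha_1}{s + \alpha_1}, \\
E[\max(T_1, T_2)]
  &= \frac{1}{\alpha_1 + \alpha_2}
   + \frac{\alpha_1}{\alpha_1 + \alpha_2}\cdot\frac{1}{\alpha_2}
   + \frac{\alpha_2}{\alpha_1 + \alpha_2}\cdot\frac{1}{\alpha_1}
   = \frac{1}{\alpha_1} + \frac{1}{\alpha_2} - \frac{1}{\alpha_1 + \alpha_2}.
\end{align*}
```

For example, with $\alpha = (1, 2)$ the mean of the maximum is $1 + 1/2 - 1/3 = 7/6$.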

2.1. Moments

From Proposition 1, we can immediately obtain the moments of the maximum of a set of exponential random variables.

Corollary 1. The $k$th moment $M_n(\alpha, k)$ of the maximum of $n \ge 1$ independent, negative exponential random variables with parameters $\alpha = (\alpha_1, \ldots, \alpha_n)$ is defined by the recurrence

\[
M_n(\alpha, k) = \frac{k}{\sum_{j=1}^{n} \alpha_j}\, M_n(\alpha, k-1) + \frac{\sum_{j=1}^{n} \alpha_j M_{n-1}(\alpha_{\setminus j}, k)}{\sum_{j=1}^{n} \alpha_j}
\]

for $n \ge 1$ and $M_0(\varepsilon, k) = 0$, for all $k \ge 1$, with $M_n(\alpha, 0) = 1$ for all $n \ge 0$.

Proof. Differentiating Eq. (1) $k$ times, using Leibniz's rule for differentiating products, and setting $s = 0$, we obtain

\[
\Bigl(\sum_{j=1}^{n} \alpha_j\Bigr) M_n(\alpha, k) - k M_n(\alpha, k-1) = \sum_{j=1}^{n} \alpha_j M_{n-1}(\alpha_{\setminus j}, k)
\]

and the result follows. □
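The recurrence of Corollary 1 translates directly into a memoized recursion. The following is a minimal sketch rather than the authors' implementation; the function name `max_exp_moment` is an illustrative choice, and the cost grows with the number of distinct subsets of rates, so it is only practical for small n.

```python
from functools import lru_cache


def max_exp_moment(rates, k):
    """k-th moment of the maximum of independent exponentials with the given rates.

    Direct transcription of the recurrence in Corollary 1 (rates must be positive).
    """
    @lru_cache(maxsize=None)
    def moment(rs, order):
        if order == 0:
            return 1.0                      # M_n(alpha, 0) = 1 for all n >= 0
        if not rs:
            return 0.0                      # M_0(eps, k) = 0 for all k >= 1
        total = sum(rs)
        # sum_j alpha_j * M_{n-1}(alpha with alpha_j removed, k)
        rest = sum(rs[j] * moment(rs[:j] + rs[j + 1:], order)
                   for j in range(len(rs)))
        return (order * moment(rs, order - 1) + rest) / total

    # Sorting makes permutations of the same rates share cache entries.
    return moment(tuple(sorted(rates)), k)


print(max_exp_moment([1.0, 2.0], 1))        # mean of max(Exp(1), Exp(2))
print(max_exp_moment([1.0, 1.0, 1.0], 2))   # second moment, three identical rates
```

With rates (1, 2) the mean agrees with the closed form 1/α1 + 1/α2 − 1/(α1 + α2) for two exponentials.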


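Corollary 2. In the special case that all the parameters of the exponential distributions are equal, $\alpha_j = \alpha$ for $1 \le j \le n$, we have:

\[
L_n(\alpha, s) = \frac{n!\,\alpha^n}{\prod_{m=1}^{n} (s + m\alpha)} \tag{3}
\]

\[
M_n(\alpha, k) = M_{n-1}(\alpha, k) + \frac{k}{n\alpha}\, M_n(\alpha, k-1) = \frac{k}{\alpha} \sum_{m=1}^{n} \frac{M_m(\alpha, k-1)}{m}. \tag{4}
\]

Note, in particular, from Eq. (4), we get for the mean of the maximum

\[
M_n(\alpha, 1) = \frac{1}{\alpha} \sum_{m=1}^{n} \frac{1}{m}.
\]

This special case is already well known, relating to the $n$th harmonic number; see [17] for a recent application in performance evaluation. Similarly,

\[
M_n(\alpha, 2) = \frac{2}{\alpha^2} \sum_{m=1}^{n} \sum_{i=1}^{m} \frac{1}{mi} = \frac{2}{\alpha^2} \sum_{m=1}^{n} \sum_{i=1}^{m-1} \frac{1}{mi} + \frac{2}{\alpha^2} \sum_{m=1}^{n} \frac{1}{m^2}
\]

and

\[
M_n(\alpha, 1)^2 = \frac{1}{\alpha^2} \Bigl[\sum_{m=1}^{n} \frac{1}{m^2} + \sum_{i \ne m} \frac{1}{mi}\Bigr] = \frac{1}{\alpha^2} \Bigl[\sum_{m=1}^{n} \frac{1}{m^2} + 2 \sum_{m=1}^{n} \sum_{i=1}^{m-1} \frac{1}{mi}\Bigr].
\]

Hence, the variance of the maximum random variable out of the $n$ is

\[
V_n(\alpha) = \frac{1}{\alpha^2} \sum_{m=1}^{n} \frac{1}{m^2}.
\]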

Again, these results follow immediately from the above probabilistic argument. The maximum of $n$ exponential random variables consists of a sum of one exponential random variable with parameter $n\alpha$ and the maximum of $n - 1$ random variables, which are independent. Consequently, the maximum is a sum of $n$ independent, exponential random variables with parameters $n\alpha, (n-1)\alpha, \ldots, 1\alpha$.

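Notice that, as $n \to \infty$, $V_n(\alpha) \to \pi^2/(6\alpha^2)$, which is finite, whereas the mean $M_n(\alpha, 1)$ grows without bound (like $\ln n/\alpha$). The coefficient of variation therefore tends to zero, i.e. the maximum of $n$ independent exponential random variables becomes relatively more consistent as $n \to \infty$. This is not surprising since the maximum value will not change unless a newly included random variable happens to be the biggest. This will occur with probability $1/n$, which approaches zero. Of course, the same argument applies to any $n$ identically distributed random variables provided they are independent; the specific distribution does not matter.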
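The closed forms for identical parameters are cheap to evaluate. The sketch below (function names are illustrative, not from the paper) computes the mean and variance of the maximum from the harmonic sums above, together with the coefficient of variation, which shrinks as n grows.

```python
import math


def iid_max_mean(n, alpha):
    """Mean of the maximum of n i.i.d. Exp(alpha) variables: H_n / alpha."""
    return sum(1.0 / m for m in range(1, n + 1)) / alpha


def iid_max_variance(n, alpha):
    """Variance of the maximum: (1 / alpha^2) * sum_{m=1..n} 1 / m^2."""
    return sum(1.0 / (m * m) for m in range(1, n + 1)) / alpha ** 2


def iid_max_cv(n, alpha):
    """Coefficient of variation (standard deviation divided by mean)."""
    return math.sqrt(iid_max_variance(n, alpha)) / iid_max_mean(n, alpha)


for n in (1, 2, 4, 8, 16, 1024):
    print(n, iid_max_mean(n, 1.0), iid_max_variance(n, 1.0), iid_max_cv(n, 1.0))
```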

2.2. Clusters of identical exponential random variables

For a large RAID system (for example) composed of disks with different arrival and/or service rates, the computational cost of the above expressions can be very high. However, we may be able to regroup the disks into a small number of clusters with a similar arrival rate within each cluster. Suppose a set of exponential distributions forms $c$ clusters with $n_i$ exponential random variables having parameter $\alpha_i$ in cluster $i$, $1 \le i \le c$. Then we immediately have:

Proposition 2. The maximum of $\sum_{a=1}^{c} n_a$ independent, negative exponential random variables in $c$ clusters, with rate $\alpha_a$ in cluster $a$ ($1 \le a \le c$), has a probability density function whose Laplace transform $L(n, s)$ is given by (for $s \ge 0$):

\[
\Bigl(s + \sum_{a=1}^{c} n_a \alpha_a\Bigr) L(n, s) = \sum_{a:\,n_a > 0} n_a \alpha_a L(n_{a-}, s) \tag{5}
\]

for $n \ne 0$ and $L(0, s) = 1$, where $n = (n_1, \ldots, n_c)$, $n_{a-} = (n_1, \ldots, n_{a-1}, n_a - 1, n_{a+1}, \ldots, n_c)$, $0 = (0, \ldots, 0)$. The $k$th moment $M(n, k)$ is given by:

\[
\Bigl(\sum_{a=1}^{c} n_a \alpha_a\Bigr) M(n, k) = k M(n, k-1) + \sum_{a:\,n_a > 0} n_a \alpha_a M(n_{a-}, k) \tag{6}
\]

for $n \ne 0$ and $M(0, k) = 0$, for all $k \ge 1$, with $M(n, 0) = 1$.
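A sketch of Eq. (6) in code shows where the saving comes from: the recursion state is the vector of cluster occupancies, so the number of distinct states is the product of (n_a + 1) over clusters rather than the number of subsets of individual variables. The function name and argument layout are assumptions made for illustration.

```python
from functools import lru_cache


def clustered_max_moment(counts, rates, k):
    """k-th moment of the maximum of sum(counts) exponentials grouped in clusters.

    counts[a] variables share rate rates[a]; transcription of Eq. (6).
    """
    rates = tuple(rates)

    @lru_cache(maxsize=None)
    def moment(ns, order):
        if order == 0:
            return 1.0                      # M(n, 0) = 1
        if not any(ns):
            return 0.0                      # M(0, k) = 0 for k >= 1
        total = sum(n * a for n, a in zip(ns, rates))
        # Remove one variable from each non-empty cluster in turn.
        rest = sum(n * a * moment(ns[:i] + (n - 1,) + ns[i + 1:], order)
                   for i, (n, a) in enumerate(zip(ns, rates)) if n > 0)
        return (order * moment(ns, order - 1) + rest) / total

    return moment(tuple(counts), k)


# Two clusters: 8 variables with rate 1.0 and 8 with rate 1.5 (hypothetical values).
print(clustered_max_moment((8, 8), (1.0, 1.5), 1))
```

With a single cluster the result reduces to the harmonic-number formula of Corollary 2, and with one variable per cluster it reduces to Corollary 1.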

2.3. Mean of the maximum of general random variables

We now derive an approximation for the mean value of the maximum of a set of independent, non-exponential random variables. First consider $T = \max(T_1, T_2)$ for non-negative random variables $T_1, T_2$ with distribution functions $F_1(t), F_2(t)$ having LSTs $F_1^*(\theta), F_2^*(\theta)$ respectively. Then,

\[
\begin{aligned}
E[T] &= E[T_1] + P(T_2 > T_1)\,E[T_2 - T_1 \mid T_2 > T_1] \\
&= E[T_1] + \int_0^\infty F_1(t)\,dF_2(t)\;E[T_2 - T_1 \mid T_2 > T_1].
\end{aligned}
\]

In the special case that $T_2$ is exponential, with parameter $\alpha_2$ say, we get

\[
E[T] = m_1 + F_1^*(\alpha_2)\,E[T_2 - T_1 \mid T_2 > T_1]
\]

where $m_1$ is the mean of $T_1$. We make the approximating assumption that, at time instant $T_1 < T_2$, the random observer property holds with respect to $T_2$. This assumption is, of course, valid in the special case that $T_1$ is exponential. Then we have

\[
E[T_2 - T_1 \mid T_2 > T_1] = \frac{M_2}{2 m_2}
\]

where $m_2$, $M_2$ are the mean and second moment of $T_2$ respectively. This is $\alpha_2^{-1}$ when $T_2$ is exponential. We end up with the approximation

\[
E[T] = m_1 + \frac{M_2\,F_1^*(m_2^{-1})}{2 m_2}. \tag{7}
\]

By construction, this result is exact if both $T_1$ and $T_2$ are exponential. Otherwise, in general, we need to approximate the Laplace transform of the density of the maximum of $k - 1$ random variables when considering the maximum of $k$. To do this we use Eq. (1), which is the correct Laplace transform if the maximized random variables are all exponential. We immediately obtain the following approximation:

The expected value of the maximum of $n$ independent, non-negative random variables with means $m = (m_1, \ldots, m_n)$, $\alpha = (m_1^{-1}, \ldots, m_n^{-1})$ and second moments $M = (M_1, \ldots, M_n)$ is approximated by the function $I(n, \alpha, M)$ defined by the recurrence, for $k = 2, \ldots, n$,

\[
I(k, \alpha, M) = \frac{1}{k} \sum_{i=1}^{k} \Bigl[ I(k-1, \alpha_{\setminus i}, M_{\setminus i}) + \alpha_i M_i L_{k-1}(\alpha_{\setminus i}, \alpha_i)/2 \Bigr] \tag{8}
\]

\[
I(1, \alpha_1, M_1) = 1/\alpha_1
\]

where $L_{k-1}(\alpha_{\setminus i}, s)$ is the Laplace transform of the probability density function of the maximum of $k - 1$ exponential random variables with parameters $\alpha_{\setminus i}$, defined by Proposition 1.

Again, by construction (an easy inductive proof), the result is exact if all the random variables are exponential. Notice that, when exact, all the summands give the same result. When approximate, the result is the average obtained by picking each of the $k$ random variables in turn as the last in the sequence, and maximizing this and the maximum of the rest.
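The recurrence (8) can be sketched as follows, with the transform $L_{k-1}$ evaluated through the recurrence of Proposition 1. This is an illustrative transcription under the definitions above, not the authors' code; the names `laplace_max_exp` and `approx_mean_max` are assumptions, and the recursion visits subsets of the variables, so it suits small n.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def laplace_max_exp(rates, s):
    """L_m(alpha, s) via the recurrence of Proposition 1; rates is a tuple."""
    if not rates:
        return 1.0                          # L_0(eps, s) = 1
    num = sum(rates[j] * laplace_max_exp(rates[:j] + rates[j + 1:], s)
              for j in range(len(rates)))
    return num / (s + sum(rates))


def approx_mean_max(means, second_moments):
    """Approximate mean of the maximum of independent variables, Eq. (8)."""
    # Each variable is represented by (alpha_i, M_i) with alpha_i = 1 / m_i.
    items = tuple(sorted((1.0 / m, M) for m, M in zip(means, second_moments)))

    @lru_cache(maxsize=None)
    def approx(its):
        if len(its) == 1:
            return 1.0 / its[0][0]          # I(1, alpha_1, M_1) = 1 / alpha_1
        total = 0.0
        for i, (a_i, m2_i) in enumerate(its):
            rest = its[:i] + its[i + 1:]
            rest_rates = tuple(a for a, _ in rest)
            total += approx(rest) + a_i * m2_i * laplace_max_exp(rest_rates, a_i) / 2.0
        return total / len(its)

    return approx(items)


# Two exponentials with means 1 and 0.5 (second moment 2 * mean^2): exact case.
print(approx_mean_max([1.0, 0.5], [2.0, 0.5]))
# Four unit-mean variables with second moment 1.5.
print(approx_mean_max([1.0] * 4, [1.5] * 4))
```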

Table 1
Comparison with Erlang (low-variance)

 N   Exp-1   Erlang-2                  Erlang-3                  Erlang-4
             Mod     Sim     % err     Mod     Sim     % err     Mod     Sim     % err
 1   1.000   1.000   1.003   −0.334    1.000   0.999    0.062    1.000   0.999    0.060
 2   1.500   1.375   1.373    0.135    1.313   1.271    3.281    1.281   1.195    7.207
 4   2.083   1.813   1.772    2.265    1.677   1.546    8.448    1.609   1.380   16.64
 8   2.718   2.288   2.182    4.881    2.074   1.806   14.84     1.966   1.555   26.43
16   3.381   2.786   2.588    7.648    2.488   2.061   20.74     2.339   1.716   36.30

Finally, we look at the special case where all the parameters are equal for all $i$, say $\alpha_i = \alpha$ and $M_i = M$ for $1 \le i \le n$. We then have, from Corollary 2,

\[
L_{k-1}(\alpha, \alpha) = 1/k
\]

so that

\[
I(k, \alpha, M) = I(k-1, \alpha, M) + \frac{M\alpha}{2k}
\]

hence

\[
I(k, \alpha, M) = 1/\alpha + (M\alpha/2) \sum_{i=2}^{k} 1/i.
\]
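For identical variables the closed form is a one-line computation. The sketch below (the function name is illustrative) evaluates it for unit-mean variables with second moment 1.5, the Erlang-2 value used in Section 2.4, which reproduces the Erlang-2 "Mod" column of Table 1, and with second moment 2, which reproduces the exponential (harmonic number) column.

```python
def approx_mean_max_identical(n, alpha, second_moment):
    """Closed form I(n, alpha, M) = 1/alpha + (M * alpha / 2) * sum_{i=2..n} 1/i."""
    return 1.0 / alpha + (second_moment * alpha / 2.0) * sum(
        1.0 / i for i in range(2, n + 1))


for n in (1, 2, 4, 8, 16):
    erlang2 = approx_mean_max_identical(n, 1.0, 1.5)   # unit mean, M = 1.5
    expo = approx_mean_max_identical(n, 1.0, 2.0)      # unit mean, M = 2
    print(n, round(expo, 3), round(erlang2, 3))
```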

2.4. Accuracy of the approximation

A pilot assessment of the accuracy of the approximation described in the previous section compared it against simulations of the maxima of a number $N$ of identical random variables of two types: Erlang and Pareto. The simulations were run 100,000 times, giving 98% confidence bands of the order 0.01, cf. [12].

Each test distribution was standardized to have unit mean value so that the approximate mean–maximum is determined solely by the second moment. Notice that, even when the variance is zero, the second moment is the square of the mean, viz. 1. Consequently, the approximation's estimate will always diverge as the number of parallel random variables maximized increases. Thus, for $N$ deterministic random variables, here each equal to 1 with probability 1, the exact mean–maximum is 1 whereas the approximation diverges to infinity with $N$. Thus, the approximation is not appropriate for small variances. This is illustrated in Table 1 where the approximation is tested for Erlang-2, Erlang-3 and Erlang-4 distributions. The mean of a $k$-phase, Erlang-$k$ distribution with parameter $\lambda$ is $k/\lambda$ and so we choose $\lambda = k$. The variance is therefore $k/\lambda^2 = 1/k$, which tends to zero as $k \to \infty$. Thus the approximation deteriorates at larger $k$, as we see from Table 1. The second moment of the $k$-phase Erlang is $1 + 1/k$ and we see a 36% error for 16 parallel Erlang-four random variables. Each of these has variance 0.25 and so we see poor agreement at moderately small variances for more than eight parallel random variables, all overestimates as expected. However, for up to four in parallel, the accuracy is quite acceptable; this happens in reads from mirrored disks and RAID accesses with small numbers of blocks. Also included in each row of the table is the mean of the maximum of $N$ parallel exponential random variables, each with unit parameter. By Corollary 1, this is just the $N$th harmonic number and it can be seen that it overestimates seriously; more than double the error in its best case of 16 parallel Erlang-four distributions.
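A comparison of this kind can be sketched with a small Monte Carlo experiment. This is not the authors' simulator; the function name, run count and seed are illustrative assumptions. It samples unit-mean Erlang-k variables with the standard library's `random.gammavariate` (shape k, scale 1/k) and averages the maximum of n of them.

```python
import random


def simulate_mean_max_erlang(n, k, runs, seed=0):
    """Monte Carlo estimate of E[max of n i.i.d. unit-mean Erlang-k variables]."""
    rng = random.Random(seed)               # fixed seed for reproducibility
    total = 0.0
    for _ in range(runs):
        # gammavariate(shape, scale): shape k and scale 1/k give mean 1, variance 1/k.
        total += max(rng.gammavariate(k, 1.0 / k) for _ in range(n))
    return total / runs


def approx_mean_max_unit(n, k):
    """Approximation for unit-mean Erlang-k: second moment 1 + 1/k, alpha = 1."""
    second_moment = 1.0 + 1.0 / k
    return 1.0 + (second_moment / 2.0) * sum(1.0 / i for i in range(2, n + 1))


for n in (1, 2, 4, 8, 16):
    print(n, approx_mean_max_unit(n, 2), simulate_mean_max_erlang(n, 2, 20000))
```

The simulated values fluctuate with the seed and the number of runs, so only agreement to within the sampling error should be expected.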

However, in practice, waiting times in queues tend not to have very low variance; it would perhaps be easier to predict if they did. Consequently, we tested the accuracy of the approximation near the opposite extreme, against high variance, heavy-tailed Pareto distributions. Again these were chosen to have unit mean and zero distribution function at the origin. The form of the distributions chosen is $F_P(x) = 1 - \alpha(x + \gamma)^{-\beta}$, where $\beta > 2$ for the first two moments to be finite. In order to pass through the origin and have unit mean, we require $\alpha = \gamma^\beta$ and $\gamma = \beta - 1$. This gives a second moment $M_2 = 2 + 2/(\beta - 2)$, which we use to parameterise the approximation. We call a Pareto distribution with these properties Pareto-$\beta$ and compare our approximation with simulation for the mean–maximum of Pareto-4 and Pareto-5 random variables; see Table 2.
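The parameter constraints and the second moment quoted above follow from integrating the complementary distribution function $\alpha(x + \gamma)^{-\beta}$; the steps, filled in here for convenience, are:

```latex
\begin{align*}
F_P(0) = 0 &\;\Longrightarrow\; \alpha\gamma^{-\beta} = 1 \;\Longrightarrow\; \alpha = \gamma^{\beta}, \\
E[X] = \int_0^\infty \alpha (x + \gamma)^{-\beta}\,dx = \frac{\gamma}{\beta - 1} = 1
  &\;\Longrightarrow\; \gamma = \beta - 1, \\
M_2 = 2\int_0^\infty x\,\alpha (x + \gamma)^{-\beta}\,dx
  = \frac{2\gamma^2}{(\beta - 1)(\beta - 2)}
  &= \frac{2(\beta - 1)}{\beta - 2} = 2 + \frac{2}{\beta - 2}.
\end{align*}
```

Hence $M_2 = 3$ for Pareto-4 and $M_2 = 8/3$ for Pareto-5.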

Table 2
Comparison with Pareto (high-variance)

 N   Exp-1   Pareto-4                  Pareto-5
             Mod     Sim     % err     Mod     Sim     % err
 1   1.000   1.000   1.004   −0.381    1.000   0.994   0.614
 2   1.500   1.750   1.579   10.82     1.667   1.567   6.350
 4   2.083   2.625   2.327   12.81     2.444   2.269   7.744
 8   2.718   3.577   3.261    9.698    3.290   3.129   5.173
16   3.381   4.571   4.394    4.027    4.174   4.153   0.512

Fig. 1. Requests flow in a RAID storage system.

It can be seen that the agreement is much better here than for the low variance cases. In fact the approximation is at its worst for moderately small numbers in parallel ($N$), improving as $N$ reaches 16. As expected, the approximation improves as the parameter $\beta$ increases, giving a lower variance closer to that of the exponential, 1. The exponential mean–maximum values are repeated in this table and show underestimates, again as expected since the Pareto second moments are greater than that of an exponential random variable with mean 1, viz. 2. This indicates a degree of flexibility in the new approximation.

This preliminary validation of the approximation suggests at the very least that many mean–maxima of waiting times will be well approximated by the recurrence of the previous section. Recall too that, when these waiting times are exponential, the recurrence is exact. Indeed, if the waiting times are phase-type, the mean of their maximum can also be computed exactly, the maximum also being phase-type. This calculation has exponential complexity in $N$ but an efficient polynomial approximation was obtained in [3]. This could be used in cases where ours is too inaccurate, for example low variance Erlang distributions, a special case of phase-type.

3. RAID storage system

A RAID storage system consists of a disk system manager and a collection (array) of independent disks. The disk system manager is a software component; it receives requests from the multiple system users. These requests are considered logical because they are independent of the physical configuration of the storage system. Requests may arrive from different users at various rates $\lambda'_j$. The disk system manager subdivides the data into blocks called stripe units and distributes them across the collection of disks. Consequently, for each logical request, it generates a number of physical requests and sends them to the associated disks. Each disk $i$ of the array receives physical requests at rate $\lambda_i$, as shown in Fig. 1. Finally, the disk system manager waits for the (physical) responses from each requested disk to construct the (logical) response to each logical request, which it then sends to the corresponding user.

The request subdivision and distribution process is performed according to the data-placement/redundancy pattern over the disks. In fact, there are various RAID levels corresponding to these patterns [7,8], but we are interested in the two most common and useful ones: RAID0-1 and RAID5. Added to requests' independent executions on the asynchronous disks, these access patterns introduce fork-join problems of the type we have described and analysed in Section 2. The model we develop uses the notation in Table 3.

Table 3
Notation for the parameters of the multi-RAID models

Parameter                  Description
$N$                        The number of disks in the storage system.
$C$                        The number of cylinders on a disk.
$B$                        The logical request size in terms of transfer blocks.
$Q_i$                      The waiting or queueing time at disk $i$.
$S_i$                      The seek time on disk $i$.
$R_i$                      The rotational latency to move from a random point to the target block.
$R_{MAX}$                  The full disk rotation time.
$t$                        The transfer time of one disk data-block from or to a cylinder.
$T$                        The bus transfer time of one disk data-block.
$\lambda$                  The logical request arrival rate to the storage system.
$P_j$                      The proportion of the RAID $j$ area in the whole storage space.
$\lambda_{Rj} = \lambda P_j$   The logical request arrival rate to the RAID $j$ area.
$p_{ij}$                   The probability that a given RAID $j$ block is on disk $i$.
$\lambda_i$                The physical request arrival rate to disk $i$.
$\lambda_{ij}$             The physical request arrival rate to a RAID $j$ area on disk $i$.
$Z_r(i)$                   The response time for a read request on disk $i$.
$Z_w(i)$                   The response time for a write request on disk $i$.
$Z_r$                      The mean response time for read requests in the storage system.
$Z_w$                      The mean response time for write requests in the storage system.
$Z$                        The mean response time for any request in the storage system.
$p_w$                      The probability that a request is a write.
$p_r$                      The probability that a request is a read.
$p_s$                      The probability that a request's access is sequential.

Fig. 2. RAID0-1 and RAID5 levels.

3.1. RAID levels

In the RAID0-1 level, both shadowing (full redundancy) and striping are used. The disk collection is divided into two groups: native disks and mirror disks, which are both subdivided into stripe units. All data are duplicated and distributed on both the native disks and the mirror disks as shown in Fig. 2. A read physical request is sent to the native or to the mirror disk while a write physical request is sent to both of them in order to maintain the native and mirror data coherency. In the RAID5 level, block striping and parity based redundancy are used to improve performance in the sense of the rate of processing of logical requests at low cost. The redundancy units are spread across the disks in a cyclic manner. Thus, the redundancy disk may be different for every stripe,¹ which enhances the writes' parallelism [2].
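To make the two placement patterns concrete, the sketch below shows one possible mapping. It is a hypothetical layout chosen for illustration only: the parity rotation rule, the disk numbering (mirror disks numbered after native disks) and the function names are assumptions, and a read is always sent to the native disk here although either copy could serve it.

```python
def raid5_stripe_layout(stripe, n_disks):
    """Return (parity_disk, data_disks) for one stripe under a rotating-parity layout.

    Assumed rule: parity moves one disk to the left on each successive stripe.
    """
    parity = (n_disks - 1 - stripe) % n_disks
    data = [d for d in range(n_disks) if d != parity]
    return parity, data


def raid01_targets(block, n_native, is_write):
    """Disks touched by one stripe unit under RAID0-1 with n_native native disks.

    Assumed numbering: native disks 0..n_native-1, mirrors n_native..2*n_native-1.
    """
    native = block % n_native               # striping across the native disks
    mirror = native + n_native              # the corresponding mirror disk
    # A write goes to both copies; this sketch sends every read to the native copy.
    return [native, mirror] if is_write else [native]


for stripe in range(4):
    print(stripe, raid5_stripe_layout(stripe, 4))
print(raid01_targets(5, 3, is_write=True))
```

Under this rule each disk holds the parity of exactly one stripe in every group of `n_disks` consecutive stripes, which is what spreads parity updates across the array.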

3.2. Multi-level RAID storage system

RAID levels have different characteristics and each of them gives best performance for a quite narrow range of applications. It is desirable to tailor a placement scheme to a set of data according to its workload characteristics: e.g. the type (read/write), the size and the arrival rate of requests. Coexisting, multiple RAID levels can ensure good exploitation of the storage system's space and optimise its performance by using data/redundancy placement policies and request execution schedules appropriate to each workload class [22].

¹ A stripe is a collection of native data blocks (stripe units) stored on a subset of the disks, with the redundancy stored on another disk.

3.3. RAID analytical models

Several analytical methods have been developed to study and evaluate RAID system performance. Amongst the earliest, [4,5,16,6] approximated access latencies. These were followed by various other contributions, each of which focused on a different aspect. For example, [18–20] considered the effect of the execution mode (normal, degraded or recovery) on performance, [1,11] investigated fault tolerance, and [9,10,21] analysed the effects of caching and proposed optimizations for both caching mechanisms and the controller.

However, none of these models is capable of describing the performance of a multi-level RAID architecture. In fact, each disk of such a storage system can contain data stripe units belonging to both RAID0-1 and RAID5 organizations, which results in a significant difference in loading between the disks. In turn, this unbalanced workload leads to a difference between the disks' waiting times, even for physically homogeneous disks. The use of asynchronous disks introduces a significant difference between the seek times by allowing independent execution of requests. Consequently, all the typical assumptions about synchronous disks, well balanced load across the disks and similar physical requests' arrival patterns become invalid for multi-level RAID storage systems, and hence also any models based on them. The performance of such systems is what we address. The synchronization and access latency problems were first considered in [23,12], the former concentrating on the detailed modelling of the actual operations comprising a read or write access. Preliminary calibration of the resulting model was presented in [13].

4. The multi-level RAID analytical model

Our aim is to determine the mean logical request response time for data stored according to RAID0-1 and RAID5 patterns in a single multi-level RAID storage system. We consider relevant hardware parameters and requests' execution schedules, for which we give task graphs to highlight the one or two synchronization points. We then determine the mean logical request response time using the fork-join model of Section 2 in an M/G/1 queueing context.

4.1. Mean response times

Each disk is modelled by an M/G/1 queue of physical requests. It serves tasks comprising both read/write requests and parity pre-read/update requests. Each physical request relates to a number of blocks of data (the physical request size) and leads to a single disk access in read or write mode. The response time of each physical request is composed of four components: the waiting or queueing time in the disk queue ($Q$), the seek time ($S$), the rotational latency ($R$) and the transfer time which we separate into two components ($T$ and $t$).

• Queueing time, $Q_i$
Since each disk is modelled by an M/G/1 queue, the mean queueing time is calculated using the Pollaczek–Khinchin formula [15], extended to handle multiple classes:

\[
E[Q_i] = \frac{\sum_{j=1,5} \lambda_{ij}\,\overline{X_{ij}^2}}{2(1 - \rho_i)} \tag{9}
\]

where, referring to Table 3,
– $X_{ij} = S_i + R_i + K_{ij} t$ is the total head displacement time random variable for a RAID $j$ request on disk $i$, with mean $\overline{X}_{ij}$ and second moment $\overline{X_{ij}^2}$;
– $\rho_i = \sum_{j=1,5} \lambda_{ij} \overline{X}_{ij}$ is the traffic intensity on disk $i$;
– $\lambda_i = \lambda_{i1} + \lambda_{i5}$;
– $K_{ij}$ is the random variable denoting the number of blocks in a physical request of RAID $j$ type ($j = 1, 5$), a workload parameter.

These quantities depend on the moments of $X_{ij}$ and hence on those of $S_i$ and $R_i$. They are determined in the next subsection. Notice that a write generates extra traffic because of the additional I/O transfers required: either writes to a mirror disk or pre-reads, depending on the RAID variant in use. This is reflected in the value of $\lambda_i$, which is considered separately, for each RAID variant.

• Seek time, $S_i$
The seek time depends on the distance $D$ between the current position of the device’s read/write head and the target position. It is commonly calculated, for hardware such as we are considering, according to [5] as follows:

\[ S = \begin{cases} 0 & \text{if } D = 0 \\ a + b\sqrt{D} & \text{otherwise} \end{cases} \tag{10} \]

where $a$, $b$ are hardware-related constants. We assume that the incoming logical requests’ addresses are independent random variables, uniformly distributed over the disk-address space. The distance $D$ can then be well approximated by a continuous random variable with density function

\[ f_D(x) = p_s\,\delta(x) + (1-p_s)\,\frac{2(C-1-x)}{(C-1)^2} \qquad (0 \le x \le C-1). \tag{11} \]

The term $2(C-1-x)/(C-1)^2$ is the probability density function of the difference between two uniform random variables on $[0, C-1]$, where $C$ is the number of cylinders on disk $i$ and $\delta(x)$ is the Dirac delta-function (unit impulse). For simplicity, we have assumed that all disks have the same hardware parameters, including $a$, $b$, $C$. However, it would be easy to extend our model to heterogeneous devices. The quantity $p_s$ is the probability that a given physical request addresses the same track as the previous one, i.e. requires no seek. It is a workload parameter that we estimate by $1/C$, consistent with our assumption of uniformity. Its effect is therefore negligible here.

• Rotational latency, $R_i$
The rotational latency is assumed to be a random variable with Uniform distribution on the interval $[0, R_{MAX}]$, with density function:

\[ f_R(x) = \frac{1}{R_{MAX}} \qquad (0 \le x \le R_{MAX}) \tag{12} \]

for all disks $i$.

• Single block bus transfer time, $T$
Assuming negligible contention on buses, the block transfer time is a constant $T$, denoting the sum of the transfer time of the device (from the disk buffer to the bus) and the bus transfer time (on the bus connecting the disk to the disk system manager).

• Single block disk transfer time, $t$
This is the time it takes to transfer one block to or from a cylinder.
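The seek and latency models of Eqs. (10)–(12) can be explored numerically. The following is a minimal Python sketch under stated assumptions, not the authors' simulator: it draws $D$ as the absolute difference of two uniform cylinder positions (whose density is the triangular term in Eq. (11)), takes $p_s = 1/C$ as in the text, and uses the disk parameters quoted in Section 5 purely as illustrative inputs. All function names are hypothetical.

```python
import math
import random


def sample_seek_time(rng, a, b, cylinders):
    """Draw one seek time S following Eqs. (10)-(11), with p_s = 1/C."""
    p_s = 1.0 / cylinders
    if rng.random() < p_s:
        return 0.0  # same track as the previous request: no seek
    span = cylinders - 1
    # |U1 - U2| for uniform U1, U2 on [0, C-1] has density 2(C-1-x)/(C-1)^2
    distance = abs(rng.uniform(0, span) - rng.uniform(0, span))
    return 0.0 if distance == 0 else a + b * math.sqrt(distance)


def sample_latency(rng, r_max):
    """Draw one rotational latency R, uniform on [0, RMAX] as in Eq. (12)."""
    return rng.uniform(0, r_max)


def sample_mean(draw, n):
    """Average n independent draws from the zero-argument callable `draw`."""
    return sum(draw() for _ in range(n)) / n


rng = random.Random(2007)  # fixed seed so that runs are repeatable
mean_seek = sample_mean(lambda: sample_seek_time(rng, 3.0, 0.5, 1200), 100_000)
mean_latency = sample_mean(lambda: sample_latency(rng, 16.7), 100_000)
print(f"mean seek ~ {mean_seek:.2f} ms, mean latency ~ {mean_latency:.2f} ms")
```

Sample averages obtained this way give an independent check on the closed-form means derived by direct integration from the same densities.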

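4.1.1. Mean values
From the above probability density functions, we calculate the following by direct integration:

\[ \overline{X_{ij}} = \overline{S_i} + \overline{R_i} + \overline{K_{ij}}\,t \]
\[ \overline{S_i} = (1-p_s)\left(a + \frac{8b\sqrt{C-1}}{15}\right) \]
\[ \overline{R_i} = \frac{R_{MAX}}{2}. \]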

Since the quantity $t$ is small, the contribution of the quantities $K_{ij}$ is negligible and we may take $K_{ij} = 1$ with probability 1 for every $i, j$. Then we have $X_{ij} \simeq X_i$ for $j = 1, 5$ and

\[ E[Q_i] = \frac{\lambda_i\,\overline{\overline{X_i}}}{2(1-\rho_i)}. \]

A more sophisticated approximation could be obtained by workload profiling, to estimate the moments of the $K_{ij}$, and be relevant for physical requests with very large numbers of blocks.


4.1.2. Higher moments
We denote the $n$th moment of a random variable by $n$ overbars, cf. the mean values used above. To approximate the mean of the maximum of non-exponential random variables by the method of Section 2.3 requires the second moment of the queueing time, $\overline{\overline{Q_i}} = E[Q_i^2]$. As we shall see, this requires the third moments of $S_i$ and $R_i$.

In an M/G/1 queue with arrival rate $\Lambda$, service time random variable $X$ with distribution function $X(t)$ and Laplace–Stieltjes transform (LST) $X^*(\theta)$, the queueing time $Q$ has distribution function with LST given by (see [15], for example):

\[ \bigl(\theta - \Lambda(1 - X^*(\theta))\bigr)\,Q^*(\theta) = (1-\rho)\,\theta. \]

Differentiating twice with respect to $\theta$ and setting $\theta = 0$ gives $\overline{Q} = \Lambda\overline{\overline{X}}/(2(1-\rho))$, leading to Eq. (9). Differentiating thrice at $\theta = 0$ gives the required second moment:

\[ \overline{\overline{Q}} = \frac{\Lambda^2\,\overline{\overline{X}}^{\,2}}{2(1-\rho)^2} + \frac{\Lambda\,\overline{\overline{\overline{X}}}}{3(1-\rho)}. \]
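The two queueing-time moments above translate directly into a few lines of code. The sketch below is illustrative only; the function name and argument names are assumptions, and the check uses exponential service (raw moments 1, 2, 6 for unit mean) because the M/M/1 waiting-time moments are known independently.

```python
def mg1_queueing_moments(arrival_rate, x1, x2, x3):
    """First two moments of M/G/1 queueing time from the first three
    raw moments (x1, x2, x3) of service time."""
    rho = arrival_rate * x1  # traffic intensity
    if not 0 <= rho < 1:
        raise ValueError("unstable queue: traffic intensity must be below 1")
    q1 = arrival_rate * x2 / (2 * (1 - rho))
    q2 = (arrival_rate ** 2 * x2 ** 2) / (2 * (1 - rho) ** 2) \
        + arrival_rate * x3 / (3 * (1 - rho))
    return q1, q2


# Exponential service with unit mean, arrival rate 0.5 (an M/M/1 queue).
q1, q2 = mg1_queueing_moments(0.5, 1.0, 2.0, 6.0)
print(q1, q2)
```

For this M/M/1 case the waiting time is zero with probability $1-\rho$ and exponential with rate $\mu - \Lambda$ otherwise, which gives mean 1 and second moment 4 and so provides a quick sanity check.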

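We already have the mean values $\overline{X_{ij}}$ and now need to calculate the corresponding results for the second and third moments. First we neglect the small quantity $t$ and consider the random variable $Y_i = S_i + R_i$ and calculate:

\[ \overline{\overline{Y_i}} = \overline{\overline{S_i}} + \overline{\overline{R_i}} + 2\,\overline{S_i}\;\overline{R_i} \]
\[ \overline{\overline{\overline{Y_i}}} = \overline{\overline{\overline{S_i}}} + \overline{\overline{\overline{R_i}}} + 3\,\overline{\overline{S_i}}\;\overline{R_i} + 3\,\overline{S_i}\;\overline{\overline{R_i}}. \]

It remains to calculate the second and third moments of $R_i$ and $S_i$, the first moments being as given in the previous subsection. For the uniform random variable $R_i \in [0, R_{MAX}]$, we have

\[ \overline{\overline{R_i}} = R_{MAX}^2/3, \qquad \overline{\overline{\overline{R_i}}} = R_{MAX}^3/4. \]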

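The $n$th moment of $S_i$ is, from the density of Eq. (11),

\[ \frac{2(1-p_s)}{(C-1)^2} \int_0^{C-1} (C-1-x)\,(a + b x^{1/2})^n \, dx, \]

giving after some manipulation:

\[ \overline{\overline{S_i}} = (1-p_s)\left[a^2 + \frac{16}{15}\,ab\sqrt{C-1} + \frac{1}{3}\,b^2 (C-1)\right] \]
\[ \overline{\overline{\overline{S_i}}} = (1-p_s)\left[a^3 + \frac{8}{5}\,a^2 b\sqrt{C-1} + ab^2 (C-1) + \frac{8}{35}\,b^3 (C-1)\sqrt{C-1}\right]. \]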

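Extension to the more general case $X_{ij} = Y_i + K_{ij}t$ is similar, where the first three moments of $K_{ij}$ are estimated by profiling the offered workload and customizing to the RAID level $j$ as in Sections 4.2 and 4.3 for levels 0–1 and 5 respectively. We obtain:

\[ \overline{X_{ij}} = \overline{Y_i} + t\,\overline{K_{ij}} \]
\[ \overline{\overline{X_{ij}}} = \overline{\overline{Y_i}} + t^2\,\overline{\overline{K_{ij}}} + 2t\,\overline{Y_i}\;\overline{K_{ij}} \]
\[ \overline{\overline{\overline{X_{ij}}}} = \overline{\overline{\overline{Y_i}}} + t^3\,\overline{\overline{\overline{K_{ij}}}} + 3t^2\,\overline{Y_i}\;\overline{\overline{K_{ij}}} + 3t\,\overline{\overline{Y_i}}\;\overline{K_{ij}} \]

and in the special cases we consider where $K_{ij} = 1$ with probability one,

\[ \overline{X_{ij}} = \overline{Y_i} + t \]
\[ \overline{\overline{X_{ij}}} = \overline{\overline{Y_i}} + t^2 + 2t\,\overline{Y_i} \]
\[ \overline{\overline{\overline{X_{ij}}}} = \overline{\overline{\overline{Y_i}}} + t^3 + 3t^2\,\overline{Y_i} + 3t\,\overline{\overline{Y_i}}. \]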
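The moment formulae of this subsection can be collected into a small calculator. The sketch below is a hedged illustration: the function names are hypothetical, $p_s$ defaults to $1/C$ as in the text, the numerical inputs are the disk characteristics quoted in Section 5, and the value $t = R_{MAX}/\mathrm{bpt}$ is an assumption made here (one block occupying one twelfth of a track), not a figure stated by the authors.

```python
import math


def seek_moments(a, b, cylinders, p_s=None):
    """First three raw moments of seek time S (closed forms above)."""
    span = cylinders - 1
    root = math.sqrt(span)
    p_s = 1.0 / cylinders if p_s is None else p_s
    m1 = a + 8 * b * root / 15
    m2 = a ** 2 + 16 * a * b * root / 15 + b ** 2 * span / 3
    m3 = (a ** 3 + 8 * a ** 2 * b * root / 5 + a * b ** 2 * span
          + 8 * b ** 3 * span * root / 35)
    return tuple((1 - p_s) * m for m in (m1, m2, m3))


def latency_moments(r_max):
    """First three raw moments of a Uniform[0, RMAX] latency."""
    return r_max / 2, r_max ** 2 / 3, r_max ** 3 / 4


def sum_moments(u, v):
    """Raw moments of U + V for independent U and V (binomial expansion)."""
    u1, u2, u3 = u
    v1, v2, v3 = v
    return (u1 + v1,
            u2 + v2 + 2 * u1 * v1,
            u3 + v3 + 3 * u2 * v1 + 3 * u1 * v2)


def service_moments(a, b, cylinders, r_max, t):
    """Moments of X = S + R + t, i.e. the special case K = 1."""
    y = sum_moments(seek_moments(a, b, cylinders), latency_moments(r_max))
    return sum_moments(y, (t, t ** 2, t ** 3))  # a constant t has moments t^n


t_block = 16.7 / 12  # ASSUMPTION: one block per 1/bpt of a rotation
y = sum_moments(seek_moments(3.0, 0.5, 1200), latency_moments(16.7))
x = service_moments(3.0, 0.5, 1200, 16.7, t_block)
print("mean Y:", y[0], "std dev Y:", math.sqrt(y[1] - y[0] ** 2))
```

The printed mean and standard deviation of $Y$ can be set beside the model column reported later in the paper; small differences in the last digit are to be expected from rounding of the inputs.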

4.2. Mean response time on RAID0-1

4.2.1. One-block read requests
The disk system manager sends the one-block read request to only one disk of the pair concerned (the native data disk or its mirror). The target disk can be chosen according to one of many policies, such as random, shortest queue or smallest seek. In our study, we consider random disk selection. The response time and its average for a read request on disk $i$ among the $N$ comprising the RAID are formulated as:

\[ Z_r(i) = Q_i + Y_i + T + t \tag{13} \]

\[ \overline{Z_r(i)} = \frac{\sum_{j=1,5} \lambda_{ij}\,\overline{\overline{X_{ij}}}}{2(1-\rho_i)} + (1-p_s)\left[a + \frac{8b}{15}\sqrt{C-1}\right] + \frac{R_{MAX}}{2} + T + t. \tag{14} \]

Fig. 3. RAID0-1 read request task graphs.

4.2.2. Multiple-blocks read requests
The response time of a multiple blocks logical read is the maximum of the physical requests’ response times on each disk in the set requested. A logical read is achieved when its last associated physical request is finished, as shown in Fig. 3. On a RAID0-1, the disk system manager can exploit the parallel access to native and mirror disks by choosing either one to read a given block. The response time of such a $B$-blocks read request is therefore estimated by the expression:

\[ Z_r = \max_{i=1}^{k} (Q_i + Y_i) + \mathit{nb\_bloc} \times (T + t) \]

where $\mathit{nb\_bloc} = \lceil B/k \rceil$ is the number of blocks per disk and the number of disks involved in the read is $k = \min(B, N)$.

Notice that, for $N/2 < B < N$, there is still only one block transferred from each of $B$ disks if the right choices between native and mirror disk are made (assuming a sensible disk system manager that dispersed them suitably when writing). We assume that each disk transfers the same number of blocks, the error being less than a one-block transfer time and the relative error approaching zero at large $B$. The mean response time may now be approximated using the method of Section 2.3 by:

\[ \overline{Z_r} = I(k, \alpha, M) + \mathit{nb\_bloc} \times (T + t) \]

where, for $1 \le i \le k$,

\[ \alpha_i = \frac{1}{\overline{Q_i} + \overline{Y_i}}; \qquad M_i = \overline{\overline{Q_i}} + \overline{\overline{Y_i}} + 2\,\overline{Q_i}\;\overline{Y_i}. \]


Fig. 4. RAID0-1 write request tasks graph.

This approximation is exact for exponential response times in the individual queues, as in the very special case of M/M/1 queues. The general shape of the distribution could be expected to approximate an exponential in some systems, there being ‘more small than large blocks’. However, the exponential result (Corollary 1) was found to give serious overestimations and the use of the above approximation caused a dramatic improvement. This is because the coefficients of variation of disk seek time, rotational latency and, hence, queueing time are significantly less than unity. As we saw in Section 2.4, Table 1, the overestimate of the exponential assumption is improved by more than half by the approximation that takes into account the second moment. This approximation is therefore what we used in our empirical studies of Section 5.
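For orientation, the exponential baseline discussed above has a standard closed form: for independent exponential delays with rates $\alpha_1, \dots, \alpha_k$, inclusion-exclusion gives $E[\max] = \sum_{\emptyset \ne S} (-1)^{|S|+1} / \sum_{i \in S} \alpha_i$. The sketch below implements only this textbook identity, which is presumably what the exponential result referred to above expresses; the second-moment approximation $I(k, \alpha, M)$ of Section 2.3 is not reproduced here, and the function name is hypothetical.

```python
from itertools import combinations


def mean_max_exponential(rates):
    """Exact mean of the maximum of independent exponential variables.

    Sums over all non-empty subsets, so the cost grows as 2^k; this is
    meant for small k only.
    """
    total = 0.0
    for size in range(1, len(rates) + 1):
        sign = 1.0 if size % 2 == 1 else -1.0
        for subset in combinations(rates, size):
            total += sign / sum(subset)
    return total


# Two identical unit-rate delays: 1 + 1 - 1/2.
print(mean_max_exponential([1.0, 1.0]))
# Heterogeneous rates, e.g. alpha_i = 1 / (mean response time on disk i).
print(mean_max_exponential([1.0, 2.0]))
```

With identical rates the result reduces to the harmonic number $H_k$ divided by the rate, which grows logarithmically in $k$; this is the growth that overestimates the maximum when the true delays have coefficients of variation below one.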

4.2.3. One-block write requests
The disk system manager sends the one-block write request to both the native data disk and the mirror disk, as shown in Fig. 4. The response time of the one-block logical write is the maximum of the physical requests’ response times on each disk of the pair, i.e.

\[ Z_w = \max_{i=1}^{2} (Q_i + Y_i) + T + t. \]

The average response time is then estimated by:

\[ \overline{Z_w} = I(2, \alpha, M) + T + t. \]

4.2.4. Multiple blocks write requests
According to the execution schedule shown on the task graph (Fig. 4), the response time of a $B$-block write request is:

\[ Z_w = \max_{i=1}^{k} (Q_i + Y_i) + \mathit{nb\_bloc} \times (T + t) \]

with mean value

\[ \overline{Z_w} = I(k, \alpha, M) + \mathit{nb\_bloc} \times (T + t) \]

where $\mathit{nb\_bloc} = \lceil 2B/N \rceil$ and $k = \min(2B, N)$.


Fig. 5. RAID5 read request tasks graph.

4.2.5. Overall mean response time on RAID0-1
The mean response time for both reads and writes depends on the probability $p_{i1}$ that disk $i$ holds a given RAID-1 block and $\lambda_{i1}$, the rate of physical requests generated at such a disk (referring to Table 3). Assuming a homogeneous system, we have $p_{i1} = 1/N$ and, for a logical request size of $B$ blocks,

\[ \lambda_{i1} = \lambda_{R1}\,\bigl(p_r \min(B, N) + p_w \min(2B, N)\bigr)/N. \]

Notice how the min terms correspond to the $k$ used above.

This completes the set of parameters required to run the queueing models and from it we obtain the mean overall response time:

\[ \overline{Z} = p_r\,\overline{Z_r} + p_w\,\overline{Z_w}. \]
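The RAID0-1 bookkeeping of Sections 4.2.2 to 4.2.5 (number of disks $k$, blocks per disk, and per-disk physical arrival rate) is simple enough to state as code. The following sketch uses hypothetical function names and treats $p_r$ and $p_w$ as separate inputs, since their definitions come from Table 3, which is outside this section.

```python
def raid01_fanout(blocks, disks, write):
    """Return (k, nb_bloc) for a B-block logical request on RAID0-1."""
    if write:
        k = min(2 * blocks, disks)            # native and mirror copies
        nb_bloc = -(-2 * blocks // disks)     # integer ceiling of 2B/N
    else:
        k = min(blocks, disks)
        nb_bloc = -(-blocks // k)             # integer ceiling of B/k
    return k, nb_bloc


def raid01_disk_arrival_rate(lam_r1, p_r, p_w, blocks, disks):
    """Physical request rate lambda_i1 at one disk of a homogeneous array."""
    return lam_r1 * (p_r * min(blocks, disks)
                     + p_w * min(2 * blocks, disks)) / disks


print(raid01_fanout(16, 16, write=False))  # reads touch each disk once
print(raid01_fanout(16, 16, write=True))   # writes place two blocks per disk
print(raid01_disk_arrival_rate(1.0, 0.5, 0.5, 1, 16))
```

The two 16-block cases correspond to the full-array requests examined in Section 5, where a write transfers one more block per disk than a read.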

4.3. Mean response time on RAID5

The RAID5 level is fine-grained, splitting disk access evenly across all disks in stripes (one block per disk), each with its own parity block. Thus, a logical write request of $B$ blocks is split into $\lfloor B/(N-1) \rfloor$ ‘full’ stripes, of $N-1$ native data blocks and one parity block each, and one partial stripe of $B\,\%\,(N-1)$ data blocks and one parity block, where the symbol % denotes the ‘modulo’ arithmetic operation. Thus, full stripes cover all the disks (all but one with native data) whereas partial stripes do not. The disk system manager sends each block read request to its uniquely determined data disk, while it sends a write request to the data disk concerned and the corresponding parity update request to the associated parity disk.

4.3.1. Read requests
A read request on RAID5 is performed as on RAID0-1 when the target disk is known; there is no choice on RAID5. There is only one access to each data block concerned, as shown in Fig. 5.

Thus, we use the following expression to calculate the response time for a read request:

\[ Z_r = \max_{i=1}^{k} (Q_i + Y_i) + \mathit{nb\_bloc} \times (T + t). \]

The mean response time is then:

\[ \overline{Z_r} = I(k, \alpha, M) + \mathit{nb\_bloc} \times (T + t) \]

where $\mathit{nb\_bloc} = \lceil B/(N-1) \rceil$ and

\[ k = \begin{cases} B & \text{if } B \le (N-1) \\ N & \text{otherwise.} \end{cases} \]


Fig. 6. RAID5 write request tasks graph.

4.3.2. Write requests
Write requests are handled differently for full and partial stripes since only full stripe writes, which involve all data disks, avoid the pre-read process. Partial stripes can be further subdivided into small stripes, whose native data occupy less than half the disks available for data (i.e. with less than $(N-1)/2$ blocks), and large stripes, whose data cover at least half. The RAID5 write execution schedule is illustrated in Fig. 6.

In order to maintain data coherence, parity is updated according to the rule:

new parity block = old data block ⊕ old parity block ⊕ new data block

where ⊕ denotes the exclusive-OR function. The redundancy disk can be any one of the $N$ disks at each stripe.

Assuming there is no skew, there can be at most one large or small write (on the last stripe) and we distinguish the three forms of stripe writes by the following terms:

\[ cdt_{full} = \begin{cases} 1 & \text{if } B \ge (N-1) > 0 \\ 0 & \text{otherwise} \end{cases} \]

\[ cdt_{lg} = \begin{cases} 1 & \text{if } B\,\%\,(N-1) \ge \dfrac{N-1}{2} \\ 0 & \text{otherwise} \end{cases} \]

\[ cdt_{sm} = \begin{cases} 1 & \text{if } B\,\%\,(N-1) < \dfrac{N-1}{2} \\ 0 & \text{otherwise.} \end{cases} \]
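The three indicator terms can be evaluated mechanically for any request size. The sketch below follows the definitions exactly as written above; the function name is hypothetical. One consequence worth noting, which is an observation about the definitions rather than a claim about the authors' intent, is that a request consisting only of full stripes (remainder zero) still sets the small-stripe indicator.

```python
def stripe_flags(blocks, disks):
    """Return (cdt_full, cdt_lg, cdt_sm) for a B-block RAID5 write."""
    data_disks = disks - 1
    if data_disks < 1:
        raise ValueError("RAID5 needs at least two disks")
    remainder = blocks % data_disks          # data blocks in the partial stripe
    cdt_full = 1 if blocks >= data_disks else 0
    cdt_lg = 1 if remainder >= data_disks / 2 else 0
    cdt_sm = 1 if remainder < data_disks / 2 else 0
    return cdt_full, cdt_lg, cdt_sm


for size in (1, 8, 15, 20):
    print(size, stripe_flags(size, 16))
```

Exactly one of the large and small indicators is set for every request size, so the write response time always selects a single partial-stripe term.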

The response time for a logical write request on RAID5 is then:

\[ Z_w = cdt_{full}\,(Z_{full} + cdt_{lg} Z_{lg+} + cdt_{sm} Z_{sm+}) + (1 - cdt_{full})\,(cdt_{lg} Z_{lg} + cdt_{sm} Z_{sm}). \tag{15} \]

Hence,

\[ \overline{Z_w} = cdt_{full}\,(\overline{Z_{full}} + cdt_{lg}\,\overline{Z_{lg+}} + cdt_{sm}\,\overline{Z_{sm+}}) + (1 - cdt_{full})\,(cdt_{lg}\,\overline{Z_{lg}} + cdt_{sm}\,\overline{Z_{sm}}) \]

where the random variables $Z$ are the following conditional response times:

$Z_{full}$ for the full stripes write.
$Z_{lg}$ for an exclusive large stripe write.
$Z_{sm}$ for an exclusive small stripe write.
$Z_{lg+}$ for a large stripe write after a full one.
$Z_{sm+}$ for a small stripe write after a full one.

Now we consider these terms further.

$Z_{full}$: All disks are accessed and no pre-read is needed because parity is already calculated. The response time for a full stripe write and its mean are:

\[ Z_{full} = \max_{i=1}^{N} (Q_i + Y_i) + \left\lfloor \frac{B}{N-1} \right\rfloor \times (T + t) \]

\[ \overline{Z_{full}} = I(N, \alpha, M) + \left\lfloor \frac{B}{N-1} \right\rfloor \times (T + t). \]

$Z_{lg}$: All disks are accessed once (either for a pre-read or for a write) apart from the parity disk, which is accessed twice (pre-read the old parity and write the new one). The mean response time is:

\[ \overline{Z_{lg}} = \mathit{pre\_read} + R_{MAX} + T \]

where

\[ \mathit{pre\_read} = I(N - k, \alpha, M) + T + t \]

and $k = B\,\%\,(N-1)$.

$Z_{sm}$: Only the $(1 + B)$ disks ($B$ data disks and one parity disk) are accessed but all of them are accessed twice (for the pre-read, then for the write). The mean response time estimate is then:

\[ \overline{Z_{sm}} = \mathit{pre\_read} + R_{MAX} + T \]

where

\[ \mathit{pre\_read} = I(1 + k, \alpha, M) + T + t. \]

$Z_{lg+}$: For a large write following a full stripe write, one block is accessed on every disk (as for $Z_{lg}$) but the parity disk must be accessed twice. However, at the end of the full stripe write, the parity disk’s head is at the right block and so the additional delay is simply:

\[ \overline{Z_{lg+}} = 2T + t + R_{MAX}. \]

We consider that a full rotation time is enough to obtain the new redundancy data to update the parity after the pre-read process.

$Z_{sm+}$: Similarly, for a small write following a full stripe write, one rotation is enough to perform the write and update the parity, so that:

\[ \overline{Z_{sm+}} = 2T + t + R_{MAX}. \]

In a homogeneous system, the probability that a given RAID-5 block is on any disk $i$ is $p_{i5} = 1/N$ and the arrival rate of physical requests to disk $i$ is, for a logical request size of $B$ blocks:

\[ \lambda_{i5} = \lambda_{R5}\,\bigl(p_r\,\kappa_r(B) + p_w\,\kappa_w(B)\bigr)\,(1 + p_w)/N \]

where $\kappa_r(B) = \min(B, N)$ and

\[ \kappa_w(B) = \begin{cases} N & \text{if } 2B \ge N - 1 \\ B + 1 & \text{otherwise.} \end{cases} \]

Remarks
• In RAID5 systems, a disk is also involved when it contains the corresponding parity for the data to be written, which justifies the term $p_w/N$.
• For the small or large writes following a full write, these second accesses have no seek and their rotation latency (which includes $t$, the write or read duration) is the constant $R_{MAX}$, i.e. one rotation.
• Note that the parity block always relates to a full stripe and is calculated so as to minimize the number of operations involved. Thus, for a large write, the non-participating blocks (on less than half the disks) are pre-read and combined with the (known) blocks to be written to form the new parity block. Conversely, for a small write, the new parity can be determined just from the old parity, the new blocks to be written and their previous values, i.e. again no more than $N/2$ disks need be accessed.
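The last remark describes two ways of arriving at the same parity block, and the equivalence is easy to demonstrate with bytewise exclusive-OR. The toy stripe below (five disks, four-byte blocks, arbitrary byte values) and the helper name are illustrative assumptions, not part of the model.

```python
def xor_blocks(*blocks):
    """Bytewise exclusive-OR of equally sized blocks."""
    size = len(blocks[0])
    if any(len(block) != size for block in blocks):
        raise ValueError("all blocks must have the same length")
    out = bytearray(size)
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


# A toy stripe: N = 5 disks, i.e. four data blocks plus one parity block.
old_data = [bytes([d] * 4) for d in (0x11, 0x22, 0x44, 0x88)]
old_parity = xor_blocks(*old_data)

# Small write (block 0 only): pre-read the old data block and the old parity.
new_block0 = bytes([0x5A] * 4)
small_parity = xor_blocks(old_data[0], old_parity, new_block0)

# Large write (blocks 0-2): pre-read only the non-participating block 3
# and combine it with the new blocks, which are already known.
new_blocks = [bytes([d] * 4) for d in (0x01, 0x02, 0x03)]
large_parity = xor_blocks(*new_blocks, old_data[3])

print(small_parity.hex(), large_parity.hex())
```

In both cases the result equals the parity that a full recomputation over the updated stripe would give, while fewer blocks have to be pre-read than a full recomputation would require.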


Fig. 7. Small requests (B = 1), RAID0-1.

5. Results and discussion

In order to validate our model and assess its accuracy, we developed a detailed event-driven simulator. This simulator is written in C and is composed of three main parts. The first part is a logical request generator, which uses standard random number generation functions to produce inter-arrival times for the logical requests with arbitrary probability distributions. The second part is a logical to physical mapping, which contains all the physical request generation functions. This part deals with the different access modes and rates of the physical requests, according to the redundancy (RAID level) associated to the requested storage area. The third part is the simulation engine, which schedules the execution of physical requests on (operational abstractions of) the disks, as specified in Section 3, and manages synchronization. We obtained the hardware parameters from a library, which we separated from the execution routines in order to enhance the flexibility and the scalability of the simulator.

We generated workloads with different mean logical request sizes (measured in blocks of 4 KB each), using sizes of 1, 2, 8 and 16 blocks to represent minimum and small-to-medium requests. It would also be interesting to use bigger sizes (going up to 250 blocks) to represent medium-to-large requests. In fact, the upper bound is around 1 MB for the large requests observed in image applications. We excluded such big requests because applications handling them do not use the RAID levels considered here but RAID3 instead. Concerning the balance between reads and writes in the workload, we generated model inputs with three ratios: 50% of reads for well-balanced read/write workloads, 0% of reads for exclusively write workloads and 100% of reads for exclusively read workloads. Last, for the results presented in this paper, we used an array of 16 disks. The characteristics of the disks we used are: number of cylinders $C = 1200$; full rotation time $R_{MAX} = 16.7$ ms; number of blocks per track (bpt) = 12; acceleration time $a = 3$ ms; seek factor $b = 0.5$ and one block transfer time $T = 1.34$ ms. We chose this parameterization in order to compare our results with those in [5]. Any modifications needed for testing more modern disks are straightforward, and more advanced architectures, e.g. with variable sector sizes according to cylinder, can be handled with an adapted model. Notice that we simulate the physical operation of a real RAID system, not the queueing model abstraction detailed in the previous section. All service times are taken from the operational characteristics of the system, which are modelled explicitly in the simulation and aggregated in the analytical model. To validate our analytical model, we first assumed external Poisson arrivals of the logical requests and then validated this assumption by considering non-exponential inter-arrival times in Section 6.

Simulations were run for a warm-up period of 300,000 logical requests to allow the system to reach a stable state. They were then run for a further 700,000 logical requests during which the measurements concerning response time were gathered. The confidence bands are quite narrow and are omitted here; the simulation runs each took days to generate all the points on each graph on 3 GHz processors. However, the regions where there is any disagreement between the simulation and the analytical model are apparent. Figs. 7–18 compare the mean response time predicted by the analytical model with that obtained by simulation for small-to-medium request sizes under RAID0-1 and RAID5 redundancy levels and for various ratios of read to write disk accesses.


Fig. 8. Small requests (B = 1), RAID5.

Fig. 9. Small requests (B = 2), RAID0-1.

Fig. 10. Small requests (B = 2), RAID5.

Figs. 7 and 8 illustrate the effect of the workload’s read/write ratio in a small-request environment for RAID0-1 and RAID5 respectively; the mean logical request size is one block ($B = 1$). The model and the simulation response times show good agreement. RAID0-1 is clearly superior when writes are considered, as expected with such small requests where the extra complexity of the RAID5 redundancy and striping-based scheme leads to a penalty rather than a benefit. This is the well known RAID5 ‘small write cost’. For 100% reads, the two RAID levels show almost identical performance. Comparing with Figs. 9–12, we can see how the system behaves in small vs. medium request size environments (request sizes 2 and 8 blocks). We deduce that the workload thresholds (above which performance degrades very rapidly to unacceptable levels) decrease considerably with the increase in request size, by a factor a little less than the request size (two and eight blocks). RAID5 is penalized less on workloads with writes, suggesting a ‘cross-over point’ at a higher request size, below which RAID0-1 is the better scheme overall and above which RAID5 gains increasing superiority. In Figs. 13 and 14, we consider request sizes of 16 and 15 blocks, which use all 16 of the disks. For 16-block RAID0-1 reads, every disk is accessed exactly once for one block, on either a native or a mirror disk. For writes, two blocks are written on every native and mirror disk. The queueing times and seek times are the same in each case, but two blocks are transferred to each disk on a write compared with only one being transferred from each disk on a read. The difference amounts to one bus transfer time $T$ plus the small block-transfer time $t$, i.e. a total difference of 2.73 ms. The curves in Fig. 13 therefore appear almost coincident. RAID5 15-block reads access 15 disks, one block on each. RAID5 15-block writes access all 16 disks, one block on each, to include the parity block. In fact, considering such request sizes, for RAID0-1, only the number of blocks transferred changes, according to $p_r$. For RAID5, the full stripe access strategy makes it close to RAID0-1 in terms of response time. Thus, at 15 blocks, requests have almost the same response time with either RAID level but, of course, RAID5 is far more efficient on storage, using half as much.

Fig. 11. Medium requests (B = 8), RAID0-1.

Fig. 12. Medium requests (B = 8), RAID5.

Fig. 13. Full stripe requests (B = 16), RAID0-1.

Fig. 14. Full stripe requests (B = 15), RAID5.

Fig. 15. RAID0-1 = 75%, RAID5 = 25%, small requests (B = 1).

Fig. 16. RAID0-1 = 75%, RAID5 = 25%, medium requests (B = 8).

Fig. 17. RAID0-1 = 25%, RAID5 = 75%, small requests (B = 1).

Fig. 18. RAID0-1 = 25%, RAID5 = 75%, medium requests (B = 8).

The last two pairs of figures, Figs. 15–18, show the effect of the choice of the ratio between RAID partition sizes on the whole storage system’s performance, for request sizes of one and eight blocks. The complementary partition choices (75% of RAID0-1 and 25% of RAID5 in Figs. 15 and 16 against 25% of RAID0-1 and 75% of RAID5 in Figs. 17 and 18) show how response time is dominated by the larger partition. Note that the given fraction of the storage space (the partition size) implies an equal fraction of incoming requests to this partition. That is, the partition workload is proportional to its size.


Table 4
Comparison of moments of service time component Y = S + R

                            Model    Simulation
Mean (1)                    20.58    20.58
Std dev (2)                  6.15     6.12
Cube root of 3rd central     1.91     2.11

In Fig. 17, 75% of the workload is allocated to RAID5, where, because of the small request size (one block), regrouping policies like full/large writes are inefficient. As a result, these costly writes (each one generates four disk accesses) lead to a high response time compared to that obtained in the mixed system of Fig. 15, where the writes are penalized less because only 25% are allocated to RAID5.

6. Sources of approximation

We have already investigated in Section 2.4 one possible source of inaccuracy in our model, namely the mean–max approximation of Section 2.3, which is only exact for parallel exponential delays. We concluded that only for coefficients of variation (ratio of standard deviation to mean) much less than one is the approximation likely to be poor. Fortunately this is the least likely scenario, file access times being notoriously variable, sometimes even having heavy tailed distributions.

However, there are other potential causes for inaccuracies in the model, which we now address: inaccurate approximation for the moments of response time at a single disk, which we considered to be an M/G/1 queue with service times given by particular formulae for seek time and rotational latency, and dependence between these response times. We also investigate the robustness of the assumption of Poisson arrivals by comparing our results with simulations having non-Poisson arrivals.

6.1. Moment estimation at individual disks

First we assess the error introduced in approximating the delay experienced by a physical request at a single disk by the response time of an M/G/1 queue. This is a relatively simple response time to calculate since it excludes any additional delays waiting for the synchronization with parallel requests to complete a ‘join’ operation. However, it is crucial as a component of the set of delays maximized and is itself given by the service time of the M/G/1 queue.

6.1.1. Moments of service time
Service time, $X$, is defined as the sum of seek time, $S$, rotational latency, $R$, and transfer time, $K(T + t)$, where $K$ is the number of blocks transferred. We write $Y = S + R$ and $X = Y + K(T + t)$ to be consistent with previous notation. There is no problem with the precision of the transfer time component since $T$ and $t$ are known constants and $K$ is a control parameter in our experiments. We therefore compared the random variables $S$, $R$ and their sum $Y$ in the analytical and simulation models. These are obviously independent of the arrival rate of logical requests, $\lambda$, and of other workload characteristics such as request size. We use the first three moments of these quantities in our model and so compared their analytically computed values (see Section 4.1.2) with those estimated from simulation runs. This was done by simply dividing the sum of the $i$th powers of the simulated quantity by (one less than) the number of times the simulator generated it, to estimate the $i$th moment, $i = 1, 2, 3$. To isolate the higher order effects and make a consistent comparison, we used central $i$th moments, raised to the power $i^{-1}$, for $i > 1$; this gives the standard deviation for $i = 2$. We obtained the following results, shown in Table 4:

We see that the agreement is excellent for the mean and standard deviation. The cube root of the third central moment is a more sensitive quantity to estimate and we conjecture that the small error we see in it arises from simulation runs that are too short. Of course, any error would only affect second moments of queueing times.
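The estimation procedure described in this subsection can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the mean is divided by $n$ and the central moments by $n - 1$ (the text leaves the exact divisor in parentheses), the cube root preserves the sign of a negative third central moment, and the function name is hypothetical.

```python
import math


def moment_summary(samples):
    """Return (mean, std dev, signed cube root of 3rd central moment)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = sum(samples) / n

    def central_root(order):
        # order-th central moment, then raised to the power 1/order
        moment = sum((x - mean) ** order for x in samples) / (n - 1)
        return math.copysign(abs(moment) ** (1.0 / order), moment)

    return mean, central_root(2), central_root(3)


print(moment_summary([1.0, 2.0, 3.0, 4.0]))   # symmetric data: zero skew term
print(moment_summary([0.0, 0.0, 3.0]))        # right-skewed data
```

Because the third central moment is a difference of large cubed deviations, its estimate converges more slowly than the mean or standard deviation, which is consistent with the conjecture above about run length.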

The question now becomes, how well will response times match at higher loads when they may be composed of many service time random variables?


Table 5
Comparison of moments of seek and latency times

                            Seek time S            Latency R
                            Model    Simulation    Model    Simulation
Mean (1)                    12.23    12.22         8.35     8.35
Std dev (2)                  3.82     3.84         4.82     4.75
Cube root of 3rd central     1.91     2.07         0.20     0.68

Fig. 19. Standard deviation of queueing time at a single disk.

6.1.2. Moments of queueing time at a single disk
To be able to use our approximate mean–max formula of Section 2.3, we require the rate $\alpha$, the reciprocal of $\overline{Q} + \overline{Y}$, and the corresponding second moment $M_2 = \overline{\overline{Q}} + 2\,\overline{Q}\;\overline{Y} + \overline{\overline{Y}}$. This requires the first two moments of the queueing time $Q$, which is given by the first three moments of the service time. The formulae used for the moments of $Q$ require assumptions about the operation of the physical disks (relating to seek times and rotational latency) and rely on standard properties of the M/G/1 queue (see Table 5). Since the mean and standard deviation of service time match very closely in the model and simulation, the mean queue length will also, all assumptions of the M/G/1 queue being satisfied in the simulation. We therefore plotted the graphs of the standard deviation of queueing time at a single disk ($\sqrt{\overline{\overline{Q}} - \overline{Q}^{\,2}}$) against the external arrival rate of logical requests, $\lambda$, for various request sizes and RAID levels. These graphs again showed excellent agreement and one example plot is shown in Fig. 19.

6.2. Dependence of parallel queues

The next concern is the dependence between the queues, caused by the synchronised arrivals of the physical requests spawned by a logical request. We cannot assume that the collection of response times, of which we estimate the mean of the maximum, is independent. For example, if service times are constant and arrivals are synchronized, i.e. always occur simultaneously at each of a set of disks, every disk will behave identically and so all response times will be the same in the maximized set. Thus the maximum response time will be that of a single disk no matter how many disks we have in the RAID system. The same applies for any service time distribution if the service time of each of the synchronized parallel requests is the same. Our mean–max approximation, however, will diverge (logarithmically) as the number of disks increases, giving an infinite error! This is essentially the low-variance situation we considered in Section 2.4, but here it implies that we cannot ignore dependencies between arrivals, even though we would be unlikely to encounter such extreme circumstances in practice because of asynchronous positioning of disk heads and the interleaving of logical requests requiring only subsets of the RAID array, even one disk.

Table 6
Comparison with Erlang

N     Exp             Erlang-2        Erlang-4
      Ind     Dep     Ind     Dep     Ind     Dep
2     1.62    1.62    1.47    1.46    1.35    1.34
4     2.26    2.23    1.90    1.87    1.65    1.62
8     2.95    2.89    2.34    1.29    1.95    1.89
16    3.66    3.58    2.80    2.71    2.25    2.15

Table 7
Comparison with Pareto and constant

N     Pareto-4        Pareto-5        Constant
      Ind     Dep     Ind     Dep     Ind     Dep
2     1.77    1.77    1.73    1.72    1.08    1.04
4     2.61    2.62    2.53    2.51    1.15    1.04
8     3.73    3.68    3.51    3.46    1.28    1.04
16    5.06    4.96    4.66    4.57    1.47    1.04

An analytic assessment would appear to be out of the question, requiring the joint distribution of queue lengths at an arbitrary number of queues for a start, then a multidimensional analysis of response times using either supplementary variables or possibly finding some embedded Markov chain, as in a single M/G/1 queue. Consequently we again use simulation. The following experiments were conducted for various service time distributions $G$, including exponential, Erlang, Pareto and deterministic (constant), with parameters taken from the set used in Section 2.4, all with unit mean:

• For $N = 2, 4, 8, 16$, calculate the mean–max of the response times of the $N$ independent M/G/1 queues (exactly as in our model), each with arrival rate $\lambda$ (for various $\lambda$);
• For the same set of values of $N$, calculate the mean–max of the response times of the same $N$ fully synchronised M/G/1 queues; in other words, there is a single Poisson arrival process with rate $\lambda$ which generates an arrival to every queue at each arrival instant.
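The two experiments listed above can be imitated with a short simulation based on the Lindley recursion for a single FIFO queue. This is a minimal sketch and not the simulator used for the reported tables: the arrival rate 0.5, the run length, the warm-up and the seed are all arbitrary assumptions, the maximum is taken over the $n$th customers of each queue, and the function names are hypothetical.

```python
import random


def response_times(gaps, services):
    """Lindley recursion: gaps[n] is the time between arrivals n-1 and n;
    the first customer always finds the queue empty."""
    previous_response = 0.0
    out = []
    for gap, service in zip(gaps, services):
        wait = max(0.0, previous_response - gap)
        previous_response = wait + service
        out.append(previous_response)
    return out


def mean_max_response(n_queues, lam, draw_service, synchronized,
                      n_arrivals=20_000, warmup=2_000, seed=1):
    """Mean over customers of the maximum response time across queues."""
    rng = random.Random(seed)
    shared_gaps = [rng.expovariate(lam) for _ in range(n_arrivals)]
    per_queue = []
    for _ in range(n_queues):
        if synchronized:
            gaps = shared_gaps  # one Poisson stream feeds every queue
        else:
            gaps = [rng.expovariate(lam) for _ in range(n_arrivals)]
        services = [draw_service(rng) for _ in range(n_arrivals)]
        per_queue.append(response_times(gaps, services))
    maxima = [max(column) for column in zip(*per_queue)][warmup:]
    return sum(maxima) / len(maxima)


def exponential(rng):
    return rng.expovariate(1.0)  # unit-mean exponential service


def constant(rng):
    return 1.0                   # deterministic unit service


for n in (2, 8):
    ind = mean_max_response(n, 0.5, exponential, synchronized=False)
    dep = mean_max_response(n, 0.5, exponential, synchronized=True)
    print(n, round(ind, 2), round(dep, 2))
```

With constant service and synchronised arrivals every queue produces the same response-time sequence, so the mean maximum does not depend on the number of queues; this reproduces the degenerate case described in the text.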

These two scenarios represent the extremes of observable behaviour. In practice, not all disks will be involved for every request, leading to asynchronous behaviour that one would expect to be better approximated by assuming independence.

From Tables 6 and 7 we observe that there is not a great difference between the scenarios except for the higher-order Erlang and deterministic (unit response time) cases, which have small variance (zero in the latter case). This is consistent with our observations in Section 2.4.

6.3. Non-Poisson arrivals

For our final test, we relaxed the Poisson arrival requirement by simply using alternative arrival processes in the simulation. These were parameterized so that we could plot analogous graphs to those of Section 5, using the same set of arrival rates λ. We used the following interarrival time distributions, each with mean interarrival time 1/λ: Erlang-n(nλ) for n = 2, 4; generalized exponential GE(p, pλ) with distribution function $1 - p e^{-p\lambda t}$ for p = 0.5; and the 2-phase Interrupted Poisson Process IPP(Q, 2λ) with generator matrix

$$Q = \begin{pmatrix} -1 & 1 \\ 1 & -1 \end{pmatrix}$$

modulating the two phases in which the arrival rates are 0 and 2λ, again giving average arrival rate λ. The GE distribution gives Poisson arrivals of batches with geometrically distributed size (p = 1 gives unit batches, as in a Poisson process; smaller p gives larger batches) and the IPP gives correlated traffic.
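The three interarrival-time families can be sampled as follows. This is a hedged sketch under stated assumptions rather than the simulator's actual code: the function names are hypothetical, the GE sample is drawn as zero with probability 1 − p (an arrival in the same batch) and exponential with rate pλ otherwise, and the IPP sample starts in the "on" phase, which is where the process sits immediately after an arrival.

```python
import random


def erlang_gap(rng, lam, n):
    """Erlang-n(n*lam) interarrival time: shape n, mean 1/lam."""
    return rng.gammavariate(n, 1.0 / (n * lam))


def ge_gap(rng, lam, p=0.5):
    """GE(p, p*lam): distribution function 1 - p*exp(-p*lam*t), mean 1/lam."""
    # With probability 1 - p the next arrival belongs to the same batch.
    return rng.expovariate(p * lam) if rng.random() < p else 0.0


def ipp_gap(rng, lam, switch_rate=1.0):
    """Interarrival time of a 2-phase IPP with rates 0 (off) and 2*lam (on).

    switch_rate is the off-diagonal entry of the generator Q (1 in the text).
    """
    on_rate = 2.0 * lam
    t = 0.0
    while True:
        # In the on phase an arrival competes with a switch to the off phase.
        t += rng.expovariate(on_rate + switch_rate)
        if rng.random() < on_rate / (on_rate + switch_rate):
            return t
        # Switched off: wait out the silent period, then resume the on phase.
        t += rng.expovariate(switch_rate)
```

Each sampler has mean 1/λ, so the same set of arrival rates can be swept for every distribution; only the burstiness and correlation of the stream change.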

We found mean response time to be fairly insensitive to the particular distribution of inter-arrival time, depending (almost) only on its mean value (i.e. on the arrival rate). We can see from Fig. 20 that the GE and IPP curves show slightly higher response times, which is predictable because of the bursty/batch characteristic of such distributions. However, the difference represents only 2.29% and 7.33% (for GE and IPP respectively) of the response time with Poisson arrivals at high arrival rates. In fact, the Poisson arrival assumption has often been found to be robust, especially for modelling external, user-generated, logical requests because such external traffic is usually composed of a number of low intensity streams that behave independently. The superposition of such sparse streams can be shown to approximate a Poisson process under quite mild assumptions; see for example [14].


Fig. 20. Poisson vs non-Poisson arrival distributions.

7. Conclusion

We have developed a new, efficient approximation to compute the mean duration of certain synchronized fork-join operations. The approximation is exact in the case of exponentially distributed constituent delays, where exact results were also obtained for higher moments and the Laplace transform of the duration’s probability density function itself. Using these results, quite intricate analytical models, based on simple queueing theory, were derived, which take into account the detailed principles of operation of each of the two main RAID storage management systems in popular use. Moreover, we combined the models in order to study a heterogeneous architecture where different RAID levels coexist on the same storage system to provide an efficient exploitation of the storage space.

Analytical results were compared with simulation at a very fine level of abstraction and showed very good agreement at low–medium loads for a range of request sizes and read–write access ratios. In addition, the model predicted the onset of saturation well, i.e. the level of loading above which response time grows rapidly to unacceptable levels whereupon poor quality of service ensues. Apart from this overall comparison, specific assumptions of the analytical model were carefully checked individually and the most serious causes of inaccuracy were identified. We considered

• the accuracy of the mean–max approximation, central to the model;

• the precision in our estimates of seek and rotational latencies, based on standard sub-models, as well as their compounded effect on queueing time;

• the effect of synchronization between parallel fork-join queues.

We concluded that the main causes of inaccuracy were the first and third of these, the effect only being serious when disk service times were fairly consistent and so relatively predictable, i.e. having small variance. This is rarely the case with today’s disk access patterns. We also checked the robustness of our assumption that external requests arrive as a Poisson stream, finding that mean response time is sensitive primarily to just the arrival rate rather than to the particular distribution of inter-arrival time.

In the calculation of the mean of the maximum of an independent set of random variables, the rate and second moment parameters α_i, M_i of Section 2 are (in general) distinct, corresponding to a heterogeneous array of disks. In this study we assumed equal parameters, giving a simple non-recursive result, but it requires a controlled experiment to ensure that all the workload parameters are the same at every disk. In fact, the disk-selection probability for a physical request, p_ij (see Table 3), is particularly sensitive to workload variations and choice of RAID level, influencing the arrival rate at each disk i. An optimization of the general calculation is the subject of work in progress; Eq. (6) provides a starting point when it is possible to group a large collection of disks into subsets which are almost homogeneous in terms of both hardware specification and loading. It is also important to further consider the extent of the error in the mean–maximum calculation. This could be done by further simulation for various constituent distributions, but, especially for low variances, a comparison against exact results in the phase-type case could be carried out, using the method of [3], for example, cf. Section 2.4. Indeed, in small models, the phase-type method itself might be used.
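To illustrate why heterogeneous parameters are more costly than equal ones, the sketch below evaluates the exact mean of the maximum of independent exponential random variables in two ways. It uses the standard inclusion-exclusion identity, which is an assumption of this example and not necessarily the form of the recursive result of Section 2 or of Eq. (6); it also covers only the exponential case and does not use the second moments M_i. The function names are hypothetical.

```python
from itertools import combinations


def mean_max_exponential(rates):
    """Exact E[max] of independent exponentials with the given (distinct or equal) rates.

    Inclusion-exclusion: sum over non-empty subsets S of (-1)^(|S|+1) / sum(S),
    because the minimum over S is exponential with rate sum(S).
    The cost grows as 2^N, which is what makes large heterogeneous arrays awkward.
    """
    total = 0.0
    for size in range(1, len(rates) + 1):
        sign = 1.0 if size % 2 == 1 else -1.0
        for subset in combinations(rates, size):
            total += sign / sum(subset)
    return total


def mean_max_equal_rates(n, rate):
    """Equal-rate special case: E[max] = H_n / rate, with H_n the n-th harmonic number."""
    return sum(1.0 / k for k in range(1, n + 1)) / rate
```

With equal rates the two functions agree, and the closed form needs only N terms; grouping nearly homogeneous disks into subsets, as suggested above, is one way to keep the general calculation tractable.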

Finally, we are extending the study to a dynamic and heterogeneous storage system, dealing with the layout schemes and reconfiguration necessary for a RAID scheme that adapts to its varying offered workload. This work also includes the representation of much larger request sizes. We will then be able to evaluate the overheads of the related data migration and communications.

References

[1] E. Bachmat, J. Schindler, Analysis of methods for scheduling low priority disk drive tasks, in: Proc. ACM Sigmetrics, 2002.
[2] The RAID Advisory Board, The RAIDBOOK: A Source Book for RAID Technology, Lino Lakes MN Publisher, 1993.
[3] H. Bohnenkamp, B. Haverkort, The mean value of the maximum, in: Proc. PAPM/Probmiv 2002, in: Lecture Notes in Computer Science, vol. 2399, Springer-Verlag, 2002, pp. 37–56.
[4] S. Chen, D. Towsley, The design and evaluation of RAID5 and parity striping disk array architecture, Journal of Parallel and Distributed Computing 17 (1993) 58–74.
[5] S. Chen, D. Towsley, A performance evaluation of RAID architecture, IEEE Transactions on Computers 46 (1997).
[6] S. Chen, Design, modeling and evaluation of high performance, Ph.D. Thesis, University of Massachusetts, September 1992.
[7] G. Gibson, D.A. Patterson, R.H. Katz, A case for redundant arrays of inexpensive disks (RAID), in: Proc. SIGMOD Conference, 1988.
[8] G. Gibson, D.A. Patterson, P.M. Chen, R.H. Katz, Introduction to redundant arrays of inexpensive disks (RAID), in: IEEE COMPCON, 1989.
[9] J. Xu, E. Varki, A. Merchant, X. Qiu, An integrated performance model of disk arrays, in: Proc. International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunications Systems, MASCOTS, 2003.
[10] J. Xu, E. Varki, A. Merchant, X. Qiu, Issues and challenges in the performance analysis of real disk arrays, IEEE Transactions on Parallel and Distributed Systems 15 (6) (2004).
[11] C. Han, A. Thomasian, Performance of two disk failure tolerant disk arrays, in: International Symposium on Performance Evaluation of Computer and Telecommunication Systems, 2003.
[12] P.G. Harrison, S. Zertal, Queueing models with maxima of service times, in: Proc. TOOLS Conference, 2003.
[13] P.G. Harrison, S. Zertal, Calibration of a queueing model of RAID systems, in: Proc. Practical Applications of Stochastic Modelling, PASM, 2004.
[14] B.O. Shubert, H.J. Larson, Probabilistic Models in Engineering Sciences, John Wiley, 1979.
[15] Ng.C. Hock, Queueing Modelling Fundamentals, John Wiley, 1996.
[16] E.K. Lee, R.H. Katz, An analytic performance model of disk arrays and its application, Technical Report UCB/CSD 92/660, November 1991.
[17] I. Mitrani, M. Hamilton, P. McKee, Distributed systems with different degrees of multicasting, in: Proc. WOSP 2002, 3rd International Workshop on Software and Performance, 2002.
[18] A. Merchant, P.S. Yu, An analytical model of reconstruction time in mirrored disks, Performance Evaluation 20 (1994) 115–129.
[19] A. Merchant, P.S. Yu, Analytic modeling and comparisons of striping strategies for replicated disk arrays, IEEE Transactions on Computers 44 (3) (1995).
[20] A. Merchant, P.S. Yu, Analytic modeling of clustered RAID with mapping based on nearly random permutation, IEEE Transactions on Computers 45 (3) (1996).
[21] T.M. Wong, J. Wilkes, My cache or yours? Making storage more exclusive, in: Proc. USENIX Ann. Technical Conference, 2002.
[22] S. Zertal, Dynamic redundancy mechanisms for storage customisation on multi disks storage systems, Ph.D. Thesis, University of Versailles, France, January 2000.
[23] S. Zertal, P.G. Harrison, Multi-level RAID storage system modelling, in: Proc. 2003 International Symposium on Performance Evaluation of Computer and Telecommunication Systems, Spects, 2003.

Peter Harrison is currently a professor of computing science at Imperial College, London where he became a lecturer in 1983. He graduated at Christ’s College Cambridge as a Wrangler in Mathematics in 1972 and went on to gain Distinction in Part III of the Mathematical Tripos in 1973, winning the Mayhew prize for Applied Mathematics. He obtained his Ph.D. in Computing Science at Imperial College in 1979. He has researched into stochastic performance modelling and algebraic program transformation for some twenty years, visiting IBM Research Centres during two summers. He has written two books, had over 150 research papers published and held a series of research grants, both national and international. The results of his research have been exploited extensively in industry, forming an integral part of commercial products such as Metron’s Athene Client–Server capacity planning tool. Currently, his main research interests are stochastic process algebra, where he has developed the RCAT methodology for finding separable solutions, response time analysis and optimization of fluid-based models. He has taught a range of subjects at undergraduate and graduate level, including Operating Systems: Theory and Practice, Functional Programming, Parallel Algorithms and Performance Analysis.

Soraya Zertal is a Lecturer in the Architecture and Parallelism research group in the PRiSM Laboratory at the University of Versailles, France. She obtained her Ph.D. in Computing Science from the University of Versailles in 2000, a Master's degree from the University of Versailles in 1996 and an engineering degree from the University of Constantine in 1993. Her research interests include parallel architecture and storage systems, specifically algorithms for data placement and performance modelling using both simulation and analytical methods.