
CPPL: A New Chunk Based Power Proportional Layout with Fast Recovery

Jiangling Yin, Junyao Zhang, Jeffrey W. Chenault, Jun Wang
Department of Electrical Engineering & Computer Science, University of Central Florida, Orlando, Florida 32816–2450

Email: jyin, junyao, [email protected]

ABSTRACT
Recent years have seen an increasing number and size of data centers and cloud storage systems. Two corresponding outcomes are dramatically increasing energy consumption and routine disk failures in these emerging facilities. For simplicity, current power-proportional layout solutions perform dynamic power management using one disk group as the operational unit, where each group consists of a number of disks that contain a complete copy of the entire data set. However, there are two major limitations: 1) Powering a disk group up or down at such a coarse-grained level can involve many unneeded disks, which results in extra power and performance penalties. 2) In the event of a disk failure, a whole standby disk group needs to be spun up to conduct recovery.

In this paper, we develop a new chunk-based power-proportional layout called CPPL to address the aforementioned problems. Our basic idea is to improve on current power-proportional layouts by employing declustering techniques to perform power management at a much finer-grained level. CPPL includes a primary disk group along with a large number of non-primary disks. The primary disk group contains one copy of all available data and is always active to service incoming requests. Other copies of the data are placed on the non-primary disks in a declustered fashion to enable power efficiency and fast recovery at a finer-grained level. Through comprehensive theoretical proof and experiments, we conclude that, compared with current solutions, CPPL can save about 30% energy in degraded mode and achieve a two times higher degree of recovery parallelism at a typical setting.

Keywords
Power-proportionality, recovery parallelism, group declustering, data layout, chunk.

1. INTRODUCTION
Recent years have seen an increasing number and size of data centers and cloud storage systems, which raises the problem of energy consumption, adding substantially to an organization's power bills and carbon footprint. Reducing energy consumption requires increased efficiency and lower power modes for underutilized resources. Energy proportionality was introduced by Barroso and Hölzle [1] for evaluating the efficiency of computer power usage: the power used should be proportional to the amount of work performed, even though the system is provisioned for the peak load.

Achieving power proportionality in a CPU relies on the support of dynamic voltage scaling [1], [2], [3], [4]. However, achieving energy proportionality in a storage system is very difficult, because most current hard drives do not operate in multiple power states. Very few two-rotation-rate hard drives are on the market. It is impossible to finely scale the power consumed by disks, which have few voltage scaling levels, usually only two states: on and off. Thus, the disks continue to draw significant power while CPUs sit idle, up to 50% of the peak power consumption [5]. A feasible alternative for large data centers is to use dynamic server provisioning techniques to turn off unnecessary servers to save energy [6-10]. Unlike a CPU's dynamic provisioning schemes, which disks to power off and which to keep on depends on the specific data layout, because at any point in time all active disks in a storage system need to contain an entire data set to guarantee uninterrupted service to incoming requests.

In recent years, several research efforts developed group-based power-proportional layout schemes for storage systems. Lu et al. [21] introduced a family of energy-efficient disk layouts for simple RAID1 data mirroring. Thereska et al. [11] developed a power-aware grouping data layout. The idea is to divide all disks into groups of equal size; to maintain the power-proportional property, each group contains one copy of the entire data set and within each group the data is stored only once, so turning on any single group can serve any data. Different from the power-aware grouping layout, Amur et al. [12] developed a layout in which the group sizes differ. To approach ideal power proportionality, the disk groups consist of an increasing number of disks: the primary replica group has the minimum number of disks, the second replica group has more than the primary one, the third more than the second, and so on. All of the aforementioned layouts use disk groups as the atomic power management unit for simplicity.


There are two major limitations in group-based power-proportional layout solutions. First, since a whole disk group is always either powered on or off, unneeded disks are often involved, resulting in unnecessary power consumption and performance penalties. Second, the group-based layout usually needs to power on a whole group of disks even for a single failure. Recovering a failed disk is also slow, due to the limited recovery parallelism across groups and the absence of recovery parallelism within a group.

In this paper, we develop a new chunk-based power-proportional layout called CPPL to address the aforementioned problems. Our basic idea is to improve on current power-proportional layouts by employing declustering techniques to perform power management at a much finer-grained level. CPPL includes a primary disk group along with a large number of non-primary disks. The primary disk group contains one copy of all available data and is always active to service incoming requests. Other copies of the data are placed on the non-primary disks in a declustered fashion to enable power efficiency and fast recovery at a finer-grained level. We define a set of theoretical rules to formally study the feasibility of implementing ideal power proportionality in practical data layouts, and its relationship with fast recovery. Through a thorough theoretical proof, we find that an approximate power-proportional layout is a feasible solution; CPPL is one such representative solution. In addition, we study specific CPPL layouts that address the disk load balancing issue by configuring the percentage of overlapping data between primary disks (i.e., disks in the primary replica group, p disks in brief) and non-primary disks (i.e., disks not in the primary replica group, non-p disks in brief). Lastly, we conduct a comprehensive set of experiments on a DiskSim-based framework using both real-world traces and statistical benchmarks. Our experimental results show that, compared with current solutions, CPPL can save about 30% energy in degraded mode and achieve a two times higher degree of recovery parallelism at a typical setting.

2. POWER-PROPORTIONAL STORAGE SYSTEM
Load balancing and fault tolerance [13, 14] are the main concerns in traditional storage systems, and are usually achieved by randomly placing the replicas of each block on a number of the nodes/disks comprising the storage system. Shifted Declustering [15] presented a concrete placement scheme that obtains these properties while keeping the mapping efficient. However, the data is fully distributed, which prevents powering down subsets of disks to save proportional energy without disrupting data availability.

In order to discuss these properties quantitatively, suppose that Q data chunks (a data chunk is the unit of data), each with k replicas, are stored in a storage system running on a data center of n disks. We name the data chunks $C_1, C_2, C_3, \ldots, C_Q$ and the disks $d_1, d_2, \ldots, d_n$. The replicas of each chunk $C_i$ $(1 \le i \le Q)$ are stored on different disks, and we use $C_i^j$ $(1 \le i \le Q,\ 1 \le j \le k)$ to denote the $j$th replica of chunk $C_i$.

2.1 Ideal Power Proportional Service
This section discusses whether the power used can be made proportional to the service performed by the storage system. Since powering down or putting disks in standby means the chunks on those disks are not available for service, and users' requests can be viewed as a total number of chunks retrieved, the service performed can be taken as the number of chunks retrieved from the storage system during a period. The service a disk can perform is determined by two things: the required data must be available on that disk, and the requests on the disk must not exceed its maximum throughput.

Observation 1. It is impossible for a storage system to achieve power proportionality by powering down or idling disks unless each disk stores exactly one data chunk.

Proof: We evaluate power proportionality through the following discussion.

1. The number of chunk replicas available for service is $kQ$ when all of the disks are powered on. During a period, suppose only part of the data is requested and the number of chunks that must be available for retrieval is $x$. Then the proportion of chunks that need to be available is
$\frac{x}{kQ} \quad (0 \le x \le kQ)$.

2. The power used by the disks can be represented by the number of active disks, because a disk has only two states (off/on). We assume that all disks consume the same energy when powered on during a period. Thus the full or maximum power corresponds to $n$, that is, all disks powered on. During a period, suppose not all disks are active and the number of active disks is $y$. Then the power proportion is
$\frac{y}{n} \quad (0 \le y \le n)$.

3. For the power used to be proportional to the service performed, we must have
$\frac{x}{kQ} = \frac{y}{n}$,
that is,
$y = \frac{n \cdot x}{kQ} = \frac{n}{kQ} \cdot x \quad (0 \le x \le kQ,\ 0 \le y \le n)$.

Given a fixed storage system, k, Q and n are constants. Thus the number of active disks $y$ is a linear function of the number of requested chunks $x$, and both x and y must be integers $(0 \le x \le kQ,\ 0 \le y \le n)$:

a. If $\frac{n}{kQ} < 1$, y takes non-integer values and power proportionality cannot be maintained.

b. If $\frac{n}{kQ} > 1$, y cannot take all integer values from 0 to n; moreover this case does not make sense, because $kQ < n$ means there are more disks than chunk replicas.

c. If $\frac{n}{kQ} = 1$, then $kQ = n$ and $x = y$, so both the service and the power can achieve proportionality.

Hence power proportionality is achievable only when $kQ = n$, which means each disk stores exactly one data chunk.

2.2 Fault Tolerance Issue
Fault tolerance is an important property of storage systems. In order to evaluate it quantitatively, we define $\rho_\theta$ as the number of overlapping data chunks shared by any $\theta$ disks out of the n disks. $\rho_\theta$ is an important factor for recovery, because once disks fail, only the disks that share overlapping chunks with them can provide recovery data. For instance, with one disk failure, in order to achieve maximum recovery performance, every other active disk needs to provide some data for recovery; thus the failed disk should have at least some overlapping chunks with every other disk.

2.2.1 Distributed Reconstruction
Lemma 2: If a layout satisfies that any $\beta$ $(2 \le \beta \le n)$ disks contain the same number $\delta_\beta$ of data chunks with distinct chunk IDs, then any $\beta$ $(2 \le \beta \le n)$ disks have the same number $v_\beta$ of overlapping data chunks.

Proof: We map the $\beta$ disks to $\beta$ sets of chunk IDs, recorded as $S = (S_1, S_2, \ldots, S_\beta)$. The statement can be proved

by induction, where $|S_i|$ denotes the number of elements in set $S_i$.

Base case $\beta = 2$: $\delta_2 = |S_1 \cup S_2| = |S_1| + |S_2| - |S_1 \cap S_2|$, so $v_2 = |S_1 \cap S_2| = |S_1| + |S_2| - \delta_2$. Since $|S_1| = |S_2| = \frac{kQ}{n}$ (assuming all disks have the same capacity) and all $\delta_2$ are equal by the precondition, $v_2$ is a constant for any two sets from S.

Inductive step: assume the statement holds for $x-1$ $(2 \le x-1 < \beta)$, that is, $v_{x-1}$ is a constant for any number of sets smaller than $x-1$ taken from S. By the inclusion-exclusion principle,
$\delta_x = \sum_{i=1}^{x} |S_i| - \sum_{1 \le i < j \le x} |S_i \cap S_j| + \sum_{1 \le i < j < l \le x} |S_i \cap S_j \cap S_l| - \cdots + (-1)^{x-1} |S_1 \cap \cdots \cap S_x|$
$= \sum_{i=1}^{x} |S_i| - v_2 \binom{x}{2} + v_3 \binom{x}{3} - \cdots + (-1)^{x-1} v_x$.

By the precondition $\delta_x$ is a constant, which forces $v_x$ to be a constant. The proof is complete.

Theorem 3. If a layout can support parallel recovery for any $\theta$ failed disks, the data chunks must be laid out so that the difference in $\rho_\theta$ is at most 1 for $1 < \theta < n$.

Proof: Suppose the storage system enters a degraded mode in which $x$ $(1 \le x < n)$ disks fail. We rename these x disks $(d_1, d_2, \ldots, d_x)$. The system needs to recover them by requesting the available chunks from the remaining active disks. For parallel recovery, the chunks on the x failed disks should be as diverse as possible, so that more data is available on the active disks for recovery. The number of data chunks with distinct IDs on any $x$ $(1 < x < n)$ disks must be the same if parallel recovery is to be supported for all disks without bias. According to Lemma 2, these x failed disks then share a constant number $v_x$ of overlapping data chunks.


The number of ways to choose the x disks is $\binom{n}{x}$, so the total overlap counted over all such choices is $\binom{n}{x} \cdot v_x$.

In order to fully support fault tolerance, chunks with identical IDs must be stored on different disks. Thus each set of x disks holding x replicas of the same chunk is counted once, and since each chunk has k replicas on k distinct disks, the total count over the Q different data chunks is $Q \cdot \binom{k}{x}$. Equating the two counts,
$\binom{n}{x} \cdot v_x = Q \cdot \binom{k}{x}$,
so
$v_x = Q \cdot \frac{k!}{x!\,(k-x)!} \cdot \frac{x!\,(n-x)!}{n!}$.

Since $\rho_\theta$ must be an integer, we have $\rho_\theta = \lfloor v_x \rfloor$ or $\lceil v_x \rceil$, and $\lceil v_x \rceil - \lfloor v_x \rfloor = 0$ or $1$.

2.2.2 Power-Proportional Reconstruction
In this section, we examine whether a layout can satisfy distributed reconstruction while supporting power proportionality. We specifically show how the overlap of chunks between any two disks changes as disks are powered up. Suppose that m disks are active at time t and that any pair of disks (i and j) among the m share $x_{i,j}$ overlapping chunks. The number of disk pairs among the m active disks is

$\sum_{i=0}^{m-2} \sum_{j=i+1}^{m-1} 1 = \binom{m}{2}$,
and the total overlap summed over all pairs of active disks is $\sum_{i=0}^{m-2} \sum_{j=i+1}^{m-1} x_{i,j}$. According to Theorem 3, all pairwise overlaps are equal, so
$\sum_{i=0}^{m-2} \sum_{j=i+1}^{m-1} x_{i,j} = \binom{m}{2} \cdot x_{i,j}$.

On the other hand, the m active disks store Q chunks, each with $\frac{m}{n} \cdot k$ replicas on average among the active disks, so the total number of replica pairs is
$Q \cdot \binom{\frac{m}{n} \cdot k}{2}$.
Since the storage system should provide high availability, that is, a k-way (k ≥ 2) replication store that tolerates at least (k-1) failures, no two replicas of a chunk may be located on the same disk. Thus the two pair counts must be equal:
$\binom{m}{2} \cdot x_{i,j} = Q \cdot \binom{\frac{m}{n} \cdot k}{2}$,
which gives
$x_{i,j} = \frac{Q \cdot k \cdot (mk - n)}{n^2 (m-1)} = \frac{Q k^2}{n^2} - \frac{Q k}{n^2} \cdot \frac{n-k}{m-1}$.

In the above equation, k, Q and n are constants for a stable storage system, so $x_{i,j}$ is a function of m that increases with m. The example in Figure 1 shows how $x_{i,j}$ changes with m for k = 4, n = 12 and Q = 9; values of $x_{i,j}$ smaller than 0 are taken as 0, because a negative $x_{i,j}$ means no overlapping chunks for any pair of the m active disks. From Figure 1, we find that the overlap between any two disks changes as the number of powered-up disks changes. Specifically, if the disks are numbered 0 to 11 in the example and m = 12 or m = 4, the overlap between disk 0 and disk 1 is about 0.8 or 0.35 at different times, so chunks would need to be redistributed between disk 0 and disk 1, which makes the storage layout unstable.

Figure 1. Overlap chunks shared by any two active disks for k=4, n=12, Q =9
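To make the formula concrete, the short Python sketch below (our own illustration, not part of the paper; the function name is assumed) evaluates $x_{i,j}$ as a function of m for the Figure 1 setting k = 4, n = 12, Q = 9, clipping negative values to 0 as described above. It produces values close to the roughly 0.35 and 0.8 quoted in the text for m = 4 and m = 12.

```python
def pair_overlap(m, n=12, k=4, Q=9):
    """Expected number of overlapping chunks between any two of the m
    active disks, per Section 2.2.2; negative values are clipped to 0."""
    x = Q * k * (m * k - n) / (n ** 2 * (m - 1))
    return max(x, 0.0)

for m in (2, 4, 8, 12):
    print(m, round(pair_overlap(m), 2))   # m=2 -> 0.0, m=4 -> 0.33, m=8 -> 0.71, m=12 -> 0.82
```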

As discussed in the introduction, migrating petabytes of data and switching servers on and off frequently is a major challenge. It is therefore very hard to keep fully distributed reconstruction while achieving power proportionality.

2.3 Group Based Power-Proportional Data-Layout Policy
Group declustering was first introduced to improve the performance of standard mirroring, and was later extended to multi-way replication for high-throughput media server systems [16]. It partitions all disks into several groups, the number of groups being equal to the number of data copies, and each group stores a complete copy of all data. Unlike standard mirroring, the data of each disk in the first group is scattered across all of the disks in the second group. Thereska, Donnelly, and Narayanan [11] proposed the power-aware grouping data policy, which can be implemented through a group declustering scheme.

Figure 2. Three-way replication data layouts by group declustering

Figure 2 demonstrates an example of the group based layout. This layout achieves power proportionality at the group level, since any single group can provide one copy of the data for service. The first problem with this method is that the disks within a group share nothing, so it cannot achieve good reconstruction parallelism according to Theorem 3. For instance, when all disks are powered on during busy periods and disk 4 fails, disks 5 and 6 cannot provide any data to recover disk 4. The second problem is that it violates power proportionality during recovery, because almost all of the disks in another group need to be powered on even when only one disk fails; for instance, if disk 1 in the first group fails, all of the disks in the second or third group must be powered on for recovery. The third problem is that when incoming requests are biased and overload a certain disk, the system may need to power on another entire group of disks to share the workload of the busy disk, because the data on the busy disk is evenly scattered across the disks of the other group.

3. DESIGN OF CHUNK BASED POWER PROPORTIONAL LAYOUT (CPPL)
Our proposed scheme follows the theorems discussed above. First, it keeps the fast recovery property of the storage system discussed in Section 2.2, using combinations of disks to select where to place chunks. Then, we modify the layout scheme to support power proportionality. From the observation in Section 2.1, we know that absolute power proportionality is impossible to achieve. Thus, the design goal of the proposed scheme is to approximate power proportionality as closely as possible, that is, to maximize the efficiency of the additional power used whenever the system has to power on another disk.

Table 1. The placement of 42 chunks onto 7 disks using combination sets (each chunk $C_i$ has two replicas, $C_i^1$ and $C_i^2$, stored on the two disks of its set)

Disks      Chunks        Disks      Chunks
{d1, d2}   C1, C22       {d3, d4}   C12, C33
{d1, d3}   C2, C23       {d3, d5}   C13, C34
{d1, d4}   C3, C24       {d3, d6}   C14, C35
{d1, d5}   C4, C25       {d3, d7}   C15, C36
{d1, d6}   C5, C26       {d4, d5}   C16, C37
{d1, d7}   C6, C27       {d4, d6}   C17, C38
{d2, d3}   C7, C28       {d4, d7}   C18, C39
{d2, d4}   C8, C29       {d5, d6}   C19, C40
{d2, d5}   C9, C30       {d5, d7}   C20, C41
{d2, d6}   C10, C31      {d6, d7}   C21, C42
{d2, d7}   C11, C32

3.1 Chunk Based Data-Layout Scheme
According to Theorem 3, any fixed number of disks should share the same number of overlapping chunks. Since this only makes sense when the number of disks considered is no larger than the number of replicas k (any more than k disks will of course share 0 overlapping chunks), we select each set of k disks as a unit on which to place the k replicas of one chunk. For n disks, there are $\binom{n}{k}$ such sets in total. If $\binom{n}{k} < Q$, the combination sets are reused cyclically to place the data chunks.

To achieve mapping efficiency, given a chunk ID it should be easy to compute the disks that hold its replicas. Enumerating the $\binom{n}{k}$ disk sets in lexicographic order, the ith chunk is stored on the disk set $(x_1, x_2, \ldots, x_k)$ whose lexicographic rank equals $i \bmod \binom{n}{k}$, i.e.,
$i \bmod \binom{n}{k} = \binom{n}{k} - \sum_{j=1}^{k} \binom{n - x_j}{k - j + 1}$.

For instance, if k = 2 and i = 32, solving the above equation gives $(x_1 = 2,\ x_2 = 7)$; that is, the two copies of chunk 32 are stored on disks 2 and 7. Table 1 is a layout example of 7 disks $(d_1, d_2, \ldots, d_7)$ and 42 data chunks $(C_1^i, C_2^i, \ldots, C_{42}^i$ with i = 1 or 2).
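As an illustration (this is our own sketch, not code from the paper), the following Python function unranks $i \bmod \binom{n}{k}$ into the corresponding k-disk combination in lexicographic order, treating a remainder of 0 (i a multiple of $\binom{n}{k}$) as the last combination, a corner case the paper does not spell out. It reproduces the example above and the assignments in Table 1.

```python
from math import comb

def chunk_to_disks(i, n, k):
    """Return the k disks (1-indexed) holding the replicas of chunk i:
    the k-combination of {1..n} whose lexicographic rank is i mod C(n, k)."""
    rank = i % comb(n, k)
    if rank == 0:                       # assumed wrap-around to the last set
        rank = comb(n, k)
    disks, prev = [], 0
    for j in range(1, k + 1):           # choose x_1 < x_2 < ... < x_k
        x = prev + 1
        # C(n-x, k-j) combinations have x as their j-th element (given the
        # chosen prefix); skip whole blocks while the target rank lies beyond
        while comb(n - x, k - j) < rank:
            rank -= comb(n - x, k - j)
            x += 1
        disks.append(x)
        prev = x
    return disks

print(chunk_to_disks(32, 7, 2))   # [2, 7], matching Table 1
print(chunk_to_disks(1, 7, 2))    # [1, 2]: chunk 1 lives on d1 and d2
```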

If the data chunks are placed according to the combination series [17], any two disks share the same number of overlapping data chunks whenever $Q \bmod \binom{n}{k} = 0$. Since the k copies of each chunk are placed on the k disks of one set taken from the combination series, any two disks that appear together in a set share the same number of overlapping chunks. The number of chunks shared through one set is $Q / \binom{n}{k}$, and any two disks have an equal chance of appearing together; in fact, the number of sets in which a given pair of disks appears together is $\binom{n-2}{k-2}$.
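As a quick sanity check (again our own sketch, not from the paper), the snippet below builds the Table 1 placement by assigning chunks round-robin to the lexicographically ordered disk sets and verifies that every pair of disks shares exactly $Q / \binom{n}{k} \cdot \binom{n-2}{k-2} = 2$ chunks for n = 7, k = 2, Q = 42.

```python
from itertools import combinations

n, k, Q = 7, 2, 42
disk_sets = list(combinations(range(1, n + 1), k))          # 21 sets, lexicographic order
place = {c: set(disk_sets[(c - 1) % len(disk_sets)]) for c in range(1, Q + 1)}

overlap = {
    pair: sum(1 for disks in place.values() if set(pair) <= disks)
    for pair in combinations(range(1, n + 1), 2)
}
print(set(overlap.values()))                                  # {2}: uniform overlap
```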

3.2 Approximating Power Proportionality
In Section 2.1, we found that there is no way to achieve perfect or ideal power proportionality, since Q will be far larger than n. Researchers exploit different data layouts to approximate power proportionality at different levels. In practice, the storage system is always required to keep at least one copy of all data chunks active, that is,
$x = Q + \Delta x$.
Suppose a layout keeps p disks as an isolated group, and let $\varphi(\Delta x)$ denote the number of extra disks needed to hold the additional $\Delta x$ chunks. Following the discussion in Section 2.1, the goal is to achieve
$\frac{Q + \Delta x}{kQ} \approx \frac{p + \varphi(\Delta x)}{n}$.
$\Delta x$, $\varphi(\Delta x)$ and p are three related variables, and the key is to find the correlation between $\Delta x$ and $\varphi(\Delta x)$. Generally, $\varphi(\Delta x)$ increases with $\Delta x$: more service requests lead to more active disks. We now discuss how to choose p, $\Delta x$ and $\varphi(\Delta x)$ to approximate power proportionality.

3.2.1 Minimum Power Needed
According to the approximation equation, if $\Delta x$ is zero, only one copy of the chunk data is needed and $\varphi(\Delta x)$ is zero. Thus $\frac{1}{k} \approx \frac{p}{n}$ and $p \approx \frac{n}{k}$. If $\frac{n}{k}$ is not an integer, we use the inequality below to make p smaller and save more power:
$\frac{Q}{p+1} \le \frac{(k-1) \cdot Q}{n - p}$,
that is,
$p = \left\lceil \frac{n-k+1}{k} \right\rceil$.
For example, when n = 10 and k = 3, p is 3 and the number of remaining disks is 7. We use 3 disks to hold one copy of the data and the placement method of Section 3.1 to place the other two copies on the remaining 7 disks (called the 'non-p' group).
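The sizing rule is easy to check numerically; the helper below (a minimal sketch under our own naming, not the authors' code) computes the smallest primary-group size from the inequality above.

```python
from math import ceil

def min_primary_disks(n, k):
    """Smallest p such that p disks can hold one full copy while the
    remaining n - p disks hold the other k - 1 copies:
    Q/(p+1) <= (k-1)Q/(n-p)  <=>  p >= (n - k + 1) / k."""
    return ceil((n - k + 1) / k)

print(min_primary_disks(10, 3))   # 3 -> a 3-disk p group and 7 non-p disks
print(min_primary_disks(15, 3))   # 5 -> matches n/k when n is divisible by k
```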

3.2.2 Maximize Efficiency of Additional Power Usage by Taking Load Balancing into Consideration
Most of the time, the system needs only one copy of the data available for general service. However, at certain times the system may be busy, and we need to



power on extra disks to share the workload on the p disks. Thus, the relationship between $\Delta x$ and $\varphi(\Delta x)$ must be considered. Dynamic provisioning and load dispatching techniques can establish a good relationship between $\Delta x$ and $\varphi(\Delta x)$, but they usually need to migrate petabytes of data, which violates the stability of the storage system. In a group based layout, if the primary group is busy, another group of disks can be powered on to share the service. However, it is sometimes unnecessary to power on an entire group, because only one or two disks are busy during a period; the policy of turning on a whole group of disks therefore does not maximize the efficiency of the additional power usage.

We modify the proposed data layout to maximize the efficiency of additional power usage. Since the data on the p disks has so far had no relationship with that on the non-p disks, some mirror constraints can be imposed between them. Specifically, we map each p disk to the non-p disks with different percentages of overlapping chunks. Record the p disks as $p_1, p_2, \ldots, p_p$, the non-p disks as $d_1, d_2, \ldots, d_{n-p}$, and the overlap percentage between $p_i$ and $d_j$ as $vp_{i,j}$. The choice of percentages is then constrained by the following equation, and the overall value of the $vp_{i,j}$ is fixed once p is selected. In particular, if the data on the primary disks is evenly declustered over the non-p disks, every $vp_{i,j}$ equals $\frac{1}{p}$:
$\frac{1}{k-1} \left( \sum_{j=1}^{n-p} vp_{i,j} \right) \cdot \frac{(k-1) \cdot Q}{n-p} = \frac{Q}{p} \quad (i = 1, 2, \ldots, p)$,
that is,
$\sum_{j=1}^{n-p} vp_{i,j} = \frac{n-p}{p} \quad (i = 1, 2, \ldots, p)$.

Consider an example with n = 10, p = 3 and n - p = 7. According to the above equation, we build a mapping between the p and non-p disks in Table 2; the actual data on $p_1$, $p_2$ and $p_3$ is shown in Table 3. If, during some period, we need an extra disk to share the workload of $p_2$, then $d_3$ or $d_4$ can be powered on, since they share more overlapping chunks with $p_2$. If only one disk is to be powered on to share the workload of all of $p_1$, $p_2$ and $p_3$, then $d_7$ is the choice.

In a specific storage system, with different incoming service distributions, the percentages chosen between p disks and non-p disks also differ: the more biased the incoming service, the larger the percentage that should be chosen. The advantage of this modification is that the mapping with different percentages provides a much more flexible powering-up selection than a group based layout. The energy spent on each disk for service depends on the distribution of incoming requests rather than on the disk's position in a group. Thus, as the workload on each p disk changes, we can power the corresponding disks up or down to make the additional power usage efficient.

Table 2. An example of different overlap percentages between p disks and non-p disks

        p1     p2     p3
d1     2/3    1/6    1/6
d2     2/3    1/6    1/6
d3     1/6    2/3    1/6
d4     1/6    2/3    1/6
d5     1/6    1/6    2/3
d6     1/6    1/6    2/3
d7     1/3    1/3    1/3
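The powering-up choice described above can be expressed in a few lines; the sketch below (our own illustration, assuming the Table 2 percentages and hypothetical function names) picks the non-p disk with the largest overlap with one overloaded p disk, or the non-p disk with the most balanced overlap when all p disks need relief, reproducing the d3/d4 and d7 choices mentioned above.

```python
# vp[d][i]: overlap percentage between non-p disk d and p disk p_{i+1} (Table 2)
VP = {
    "d1": (2/3, 1/6, 1/6), "d2": (2/3, 1/6, 1/6),
    "d3": (1/6, 2/3, 1/6), "d4": (1/6, 2/3, 1/6),
    "d5": (1/6, 1/6, 2/3), "d6": (1/6, 1/6, 2/3),
    "d7": (1/3, 1/3, 1/3),
}

def disk_for_busy_p(i, powered_on=()):
    """Non-p disk to power up when p disk i (0-based) is overloaded."""
    off = {d: v for d, v in VP.items() if d not in powered_on}
    return max(off, key=lambda d: off[d][i])

def disk_for_all_p(powered_on=()):
    """Single non-p disk to power up when every p disk needs relief."""
    off = {d: v for d, v in VP.items() if d not in powered_on}
    return max(off, key=lambda d: min(off[d]))

print(disk_for_busy_p(1))   # d3 (d4 is equivalent): largest overlap with p2
print(disk_for_all_p())     # d7: shares 1/3 of its chunks with every p disk
```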

3.3 Fault Tolerance
This section describes the policy that allows the storage system to remain power proportional during disk failures. Disk failure, especially multi-disk failure, is a hard topic, since many combinations of failures need to be considered. In this paper, we consider disk crash failures, not arbitrary Byzantine failures, and we assume that the failed disks were performing service. If the failed disks are not in service mode, the problem is less critical and recovery can be performed at leisure. As discussed in Section 2.2, the distributed reconstruction property cannot be kept when the system provides proportional service by switching disks on and off, because the overlap between any two disks changes with the number of active disks. In order to keep both power proportionality and distributed reconstruction, we modify the layout and add a recovery policy. The failure recovery process must address the following issues:

Availability: all data must remain immediately accessible, which requires that every chunk be replicated on at least one active disk.

Table 3. The data on the three p disks in the primary group ($C_i^0$ denotes the primary copy of chunk $C_i$)

disk p1: C1^0 C2^0 C3^0 C4^0 C5^0 C6^0 C7^0 C8^0 C9^0 C10^0 C11^0 C22^0 C27^0 C32^0
disk p2: C12^0 C13^0 C14^0 C15^0 C16^0 C17^0 C18^0 C23^0 C24^0 C28^0 C29^0 C33^0 C36^0 C39^0
disk p3: C19^0 C20^0 C21^0 C25^0 C26^0 C30^0 C31^0 C34^0 C35^0 C37^0 C38^0 C40^0 C41^0 C42^0

Load balancing: the ongoing service of the failed disks should be rescheduled to other active disks. Also, to recover the data of the failed disks, the recovery load placed on the active disks should be balanced as well.

We discuss the recovery activity by answering the following questions:

1. When a failure occurs and extra disks need to be powered on to preserve the user experience, how do we select the disks for recovery?

2. How should the workload of the failed disks be distributed over the remaining active disks?

Because of the failure, the storage system must handle both the ongoing service of the failed disks and the recovery workload for the failed disks. According to the discussion in Section 3.2, the ongoing service of the failed disks is at most the full workload those disks could take, and recovery of the failed disks is performed at the same time. Thus, the number of extra disks to power up is determined by the required recovery speed and the ongoing service of the failed disks. Since the disks with a larger overlap percentage with the failed disks can directly take over more of their workload, the powering-up order for recovery is chosen according to the overlap percentages. We answer the second question by adding a scheduling policy for each specific failure situation.

3.3.1 Single Disk Failure Handling
When only one disk fails, suppose some extra disks should be powered up. The ongoing and recovery workload of the failed disk is rescheduled as follows.

If the failed disk is a p disk: the extra powered-up non-p disks plus the remaining active disks must ensure that every chunk is replicated on at least one active disk. The ongoing service of each active p disk can be rescheduled to the extra powered-up disks (this is possible because the data on each p disk is declustered over the non-p disks with different percentages). Then, the ongoing service of the disks with a high overlap percentage can be rescheduled to the active p disks, so the workload of each p disk is unchanged. The disks with a larger overlap percentage with the failed disk are used for recovery and take over the ongoing service of the failed disk.

If the failed disk is a non-p disk: the failed disk can be recovered by all of the p disks, while the equivalent workload of the p disks is rescheduled to the extra powered-up disks.

Our proposed data layout declusters the data of each p disk over all of the non-p disks with different percentages. Thus, if the overall workload in the system is low, the number of extra disks to power up can be small even if the failed disk is a p disk. For instance, in the previous example, a failed p disk causes only two disks to be powered on, whereas in a group based layout every disk of another group would be powered on, since the data of the failed disk is distributed across that entire group.

3.3.2 Multi-Disk Failure Handling
Multi-disk failures are more complicated. We still consider both the ongoing service of the failed disks and their recovery.

All of the failed disks are p disks: the extra powered-up disks plus the active disks must ensure that every chunk is replicated on at least one active disk. The disks holding high overlap percentages with the failed disks are considered first for powering up. Part of the ongoing service of each active p disk is rescheduled to the extra powered-up disks; then the ongoing service of the mapping disks that have higher percentages with the failed disks is rescheduled to the p disks, so the workload of each p disk does not change. The disks with high percentages with the failed disks are used for recovery and take over the ongoing workload of the failed disks.

All of the failed disks are non-p disks: the failed disks can obtain the available data from each p disk, while the corresponding equivalent workload of the p disks is rescheduled to the extra powered-up disks if needed.

The failed disks include both p and non-p disks: if extra disks need to be powered on, the disks holding high overlap percentages with the failed disks are considered first. The failed non-p disks obtain the available data from each active p disk, while the corresponding equivalent workloads of the p disks are rescheduled to the extra powered-up disks if needed. The high-overlap disks take part in recovery as well as in the ongoing service of the failed p disks, and their own workloads are rescheduled to other active disks.

The failed disks include both p and non-p disks, but among the f failed disks there are k disks that share overlapping chunks, i.e., all k replicas of some chunk are lost: the failure is not recoverable.


These policies perform better during recovery because our proposed data layout declusters the data of each p disk over all of the non-p disks with different percentages, and the non-p disks share the same number of overlapping chunks, which supports parallel recovery. Thus all disks are intertwined through overlapping chunks. As discussed in Section 2.2, the number of overlapping chunks determines the recovery speed. In fact, the maximum recovery degree of CPPL is n - p, while that of the group based layout is n/k [11]. At a representative setting with n = 10, p = 3, k = 3, CPPL achieves a more than 2 times higher degree of recovery parallelism than the group based layout.

4. EXPERIMENTAL RESULTS
Simulations on DiskSim [18] are used to demonstrate the performance of the chunk based layout for a multi-way replication architecture. We implemented the address mapping algorithms for CPPL and for the group based layout, called power-aware grouping [11], and compared their performance through service-request-driven simulations. The simulated architecture in DiskSim is shown in Figure 3: a trace generator or an input trace sits at the top layer; the gray (green) boxes represent existing DiskSim modules, and the white boxes represent modules added to implement our mapping algorithms, the load monitor and the scheduling policy.

Figure 3. DiskSim simulation architecture

In the main comparison experiments, we use 15 disks, because the group declustering layout requires the number of disks to be a multiple of the number of copies. We pick three-way replication since it is the most common case in multi-way replication based architectures. In our experiments, we mainly consider read workloads. For write workloads, we can employ a write off-loading policy [22], which writes to any available server and corrects the layout later; write off-loading was originally used to avoid bottlenecks caused by overload in an enterprise storage system. Thus, if a disk is not active, we treat its writes as off-loaded and update it later. The service workloads are synthetic: a set of simulated clients generates independent logical requests with a uniform distribution. In the simulations, we track the maximum workload in terms of I/Os per second (IOPS) on each disk. The random requests of each simulated client are produced by a linear congruential generator (LCG) [19].

4.1 Proportional Service Performance
Without considering power proportionality, load balancing for service is the desirable property, defined as all n disks being accessed in parallel for any incoming service. The power saving property, however, asks for fewer disks to be used when doing so does not affect the users' experience. In normal situations, the p disks are powered on to respond to requests. Over time, if the total requests allocated to a p disk exceed its maximum workload, another disk must be powered on to share that disk's load. We use Shortest Queue First (SQF) to assign requests to active disks: the algorithm always tries to assign a new request to the active disk with the lowest workload; if the requested data is not on that disk, it continues searching until the request can be served by an active disk. If the request cannot be assigned to any active disk, more disks are activated. The power-aware grouping layout powers up an entire group of disks when the currently active disks cannot satisfy the requested service. For the chunk based layout, the powering-up order first considers the disks that best balance the load with the most heavily loaded p disks, as discussed in Section 3.2. We also add an ideal power-proportional metric (Ideal PP for short) with respect to workload, according to Section 2.1.
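For concreteness, here is a minimal sketch of the SQF dispatch just described (our own illustration, not the simulator code; names such as `queue_len` and `placement` are assumed): each request goes to the least-loaded active disk that holds the data, and a return value of None signals that another disk must be powered up.

```python
def sqf_assign(chunk, active_disks, queue_len, placement, max_iops):
    """Shortest Queue First dispatch.
    active_disks: iterable of disk ids currently powered on
    queue_len:    dict disk id -> outstanding IOPS on that disk
    placement:    dict chunk id -> set of disk ids holding a replica
    Returns the chosen disk id, or None if more disks must be activated."""
    for disk in sorted(active_disks, key=lambda d: queue_len[d]):
        if disk in placement[chunk] and queue_len[disk] < max_iops:
            queue_len[disk] += 1          # enqueue the request on this disk
            return disk
    return None                            # no active disk can serve it
```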

Figure 4. The power performance comparison for Financial I trace: Failure free mode

With numerous requests under different workloads, we found that the chunk based layout saves more energy than the group based layout without affecting the users' experience. Figure 4 shows the power usage for the OLTP I/O trace named Financial I, a typical storage I/O trace from the UMass Trace Repository [23]. Since any real trace has its own bias, we also use statistics to evaluate the performance under a variety of workloads in Figure 5: for each workload, 100 different request cases are run, and the diagrams show the average power used as a percentage of the power consumed when all disks are powered on. The power usage curve of the chunk based layout lies below that of power-aware grouping, so the policy of powering up an entire group of disks uses more power. We also found that much of the time the active disks are not used efficiently by PowerAwareGrouping, because another entire group of disks is powered up even when only one or two disks are needed. These results agree with our design principle: when only a few disks are overloaded, a smaller number of disks are powered on to share the workload.

Figure 5. The power usage: failure-free mode

4.2 Degraded Mode Performance
In this section, we demonstrate the energy performance when the system enters degraded mode. As discussed earlier, the system must handle both the ongoing service of the failed disks and the recovery workload for the failed disks. For both layouts, we apply the SQF algorithm to handle this workload without affecting the users' experience: the algorithm always tries to assign the workload of the failed disks to the active disk with the lowest workload, and if it fails to do so, more disks are activated. The power-aware grouping layout powers on an entire group of disks when more disks need to be activated. For CPPL, the powering-up order is determined by the recovery policy proposed in Section 3.3.

4.2.1 One-Disk Failure
To show the difference in energy performance with and without failure, the system first performs the request service without failure. Then the system serves the requests again, and during the run one active disk is randomly chosen to fail. Figures 6 and 7 show the power usage of each layout under failure-free and single-failure modes with different workloads. The full workload is one that can be handled by (n-1) disks. From the figures, we find that the average power used by both layouts increases with the workload. For the group based layout, the power usage is double that of the primary disks even under light workload, meaning that a whole group of disks must be powered on, since the data of the failed disk is distributed across that group. For CPPL, only a small number of disks are powered on when the failure occurs. Figure 8 shows the power usage under degraded mode: more than 30% of the power can be saved by the chunk based layout. With a workload larger than 85%, all disks are powered on under a one-disk failure.

4.2.2 Two-Disk Failure
The simulation process is the same as for the one-disk failure case. The system first performs the request service without failure; then, while serving the requests, two active disks are randomly chosen to fail. Figures 9 and 10 show the power


usage of each layout under failure-free and two-disk failure modes with different workloads. Here the full workload is one that can be handled by (n-2) disks, because two disks are supposed to fail. The average power used by both layouts increases with the workload. For the group based layout, the power usage is still double that of the primary disks even under light workload, meaning that a whole group of disks needs to be powered on even when a disk in the primary group fails.

Figure 6. The power usage of CPPL: failure-free mode vs. degraded mode

Figure 7. The power usage of the PowerAwareGrouping layout: failure-free mode vs. degraded mode

Figure 8. The power usage: degraded mode

Figure 9. The power usage of CPPL: failure-free mode vs. degraded mode (two disk failures)

Figure 10. The power usage of PowerAwareGrouping: failure-free mode vs. degraded mode (two disk failures)

Figure 11. The power usage: degraded mode

Figure 11 shows the power usage of both layouts under degraded mode. The power used by the chunk based layout is less than that of the group based layout. However, a smaller percentage of power is saved than in the one-disk failure case, since more disks must be powered on for recovery when more disks fail. At around a 70% workload, every disk is powered on under two disk failures.

5. CONCLUSION
In this paper, we have explored data placement layouts for multi-way replication based storage architectures. We conducted a complete theoretical analysis of the characteristics of an ideal layout that can support power proportionality, and proposed a novel chunk based placement layout. The proposed scheme approximates power proportionality while keeping the desired property of parallel recovery. Compared with group based layout schemes, CPPL saves about 30% energy in degraded mode and achieves a two times higher degree of recovery parallelism at a typical setting; comprehensive simulation results show that our proposed scheme obtains much better power savings and recovery quality without affecting the user's experience.

REFERENCES
[1] Barroso, L.A. and U. Hölzle, The Case for Energy-Proportional Computing. Computer, 2007. 40(12): p. 33-37.

[2] Keqin, L., Performance Analysis of Power-Aware Task Scheduling Algorithms on Multiprocessor Computers with Dynamic Voltage and Speed. Parallel and Distributed Systems, IEEE Transactions on, 2008. 19(11): p. 1484-1497.

[3] Luna Mingyi, Z., L. Keqin, and Z. Yan-Qing. Green Task Scheduling Algorithms with Speeds Optimization on Heterogeneous Cloud Servers. in Green Computing and Communications (GreenCom), 2010 IEEE/ACM Int'l Conference


on & Int'l Conference on Cyber, Physical and Social Computing (CPSCom). 2010.

[4] Li, K. Design and Analysis of Heuristic Algorithms for Power-Aware Scheduling of Precedence Constrained Tasks. in Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW), 2011 IEEE International Symposium on. 2011.

[5] Fan, X., W.-D. Weber, and L.A. Barroso, Power provisioning for a warehouse-sized computer.SIGARCH Comput. Archit. News, 2007. 35(2): p. 13-23.

[6] Pinheiro, E., et al., Dynamic cluster reconfiguration for power and performance, in Compilers and operating systems for low power. 2003, Kluwer Academic Publishers. p. 75-93.

[7] Chase, J.S., et al., Managing energy and server resources in hosting centers. SIGOPS Oper. Syst. Rev., 2001. 35(5): p. 103-116.

[8] Doyle, R.P., et al., Model-based resource provisioning in a web service utility, in Proceedings of the 4th conference on USENIX Symposium on Internet Technologies and Systems - Volume 4. 2003, USENIX Association: Seattle, WA.

[9] Chen, G., et al., Energy-aware server provisioning and load dispatching for connection-intensive internet services, in Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation. 2008, USENIX Association: San Francisco, California.

[10] Chen, L.T. and D. Rotem, Optimal response time retrieval of replicated data (extended abstract), in Proceedings of the thirteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems. 1994, ACM: Minneapolis, Minnesota, United States.

[11] Clark, C., et al., Live migration of virtual machines, in Proceedings of the 2nd conference on Symposium on Networked Systems Design \& Implementation - Volume 2. 2005, USENIX Association.

[12] Thereska, E., A. Donnelly, and D. Narayanan, Sierra: practical power-proportionality for data center storage, in Proceedings of the sixth conference on Computer systems, ACM: Salzburg, Austria.

[13] Amur, H., et al., Robust and flexible power-proportional storage, in Proceedings of the 1st ACM symposium on Cloud computing, ACM: Indianapolis, Indiana, USA.

[14] Alvarez, G.A., et al., Declustered disk array architectures with optimal and near-optimal parallelism, in Proceedings of the 25th annual international symposium on Computer architecture. 1998, IEEE Computer Society: Barcelona, Spain.

[15] Holland, M. and G.A. Gibson, Parity declustering for continuous operation in redundant disk arrays. SIGPLAN Not., 1992. 27(9): p. 23-35.

[16] Zhu, H., P. Gu, and J. Wang, Shifted declustering: a placement-ideal layout scheme for multi-way replication storage architecture, in Proceedings of the 22nd annual international conference on Supercomputing. 2008, ACM: Island of Kos, Greece.

[17] Chen, M.-S., et al., Using rotational mirrored declustering for replica placement in a disk-array-based video server. Multimedia Syst., 1997. 5(6): p. 371-379.

[18] Graham, R.L., D.E. Knuth, and O. Patashnik, Concrete Mathematics. 1989, Massachusetts: Addison-Wesley.

[19] The disksim simulation environment (v4.0). http://www.pdl.cmu.edu/DiskSim/.

[20] Knuth, D., The Art of Computer Programming. Vol. 2. 1997.

[21] Lu, L., Varman, P., and Wang, J. DiskGroup: Energy Efficient Disk Layout for RAID1 Systems. The 2007 IEEE International Conference on Networking, Architecture, and Storage (NAS'07).

[22] Dushyanth Narayanan, Austin Donnelly, Eno Thereska, Sameh Elnikety, and Antony Rowstron. Everest: Scaling down peak loads through I/O off-loading. In OSDI '08: Proceedings of the 8th USENIX conference on Operating Systems Design and Implementation, 2008.

[23] UMass Trace Repository. http://traces.cs.umass.edu/index.php/Storage/Storage.