
A request skew aware heterogeneous distributed storage system based on Cassandra

Zhen Ye, Shanping Li Department of Computer Science and Technology

Zhejiang University Hangzhou, China

{yezhen, shan}@cs.zju.edu.cn

Abstract—Many distributed storage systems have been proposed to provide high scalability and high availability for modern web applications. However, most of these systems are only aware of data skew, while request skew is also widespread and needs to be considered as well. In this paper, we present a request skew aware heterogeneous distributed storage system based on Cassandra, a well-known NoSQL database that aims to manage very large scale data without a single point of failure. We improve Cassandra in two ways: 1) we minimize the forward request load by dynamically shifting the node the client application connects to toward the node that can handle the largest number of skewed requests locally; 2) when balancing the data load among the nodes in the cluster, we take each node's storage capacity into consideration. Our experimental results show that approach 1) reduces forward read requests by about 25% and forward write requests by about 15%, and that approach 2) noticeably balances the storage utilization of each node.

Keywords: Distributed Storage System; NoSQL Database; Request Skew; Heterogeneous environment

I. INTRODUCTION

Modern web applications often have to deal with very large scale data sets. For most of these applications, the characteristics they want most are high scalability, high availability, and the ability to respond quickly even when there are hundreds of thousands of concurrent requests. Compared with these, strong data consistency and strict transaction support, such as ACID, can sometimes be weakened or even dropped. Traditional relational databases are clearly not suitable for serving these kinds of applications, since most of them are transaction based and are hard to scale to very large sizes, or can do so only at very high cost; in addition, relational databases provide a large feature set, much of which will never be used and which only adds cost and complexity [6, 7].

In the past few years, many scalable NoSQL distributed storage systems have therefore been proposed, e.g. Google's Bigtable [2], Amazon's Dynamo [3] and Yahoo!'s PNUTS [4]. According to the CAP theorem [8], a distributed system cannot achieve high availability while still maintaining strong consistency. For this reason, most of these systems relax strong consistency and strict transactions to gain high scalability and availability; as a result, they can scale out dynamically to Internet size, are highly resilient to node or network failures, and serve massive access well. Data partitioning and data replication are the two techniques these systems typically use to achieve these goals. Data partitioning improves scalability and performance, while data replication is a good way to achieve high availability and balance the load.

As we know, most Internet-scale applications exhibit highly skewed workloads, including data skew and request skew, which the system should distribute evenly across its nodes in order to improve usability. However, data skew and request skew may conflict with each other: sometimes the data is already distributed evenly while the request load is still skewed toward one or a few nodes, or vice versa. Although most systems claim to be aware of these skews, many of them actually only care about balancing data across different nodes. Balancing both request skew and data skew is still a challenging issue in many systems. In addition, when balancing data, many systems assume all nodes are identical, but in a real distributed environment different nodes may have different capacities, which should be taken into consideration when designing the system.

In this paper, we present a request skew aware heterogeneous distributed storage system based on Cassandra [5]. Cassandra is a NoSQL database for managing very large amounts of data spread across many commodity servers, while providing highly available service with no single point of failure. However, Cassandra only provides a data skew aware solution and assumes there is no request skew. It also assumes that all nodes have the same storage capacity. We improve Cassandra in two ways: 1) we minimize the forward request load by dynamically changing the node the client application connects to, choosing the one that can handle the largest number of skewed requests locally; 2) when balancing the storage load among the nodes in the cluster, we take their storage capacities into consideration to maximize utilization.

The remainder of the paper is organized as follows. Section II introduces related work. Section III presents the background of Cassandra and describes how we improve it. Section IV presents our experiments. Finally, we present conclusions and future work in Section V.

II. RELATED WORK

Bigtable, Dynamo and PNUTS are all widely studied and cited in the distributed storage system domain. Bigtable is implemented as a sparse, multidimensional sorted map, which is richer than a key-value data model while still simple enough. It uses the Google File System (GFS) [1] to store data and log information. GFS divides files into fixed-size chunks and balances the data load by distributing those chunks evenly across different nodes. However, GFS uses a primary-copy, pessimistic algorithm to synchronize data between replica nodes, which makes it scale and perform poorly in write-intensive and wide-area scenarios.

Dynamo is a highly available, eventually consistent key-value storage system that uses consistent hashing to increase scalability, vector clocks for reconciliation, and sloppy quorum with hinted handoff to handle temporary failures. In Dynamo, all nodes and the keys of data items are hashed, and the resulting values are mapped onto a "ring". A node's hash value represents its position on the ring, while a key's hash value decides which node the item will be stored on. In order to balance the data load, Dynamo maps one physical node to many virtual nodes, each occupying one position on the ring; different nodes may have different numbers of virtual nodes, based on their capacity.
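To make the ring construction concrete, the following is a minimal Java sketch of consistent hashing with virtual nodes in the Dynamo style; the hash choice (CRC32), the class name HashRing and its methods are illustrative assumptions and do not reproduce Dynamo's actual implementation.

    import java.util.SortedMap;
    import java.util.TreeMap;
    import java.util.zip.CRC32;

    // Sketch of consistent hashing with virtual nodes, Dynamo style.
    public class HashRing {
        private final SortedMap<Long, String> ring = new TreeMap<>();

        private static long hash(String s) {
            CRC32 crc = new CRC32();
            crc.update(s.getBytes());
            return crc.getValue();
        }

        // A node with higher capacity is given more virtual nodes (ring positions).
        public void addNode(String node, int virtualNodes) {
            for (int i = 0; i < virtualNodes; i++) {
                ring.put(hash(node + "#" + i), node);
            }
        }

        // A key is stored on the first node met walking clockwise from its hash.
        // Assumes at least one node has been added.
        public String nodeFor(String key) {
            long h = hash(key);
            SortedMap<Long, String> tail = ring.tailMap(h);
            return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
        }
    }

Giving a higher-capacity node more virtual positions lets it own a proportionally larger share of the key space, which is how Dynamo accounts for heterogeneous nodes.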

PNUTS, a geographically distributed database system, uses pub/sub messaging to ensure the ordering of updates for a given key and provides per-record timeline consistency, which lies between strict serializability and eventual consistency. Similar to Bigtable, PNUTS uses a centralized router to look up the right node for a given key and divides data into many fixed-size tablets. It can move tablets from overloaded nodes to lightly loaded nodes to balance the data load, and it dynamically changes the node the client connects to in order to reduce forwarded requests.

ecStore [9] is a cloud-based elastic storage system that supports automatic data partitioning and replication, load balancing, efficient range queries and transactional access. It uses a stratum architecture: a BATON-tree-based data partition as the bottom layer to provide high scalability, two-tier load-adaptive replication as the middle layer to balance load and provide high availability, and multi-version optimistic concurrency control as the top transaction layer to provide data consistency. It uses data partitioning to address data skew and addresses request skew by adding secondary replicas for hotspot data. ecStore uses primary-copy optimistic replication and provides adaptive read consistency. However, if the read consistency value is set equal to the number of replicas, meaning the system does not respond to the client until the data has been updated on all replicas, the result is the same as primary-copy pessimistic replication and may cause poor performance; otherwise, stale data may be returned when a recent update has not yet been synchronized to the nodes being accessed.

S. Bianchi et al. [10] study the load of a P2P system under biased request workloads. They observe that such systems show a heavy lookup traffic load, as well as load on the intermediate nodes responsible for forwarding accesses to the target node. Based on this, the authors propose reorganizing routing tables to reduce the forwarded request load and using caches and data replicas to reduce the local request load, thereby balancing the traffic load. Since hundreds of nodes are needed for the evaluation, the authors only performed simulation experiments.

M. Abdallah et al. [11] propose a load balancing mechanism that takes into account both data popularity and node heterogeneity. They define the load measure as a function of the number of queries issued on data items per time frame, and then propose a mechanism that balances the system's load by adjusting the DHT structure so that it best captures query load distributions and node heterogeneity. However, this system does not consider data replication. It also assumes that the forwarded request and the response message impose the same load, which may not be correct.

III. SYSTEM DESIGN

Cassandra is inspired by Bigtable and Dynamo: it integrates Bigtable's column-family-based data model and Dynamo's eventual consistency behavior, and thus gets the advantages of both.

As in Dynamo, Cassandra uses consistent hashing to partition and distribute data across different nodes by hashing both node identifiers and data keys onto a "ring". To improve availability and balance the load, each data item is replicated to N nodes, where N is the replication factor, which can be configured in advance. First, each data item is assigned a Coordinator Node, which is the first node the item meets when walking clockwise around the ring; the item is then replicated to the next N-1 clockwise successor nodes on the ring. To trade off between strong consistency and high performance, Cassandra provides different consistency level options for both read and write operations. For writes, Consistency.One means the operation is only routed to the closest replica node; Consistency.Quorum means the system routes the request to a quorum, usually N/2+1, of nodes and waits for their responses; Consistency.All means it routes the request to all N replica nodes and waits for all of their responses. For reads, the operation is routed to all replicas, but the system only waits for a specific number of responses; the others are received and handled asynchronously. For Consistency.One this number is 1; for Consistency.Quorum it is N/2+1; and for Consistency.All it is N.
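As a quick reference, the sketch below computes how many replica responses the node serving the request blocks on for each consistency level; the class and enum names are our own illustration and are not Cassandra 0.6's actual classes.

    // Number of replica responses to wait for (blockNumber), given the
    // replication factor N and the requested consistency level.
    public final class ConsistencyMath {
        public enum ConsistencyLevel { ONE, QUORUM, ALL }

        public static int blockNumber(ConsistencyLevel level, int replicationFactor) {
            switch (level) {
                case ONE:    return 1;                          // wait for the closest replica only
                case QUORUM: return replicationFactor / 2 + 1;  // e.g. 2 out of N = 3
                case ALL:    return replicationFactor;          // wait for every replica
                default:     throw new IllegalArgumentException("unknown level");
            }
        }
    }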

In the rest of this section we describe how we improve Cassandra.

A. Minimize forward request load

In order to use a Cassandra cluster, the client application needs to connect to a node within the cluster. We call this node the Connected Node. When the client reads or writes a data item, if the Connected Node is not one of the replica nodes responsible for that item, it has to forward the request to one or more nodes, wait for their responses, and finally reply to the client. Since most applications exhibit request skew, the total forward request load differs depending on which node is chosen as the Connected Node. However, the client usually does not know in advance which node has the least forward request load. Even if the client initially connects to the best node, the users' access pattern will change over time, and thus the hot-spot data will also change. For these two reasons, we need to change this configuration dynamically to minimize the forward request load.

In Cassandra, a Connected Node encounters three kinds of forward requests:


Figure 1. Record each node's request load

    // Runs on the Connected Node for every request it serves.
    nodes ← findNodes(key)
    for node ∈ nodes do
        load ← baseLoad                               // each forwarded write operation's load
        if blockNumber = 1 then
            load ← baseLoad * weightOne
        end if
        if blockNumber > 1 and isReadOperation() then
            load ← baseLoad * weightRead
        end if
        addLoad(node, load)
    end for

Figure 2. Change Connected Node

    maxNode ← maxLoadNode()
    if getLoad(maxNode) - getLoad(connectedNode) > changeFactor * totalClusterLoad() then
        changeConnectedNode(maxNode)
    end if

Figure 3. Storage balance algorithm

    localUsedRatio ← localUsedSize / localTotalSize
    averageUsedRatio ← getClusterAverageUsedRatio()
    if localUsedRatio > moveRatio * averageUsedRatio then
        candidateNodes ← all nodes whose (usedSize + localUsedSize) / (totalSize + localTotalSize)
                         < moveRatio * averageUsedRatio
        targetNode ← minUsedRatio(candidateNodes)
        move the local node so that newLocalUsedRatio equals newTargetNodeUsedRatio
    end if

• K1: The request's consistency level is Consistency.One. In this situation the Connected Node only forwards the request to the closest replica and waits for its response.

• K2: The request is a read and its consistency level is not Consistency.One. The Connected Node forwards two types of requests: a read request routed to its closest replica, which returns the whole message to the Connected Node, and read digest requests sent to the other replicas, which only return message digests. After receiving all the responses, the system digests the full message and compares it with the digests from the other replicas to check whether they hold the same version.

• K3: The request is a write and its consistency level is not Consistency.One. The Connected Node forwards the request and waits for blockNumber write responses according to the consistency level. Here blockNumber is N/2+1 if the consistency level is Consistency.Quorum, or N if it is Consistency.All.

When accessing a data item, the benefit of shifting the Connected Node from the original node to one of the replicas responsible for that item differs for each kind of forward request. In K1, if the Connected Node is one of the replicas, it does not need to wait for any message from remote nodes, which improves its response time a lot. In K2, it still needs to wait for read digest messages from the other replicas, but the Connected Node can handle the read request itself. In K3, it still needs to wait for (blockNumber - 1) write responses from the other replicas, so compared with the other two situations the improvement is limited.

Based on this observation, we propose our improvement. The idea is to record all nodes' request load on the Connected Node and to assign different kinds of requests different weights. At a specified interval, the system compares the maximum request load with that of the Connected Node; if it is much larger, we change the Connected Node to that node. Fig. 1 and Fig. 2 describe the pseudocode.
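The following Java sketch illustrates the logic of Fig. 1 and Fig. 2 together, using the example values given later in Section IV.A (baseLoad = 1, weightOne = 2, weightRead = 1.2, changeFactor = 5%); class and method names such as RequestLoadTracker and findNodes are illustrative assumptions rather than Cassandra internals.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Weighted per-node load accounting (Fig. 1) and the switch decision (Fig. 2).
    public class RequestLoadTracker {
        private final Map<String, Double> loadByNode = new HashMap<>();
        private final double baseLoad = 1.0;     // example values from Section IV.A
        private final double weightOne = 2.0;    // K1: Consistency.One requests
        private final double weightRead = 1.2;   // K2: reads above Consistency.One
        private final double changeFactor = 0.05;

        // Called on the Connected Node for every request; replicaNodes = findNodes(key).
        public void record(List<String> replicaNodes, int blockNumber, boolean isRead) {
            for (String node : replicaNodes) {
                double load = baseLoad;              // default: K3 write
                if (blockNumber == 1) {
                    load = baseLoad * weightOne;     // K1 benefits most from a local replica
                } else if (isRead) {
                    load = baseLoad * weightRead;    // K2 benefits moderately
                }
                loadByNode.merge(node, load, Double::sum);
            }
        }

        // Periodic check: switch if some node's load exceeds the current
        // Connected Node's load by more than changeFactor of the total load.
        public String maybeSwitch(String connectedNode) {
            double total = loadByNode.values().stream().mapToDouble(Double::doubleValue).sum();
            String maxNode = connectedNode;
            double maxLoad = loadByNode.getOrDefault(connectedNode, 0.0);
            for (Map.Entry<String, Double> e : loadByNode.entrySet()) {
                if (e.getValue() > maxLoad) {
                    maxNode = e.getKey();
                    maxLoad = e.getValue();
                }
            }
            double connectedLoad = loadByNode.getOrDefault(connectedNode, 0.0);
            if (maxLoad - connectedLoad > changeFactor * total) {
                return maxNode;   // caller reconnects the client to this node
            }
            return connectedNode; // keep the current Connected Node
        }
    }

For example, a Consistency.One request whose replicas are Node3 and Node4 would add 2.0 to both counters, while a quorum write would add only 1.0 per replica, reflecting that shifting the Connected Node helps K1 requests most.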

B. Consider node storage capacity when balancing data

In order to balance the data load among the nodes, Cassandra monitors each node's data load information on the ring. If it finds that a node is overloaded, the system alleviates its load by moving that node's position along the ring. The detailed checking and moving algorithms are described in [12].

However, Cassandra assumes each node has the same storage capacity: it only monitors the storage size each node has used and then uses this information to judge whether a node is overloaded. In reality, within one Cassandra cluster different commodity server nodes may have different storage capacities, which also needs to be taken into consideration when balancing the data.

In order to maximize each node's storage utilization, we propose an enhanced data balancing algorithm. We compare each node's local storage used ratio with the whole cluster's average used ratio; if the local used ratio exceeds moveRatio (a configurable value) times the average, we say this node is overloaded. If a node is found to be overloaded, we select as candidates the nodes that have enough free space so that, after the overloaded node's position is moved, both they and the overloaded node have a used ratio below moveRatio times the average. If there is more than one candidate, we select the one with the minimum original used ratio as the target node and move the overloaded node next to it to balance the data between them.

Since our balancing algorithm is based on the average used ratio, if two nodes' total storage capacities are very different, their used storage sizes can still differ greatly even when their used ratios are similar. This means one node holds many more data items than the other, which often also means it receives much more request load. We mitigate this potential issue with a variable called allowCapacityRatio: for any node whose total storage is larger than allowCapacityRatio times the smallest node's total storage, we use this maximum allowed capacity as its total capacity instead.
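A minimal Java sketch of the used-ratio check of Fig. 3 combined with the allowCapacityRatio cap is given below; the NodeStats record and the way the cluster average is computed over capped capacities are our assumptions, since the paper does not spell out those details.

    import java.util.List;

    // Used-ratio-based balance check (Fig. 3) with the allowCapacityRatio cap.
    public class StorageBalancer {
        public record NodeStats(String name, long usedSize, long totalSize) {}

        private final double moveRatio = 1.5;          // example value from Section IV.B
        private final double allowCapacityRatio = 2.0; // cap on counted capacity

        // Cap a node's counted capacity at allowCapacityRatio times the smallest node.
        private long effectiveCapacity(NodeStats n, long minTotal) {
            return Math.min(n.totalSize(), (long) (allowCapacityRatio * minTotal));
        }

        // Returns the target node the local node should move next to, or null if not overloaded.
        public NodeStats pickTarget(NodeStats local, List<NodeStats> others) {
            long minTotal = others.stream().mapToLong(NodeStats::totalSize)
                                  .min().orElse(local.totalSize());
            minTotal = Math.min(minTotal, local.totalSize());

            long usedSum = local.usedSize();
            long capSum = effectiveCapacity(local, minTotal);
            for (NodeStats n : others) {
                usedSum += n.usedSize();
                capSum += effectiveCapacity(n, minTotal);
            }
            double averageUsedRatio = (double) usedSum / capSum;
            double localUsedRatio = (double) local.usedSize() / effectiveCapacity(local, minTotal);
            if (localUsedRatio <= moveRatio * averageUsedRatio) {
                return null; // local node is not overloaded
            }

            NodeStats target = null;
            double bestRatio = Double.MAX_VALUE;
            for (NodeStats n : others) {
                long cap = effectiveCapacity(n, minTotal);
                // Candidate test: combined used ratio after taking over the local data
                // must stay below moveRatio times the cluster average.
                double combined = (double) (n.usedSize() + local.usedSize())
                                  / (cap + effectiveCapacity(local, minTotal));
                double ownRatio = (double) n.usedSize() / cap;
                if (combined < moveRatio * averageUsedRatio && ownRatio < bestRatio) {
                    bestRatio = ownRatio;
                    target = n; // pick the candidate with the minimum original used ratio
                }
            }
            return target;
        }
    }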



IV. EXPERIMENT

We ran a series of experiments to evaluate the results. The base Cassandra version we used is 0.6.4. We set up a cluster of 6 commodity server nodes; all nodes are within the same LAN and the same rack.

We made some changes to the TPC-W benchmark application and used it against our system. The requests in TPC-W follow a Zipf distribution, which is often used in the web application domain to simulate users' real access patterns.
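For readers who want to reproduce a similarly skewed workload, here is a small self-contained Zipf key sampler; the skew exponent, seed and class name are illustrative choices and are not taken from the TPC-W specification.

    import java.util.Random;

    // Minimal Zipf-distributed key sampler, illustrating the kind of skewed
    // workload a TPC-W style benchmark generates.
    public class ZipfSampler {
        private final double[] cdf;
        private final Random rng = new Random(42);

        public ZipfSampler(int numKeys, double skew) {
            double norm = 0.0;
            double[] weights = new double[numKeys];
            for (int i = 0; i < numKeys; i++) {
                weights[i] = 1.0 / Math.pow(i + 1, skew); // rank-based popularity
                norm += weights[i];
            }
            cdf = new double[numKeys];
            double acc = 0.0;
            for (int i = 0; i < numKeys; i++) {
                acc += weights[i] / norm;
                cdf[i] = acc;
            }
        }

        // Returns a key rank in [0, numKeys); low ranks are requested far more often.
        public int nextKey() {
            double u = rng.nextDouble();
            int lo = 0, hi = cdf.length - 1;
            while (lo < hi) {              // binary search over the CDF
                int mid = (lo + hi) / 2;
                if (cdf[mid] < u) lo = mid + 1; else hi = mid;
            }
            return lo;
        }
    }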

A. The result of minimizing forward request load

The criterion we use here is the total number of forward requests for each node. We assign the following values to the variables described in Fig. 1: baseLoad = 1, weightOne = 2, weightRead = 1.2. For Fig. 2, we set changeFactor = 5% and have the system check once a day whether the Connected Node needs to be changed.

To see how the replication factor and different consistency levels affect the results, we set up 3 rounds of tests; each round runs for 24 hours, and the unit of the request numbers in the following tables is ten thousand.

1) Replication Factor = 3, W = 1, R = 1

In this round, each data item has 3 replicas distributed on 3 different nodes, and all operations' consistency levels are set to Consistency.One. This means a write operation only touches one node, and a read operation only waits for one response while the other responses are received asynchronously.

We first connected the client to Node 3. Since its Factor Diff is 1.8%, which is smaller than changeFactor, the Connected Node is not changed after running the algorithm. However, as Table I shows, if we connect to Node1 or Node2 at first, the Connected Node will be changed to Node4. In this round there are no K2 or K3 requests, since all operations use Consistency.One. Factor Diff is defined as:

Factor Diff_i = (load_max - load_i) / load_total

Table II shows that if the Connected Node is Node1 at first, the change reduces the forward read requests by 34.8% and the forward write requests by 31%. For Node2, the values are 29.4% and 25.7% respectively. No read digest requests need to be forwarded synchronously in this round.

TABLE I. REQUEST LOAD DIFFERENCE IN ROUND 1

              Node1   Node2   Node3   Node4   Node5   Node6
K1 Request    232     256     315     349     323     266
Total Load    464     512     630     698     646     532
Factor Diff   6.4%    5.0%    1.8%    0%      1.4%    4.5%

TABLE II. FORWARD REQUEST REDUCED RATIO IN ROUND 1

         Forward Read   Forward Write   Read Reduced   Write Reduced
         Request        Request         Ratio          Ratio
Node1    236            113             34.8%          31%
Node2    218            105             29.4%          25.7%
Node4    154            78              0%             0%

2) Replication Factor = 2, W = 1, R = 1

In this round we change the replication factor to 2; the purpose is to see how the number of replicas affects our algorithm. From Table III we can see that if we connect our TPC-W client application to Node1, Node2 or Node3 at first, the Connected Node will be changed to Node5 after running the algorithm. Table IV tells us that if the Connected Node is shifted from Node1 to Node5, the forward read requests are reduced by 25.2% and the forward write requests by 18%. When it is changed from Node2 to Node5, the results are 22.5% and 19.2% respectively. For Node3, the results are 17.0% and 11.6%.

Comparing with the results in Round 1, we find that when the other configuration remains the same, the fewer replicas we use, the more likely the Connected Node is to be changed, but for each change the improvement is smaller than what we get in Round 1.

TABLE III. REQUEST LOAD DIFFERENCE IN ROUND 2

              Node1   Node2   Node3   Node4   Node5   Node6
K1 Request    147     156     187     230     247     195
Total Load    294     312     374     460     494     390
Factor Diff   8.6%    7.8%    5.2%    1.5%    0%      4.5%

TABLE IV. FORWARD REQUEST REDUCED RATIO IN ROUND 2

         Forward Read   Forward Write   Read Reduced   Write Reduced
         Request        Request         Ratio          Ratio
Node1    295            139             25.2%          18%
Node2    284            141             22.5%          19.2%
Node3    265            129             17.0%          11.6%
Node5    220            114             0%             0%

3) Replication Factor = 3, W = 2, R = 2

In Round 3 we change both the read and write consistency levels to Consistency.Quorum, to see how this affects our algorithm.

Since no request uses Consistency.One, there are no K1 requests. Table V presents the detailed results, from which we can see that if the client application first connects to Node1 or Node2, the Connected Node needs to be changed to Node4. As Table VI shows, the read reduced ratio is the same as in Round 1, while the write reduced ratio is smaller than in Round 1. This means that for the same number of replicas, the stricter the consistency level, the smaller the improvement.

TABLE V. REQUEST LOAD DIFFERENCE IN ROUND 3

              Node1   Node2   Node3   Node4   Node5   Node6
K2 Request    154     171     213     236     218     177
K3 Request    78      84      102     113     105     89
Total Load    262.8   289.2   357.6   396.2   366.6   301.4
Factor Diff   6.8%    5.4%    2.0%    0%      1.5%    4.8%

TABLE VI. FORWARD REQUEST REDUCED RATIO IN ROUND 3

         Forward Read   Forward Read     Forward Write   Read Reduced   Write Reduced
         Request        Digest Request   Request         Ratio          Ratio
Node1    236            390              304             34.8%          11.5%
Node2    218            390              296             29.4%          9.1%
Node4    154            390              269             0%             0%


B. The result of considering storage capacity

In this experiment we set moveRatio = 1.5 and allowCapacityRatio = 2.

First we ran TPC-W for a long time to populate enough data into the cluster nodes. Then we ran the load balance script command several times. In round one we used Cassandra's original load balance algorithm; in round two we used our algorithm. Fig. 4 and Fig. 5 present the results, where LB1 is the load balance algorithm provided by Cassandra and LB2 is our algorithm.

From Fig. 4 we can see that although LB2 cannot distribute data across the nodes as evenly as LB1, compared with the original load distribution it still has a significant effect.

From Fig. 5 it is obvious that our algorithm utilizes the different nodes' storage capacities much better than the original one. As the utilization is more balanced, the whole cluster can store more data, which means the overall storage utilization is improved.

Figure 4. Storage size used by each node

Figure 5. Storage utilization of each node

V. CONCLUSIONS AND FUTURE WORK

In this paper we have presented two ways to improve Cassandra so that it is aware of request skew and takes the different nodes' capacities into account when balancing the storage load.

First, we propose an algorithm that minimizes forward requests by dynamically shifting the Connected Node to the node that can handle the largest number of requests locally.

Second, we present a new idea that improves the storage utilization of each node by using the used ratio instead of the used size when balancing data storage.

We then ran several experiments to evaluate the effectiveness of our approach. The results show that in different scenarios we can substantially reduce both forward read requests and forward write requests. The experiments also show that the storage utilization is noticeably better balanced and improved.

For now, we assume all nodes are within the same datacenter; we will extend our research to multiple datacenters in the future. Also, currently all data has the same number of replicas; as a next step, we will consider adding additional adaptive replicas for the nodes that contain hot-spot data.

REFERENCES

[1] S. Ghemawat, H. Gobioff and S. Leung, "The Google File System", in 19th Symposium on Operating Systems Principles, Lake George, New York, 2003, pp. 29–43.

[2] F. Chang et al., "Bigtable: A distributed storage system for structured data", in Proc. OSDI, 2006, pp. 205–218.

[3] G. DeCandia et al., “Dynamo: amazon’s highly available key-value store”, In Proc. SOSP, 2007, pp. 205-220.

[4] B. F. Cooper et al., “PNUTS: Yahoo!’s Hosted Data Serving Platform”, Proc. VLDB Endow, vol. 1, pp. 1277-1288, August 2008.

[5] A. Lakshman and P. Malik, "Cassandra: A decentralized structured storage system", SIGOPS Oper. Syst. Rev., vol. 44, pp. 35-40, 2009.

[6] M. Stonebraker, “SQL Databases v. NoSQL Databases”, Commun. ACM, vol. 53, pp. 10-11, April 2010.

[7] N. Leavitt, “Will NoSQL Databases Live Up to Their Promise?”, Computer, vol. 43, pp. 12-14, February 2010.

[8] E. A. Brewer, “Towards robust distributed systems”, Principles of Distributed Computing, Portland, Oregon, July 2000.

[9] H. T. Vo, C. Chen and B. C. Ooi, “Towards elastic transactional cloud storage with range query support”, Proc. VLDB Endow., vol. 3, pp. 506–517, 2010.

[10] S. Bianchi, S. Serbu, P. Felber and P. Kropf, “Adaptive Load Balancing for DHT Lookups”, ICCCN, 2006, pp. 411-418.

[11] M. Abdallah and E. Buyukkaya, “Fair load balancing under skewed popularity patterns in heterogeneous DHT-based P2P systems”, International Conference on Parallel and Distributed Computing and Systems, 2007, pp. 484-490.

[12] M. Abdallah and H.C. Le, “Scalable Range Query Processing for Large-Scale Distributed Database Applications,” Proc. Int'l Conf. Parallel and Distributed Computing Systems (PDCS), 2005.