fault tolerance on cloud computing

A Novel High Adaptive Fault Tolerance Model in Real Time Cloud Computing

Parveen Kumar

Asst. Professor, Computer Science & Engineering Department,

National Institute of Technology, Uttarakhand, India

[email protected]

Gaurav Raj Asst. Professor,

ASET-CSE Amity University, Noida.

raj@

Anjandeep Kaur Rai Department of Computer Science and

Engineering, Lovely Professional University

Phagwara, India [email protected]

Abstract: Cloud computing is now used in a variety of fields, including storage, computation and education. It has emerged from a number of technologies such as utility computing, grid computing and cluster computing. It offers advantages such as on-demand access, resource pooling and device independence, but it also faces challenges in security, workflow management and fault tolerance. Here, a novel model (HAFTRC) is proposed that provides high adaptive fault tolerance in real time cloud computing. The model computes the reliability of each virtual machine on the basis of its cloudlets, MIPS, RAM and bandwidth. The virtual machine with the highest reliability is chosen as the winning virtual machine. If two virtual machines end up with the same reliability, the winning machine is chosen based on the priority assigned to them.

Keywords: Reliability, fault tolerance, priority scheduling, timeliness.

I. INTRODUCTION

Cloud computing refers to a model that provides broad, scalable and always available access to a variety of resources such as infrastructure, platform, software and storage over the internet, which cloud users can access according to their requirements [1]. It supports the reuse of such resources across the boundaries of particular organizations [2]. For example, with cloud computing, a user does not need to install Microsoft Word on each of his 45 workstations; instead, he only needs access to the internet to use the required software, which is hosted at some other location, and he does not need to worry about licensing, installation or the application of patches [3].

Real time systems are employed in a variety of applications such as railway reservation systems, mobile phones, robotics, laser printers, pacemakers and video conferencing [4]. Real time systems have two main characteristics, viz. timeliness and fault tolerance [5]. Timeliness is the property of the system to work correctly within the prescribed amount of time, and fault tolerance is the ability of the system to keep working gracefully in the presence of a fault, so that the user does not notice that any fault has occurred in the system [6].

The cloud offers low latency and high performance to such systems, but it also increases the chances of errors because the nodes are located far from each other [7]. Real time systems are also critical in nature, so they need a mechanism that allows them to keep working even if something goes wrong in the system. Hence there is a need for a mechanism that allows such systems to work well in the cloud. Here, a model, HAFTRC, is proposed for providing high fault tolerance to real time applications in the cloud infrastructure.

II. RELATED WORK

A large amount of work has already been done in the area of real time systems, but there is still considerable scope for research on fault tolerance in real time systems on cloud infrastructure. Anjali D. Meshram et al. [8] presented FTMC (Fault Tolerance Model for Cloud computing), in which virtual machines run different algorithms and their respective reliabilities are calculated based on whether they produce the correct result within the time limit. If a virtual machine does so, its reliability increases; otherwise it decreases. A checkpoint is also included in the model so that backward recovery can be performed in case of complete failure of the system. Sheheryar Malik et al. [7] presented a model that lets the system handle faults and make decisions according to the reliability of the virtual machines. The reliability of a virtual machine is adaptive, i.e. it changes after every computing cycle: it increases if the virtual machine produces the correct result within time and decreases if it fails to do so. In addition, if a node's reliability keeps decreasing, the node is removed and a new node is added in its place. The reliability of every virtual machine is checked against a minimum reliability level; if the node achieves that level, it is accepted, otherwise the system performs backward recovery. Sheheryar Malik et al. [9] gave a model based on the idea of time stamped fault tolerance. This model adopts a distributed computing methodology along with a feed forward artificial neural network. It comprises a forward as well as a backward

978-1-4799-4236-7/14/$31.00 ©2014 IEEE

Figure 1: Proposed Model (HAFTRC) [11]

recovery mechanism. Weights are assigned to the nodes, which run a variety of algorithms. The proposed model is based on the adaptive reliability of the virtual machine, i.e. the reliability of the node changes after every computing cycle. Fault tolerance is achieved depending on the reliability of the virtual machines.

III. PROPOSED MODEL (HAFTRC)

Here, a model HAFTRC (High Adaptive Fault Tolerance in Real time Cloud computing) is introduced (Figure 1). This model handles faults on the basis of the reliability of the virtual machines. It consists of two types of nodes: the first node consists of a set of virtual machines and an acceptance module, and the second node, the adjudicator node, consists of three modules for elasticity calculation, reliability calculation and decision making.

A. Working

This model comprises 'N' virtual machines or nodes, each of which runs some operation. The Acceptance Module (AM) verifies whether the output produced by a virtual machine is correct and was produced within the time limit. On the basis of the result produced by the AM, the Elasticity Calculation (EC) module checks whether the failed cloudlets are eligible for some elasticity in terms of CPU cycles. If they are, they are declared as passed; otherwise they are declared as failed. The Reliability Calculation (RC) module then calculates the reliabilities of the virtual machines and compares them with the System Reliability Level (SRL). The nodes whose reliability is equal to or greater than the SRL are passed to the Decision Making (DM) module. The DM module makes the final decision by selecting the most reliable node from the reliabilities of the passed virtual machines given by the RC module. The node with the highest reliability is selected as the final output. If two nodes have the same highest reliability, the winning virtual machine is selected according to priority. Priority is assigned according to MIPS.

B. Model Description

The Acceptance Module (AM) is responsible for two things: first, it checks whether the operation that has been performed is correct; second, it makes sure that the operation has been performed within the prescribed amount of time. Each node or virtual machine takes a particular input, executes the operation and produces the output. The AM only passes the results of all the nodes to the elasticity module. It also informs the Elasticity Calculation (EC) module so that it can determine whether elasticity can be provided to the nodes in terms of CPU cycles.
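The two checks performed by the AM can be illustrated with a minimal Python sketch. The `Cloudlet` record, its field names and the deadline values are hypothetical and are not taken from the CloudSim implementation.

```python
from dataclasses import dataclass


@dataclass
class Cloudlet:
    """Hypothetical record of one finished task (not a CloudSim class)."""
    cloudlet_id: int
    succeeded: bool      # did the operation produce the correct result?
    finish_time: float   # time taken to complete the operation
    deadline: float      # prescribed amount of time for the operation


def acceptance_test(cloudlet: Cloudlet) -> bool:
    """Return True only if the result is correct and produced in time."""
    return cloudlet.succeeded and cloudlet.finish_time <= cloudlet.deadline


cloudlets = [
    Cloudlet(1, True, 8.0, 10.0),    # correct and on time
    Cloudlet(2, True, 12.0, 10.0),   # correct but late
    Cloudlet(3, False, 5.0, 10.0),   # on time but incorrect
]
am_results = {c.cloudlet_id: acceptance_test(c) for c in cloudlets}
print(am_results)
```

Only cloudlet 1 satisfies both conditions here; the other two would be handed to the EC module as failed.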

The Elasticity Calculation (EC) module analyses the cloudlets and determines whether each failed cloudlet is eligible for an elasticity of 15 CPU cycles. If the cloudlet is eligible, it is granted the elasticity and its status is changed from fail to pass. With this approach, more cloudlets can succeed than would otherwise. If the cloudlet is not eligible for elasticity, it is discarded and declared as failed.
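A minimal sketch of the elasticity rule follows, assuming each failed cloudlet reports how many additional CPU cycles it would need to finish. The function name and the `extra_cycles_needed` mapping are hypothetical; only the budget of 15 CPU cycles comes from the model.

```python
ELASTICITY_CYCLES = 15  # elasticity budget used by the model


def apply_elasticity(extra_cycles_needed: dict[int, int]) -> dict[int, str]:
    """Map each failed cloudlet id to its status after the EC module.

    extra_cycles_needed: hypothetical input, cloudlet id -> CPU cycles
    the cloudlet still needs in order to finish.
    """
    status = {}
    for cloudlet_id, cycles in extra_cycles_needed.items():
        # A cloudlet that fits within the budget is allowed to finish.
        status[cloudlet_id] = "pass" if cycles <= ELASTICITY_CYCLES else "fail"
    return status


ec_status = apply_elasticity({4: 40, 6: 10})
print(ec_status)  # {4: 'fail', 6: 'pass'}
```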

The Reliability Calculation (RC) module is the heart of this model. This module is responsible for analyzing the reliability of each node. The reliability of a virtual machine is adaptive, that is, it changes after every

2014 5th International Conference- Confluence The Next Generation Information Technology Summit (Confluence) 139

computing cycle. Each of the following conditions contributes to an increase in the reliability of a virtual machine:

• The amount of RAM in the host is greater than the amount of RAM in each virtual machine.

• The amount of MIPS in the host is greater than the amount of MIPS in each virtual machine.

• The bandwidth in the host is greater than the bandwidth in each virtual machine.

• All of the cloudlets on the virtual machine succeed.

The reliability of a virtual machine decreases when any of the above factors fails. In particular, if any cloudlet fails, the reliability of the virtual machine on which it is running decreases to some extent. The more cloudlets fail, the greater the decrease in reliability.
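The update rule can be sketched as follows. The model does not state numeric update amounts, so the step size `STEP`, the clamping to the range [0, 1] and the record fields below are assumptions made for illustration only.

```python
from dataclasses import dataclass

STEP = 0.1  # hypothetical adjustment per computing cycle


@dataclass
class Resources:
    """Hypothetical resource description for a host or a virtual machine."""
    ram: int
    mips: int
    bandwidth: int


def update_reliability(reliability: float, host: Resources, vm: Resources,
                       failed_cloudlets: int) -> float:
    """Return the reliability of a VM after one computing cycle."""
    host_is_sufficient = (host.ram > vm.ram and host.mips > vm.mips
                          and host.bandwidth > vm.bandwidth)
    if not host_is_sufficient:
        reliability -= STEP
    elif failed_cloudlets > 0:
        # More failed cloudlets give a larger decrease.
        reliability -= STEP * failed_cloudlets
    else:
        reliability += STEP
    # Assumption: reliability is kept within [0, 1].
    return min(1.0, max(0.0, reliability))


host = Resources(ram=2048, mips=1000, bandwidth=10000)
vm = Resources(ram=512, mips=300, bandwidth=1000)
after_success = update_reliability(0.5, host, vm, failed_cloudlets=0)
after_failure = update_reliability(0.5, host, vm, failed_cloudlets=2)
print(after_success, after_failure)
```

With these assumed values, a cycle without failures raises the reliability by one step, while two failed cloudlets lower it by two steps.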

The Decision Making (DM) module selects the virtual machine that has the highest reliability among all the nodes. If two nodes have the same highest reliability, the winning virtual machine is selected according to the priority assigned. The node with the highest priority is regarded as the more reliable node and is declared the winner. The priority of a virtual machine follows its MIPS, i.e. the node with the highest MIPS is given the highest priority and so on. This model is very reliable, as it continues to operate even if some of the nodes fail, i.e. until all the nodes fail.

IV. FAULT TOLERANCE MECHANISM

Here, the algorithms of the various modules are discussed.

Algorithm for Acceptance Module (AM)

If (cloudlet.Status = Success && cloudlet finishes in prescribed time)
Then
    The cloudlet will move to the next stage

Algorithm for Elasticity Calculation (EC) Module

If cloudlet needs at most 15 more CPU cycles to complete its execution
Then
    Cloudlet is allowed to complete its execution
Else
    The cloudlet is designated as failed and is not allowed to move to next level

Algorithm for Reliability Calculation (RC) module

If (total amount of RAM, MIPS and bandwidth in the host is less than amount of RAM, MIPS and bandwidth in each virtual machine)
Then
    Reliability decreases
Else if cloudlet fails
Then
    Reliability decreases
Else
    Reliability increases

Algorithm for Decision Making (DM) Module

If (first machine is having higher reliability than the second machine)
Then
    First machine is declared as the reliable machine
Else if (two machines have the same highest reliability)
Then
    The machine with higher priority is declared as the best machine
Else
    No machine is declared as the reliable machine
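The selection step, together with the SRL filter and the MIPS-based priority described in Section III, can be sketched in Python. The `Vm` record and the function name are hypothetical, and the reliability values and threshold in the example are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Vm:
    """Hypothetical view of a virtual machine as seen by the DM module."""
    name: str
    reliability: float
    mips: int  # priority follows MIPS: higher MIPS means higher priority


def select_vm(vms: list[Vm], srl: float) -> Optional[Vm]:
    """Pick the most reliable VM at or above the SRL, or None if none passes."""
    passed = [vm for vm in vms if vm.reliability >= srl]
    if not passed:
        return None
    # Compare by reliability first; equal reliabilities are resolved by MIPS.
    return max(passed, key=lambda vm: (vm.reliability, vm.mips))


vms = [Vm("VM1", 0.5, 200), Vm("VM2", 0.8, 300), Vm("VM3", 0.8, 400)]
winner = select_vm(vms, srl=0.6)
print(winner.name if winner else "no reliable machine")  # VM3
```

Sorting on the pair (reliability, MIPS) keeps the tie-break in a single comparison; returning `None` corresponds to the final branch, in which no machine is declared reliable.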

V. EXPERIMENTS AND RESULTS

The High Adaptive Fault Tolerance in Real Time Cloud Computing (HAFTRC) model is implemented in the CloudSim simulator. The version used is CloudSim 3.0.2, which is a bug-fix release. It has the following updates over the previous version, CloudSim 3.0.0:

• The problem with the ant class path declaration has been fixed.

• The calculation of MIPS in PowerVmAllocationPolicyMigrationAbstract.findHostForVm() has been corrected.

• References have been updated to the CCPE paper [12].

Here, 3 virtual machines are created and two tasks are run on each virtual machine, i.e. there are 6 tasks or cloudlets in total. In this simulation, the following parameters are assumed:

• SRL (System Reliability Level) value is assumed to be 0.6.

• An elasticity of 15 CPU cycles is provided to failed cloudlets.

The MIPS of the virtual machines is changed from case to case and the results are recorded.

Case 1: Here, only the MIPS value of VM1 is changed; the rest are kept the same.

MIPS of VM1: 200, VM2: 300, VM3: 400

In this case (Figure 2), all the cloudlets run on the three virtual machines. Cloudlets 6 and 4 fail and both are moved to the elasticity calculation module, where cloudlet 6 is given elasticity, so it passes.


Figure 2: MIPS of VM1 changed- VM1: 200, VM2: 300, VM3: 400

After that, only cloudlet 4 has failed status. Now, the reliability is calculated on the basis of MIPS, RAM and bandwidth; as every host has all of these parameters greater than the virtual machines, this works in their favour. However, because of the failure of cloudlet 4, the reliability of VM1 decreases. The virtual machines are then checked to find which of them have reliability greater than or equal to the SRL. Here, virtual machines 2 and 3 have reliability greater than or equal to the SRL, so both are passed. Finally, to select the most reliable machine, the virtual machines are compared according to their priorities. As virtual machine 3 has higher priority than virtual machine 2, it is declared the more reliable machine.

MIPS of VM1: 250, VM2: 300, VM3: 400

In this case (Figure 3), all the cloudlets run on the three virtual machines. Cloudlet 6 fails and is moved to the elasticity calculation module, where it is given elasticity, so it passes. Hence, there are now no failed cloudlets.


Figure 3: MIPS of VM1 changed- VM1: 250, VM2: 300, VM3: 400

Then, the reliability is calculated on the basis of MIPS, RAM and bandwidth; as every host has all of these parameters greater than the virtual machines, this works in their favour. In addition, because all the cloudlets succeed, the reliability increases as well. The virtual machines are then checked to find which of them have reliability greater than or equal to the SRL. Here, all the virtual machines (1, 2 and 3) have reliability greater than or equal to the SRL, so all are considered passed. Finally, to select the most reliable machine, the virtual machines are compared according to their priorities. As virtual machine 3 has higher priority than virtual machines 1 and 2, it is declared the more reliable machine.

Case 2: Now, only the MIPS value of VM2 is changed; the rest are kept the same.

MIPS of VM1:200, VM2:250, VM3:400

In this case (Figure 4), all the cloudlets run on the three virtual machines.

Figure 4: MIPS of VM2 changed- VM1: 200, VM2: 250, VM3: 400

Cloudlets 6 and 4 fail and both are moved to the elasticity calculation module, where cloudlet 6 is given elasticity, so it passes. After that, only cloudlet 4 has failed status. Now, the reliability is calculated on the basis of MIPS, RAM and bandwidth; as every host has all of these parameters greater than the virtual machines, this works in their favour. However, because of the failure of cloudlet 4, the reliability of VM1 decreases. The virtual machines are then checked to find which of them have reliability greater than or equal to the SRL. Here, virtual machines 2 and 3 have reliability greater than or equal to the SRL, so both are passed. Finally, to select the most reliable machine, the virtual machines are compared according to their priorities. As virtual machine 3 has higher priority than virtual machine 2, it is declared the more reliable machine.

MIPS of VM1: 200, VM2: 410, VM3: 400

In this case (Figure 5), 4 cloudlets run on two virtual machines. Of the running cloudlets, cloudlet 4 fails and is passed to the elasticity calculation module.


As the cloudlet is not eligible for elasticity, it is ultimately declared as failed. Because of this failure, the reliability of the virtual machine on which it ran decreases. VM2 and VM3 are declared as passed, as their reliability is equal to or greater than the SRL. In the end, VM2 is selected as the more reliable machine, as it has higher priority than VM3.
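As a cross-check of the priority rule on the four configurations of Cases 1 and 2, the sketch below picks the highest-MIPS machine among the virtual machines that the text reports as passing the SRL check. It assumes, as the descriptions imply, that the final choice among the passed machines is made by priority alone; the data structure and function name are hypothetical, and the values are copied from the two cases.

```python
# Each entry: MIPS per VM, the VMs reported as passing the SRL check,
# and the winner reported in the text.
cases = [
    ({"VM1": 200, "VM2": 300, "VM3": 400}, ["VM2", "VM3"], "VM3"),
    ({"VM1": 250, "VM2": 300, "VM3": 400}, ["VM1", "VM2", "VM3"], "VM3"),
    ({"VM1": 200, "VM2": 250, "VM3": 400}, ["VM2", "VM3"], "VM3"),
    ({"VM1": 200, "VM2": 410, "VM3": 400}, ["VM2", "VM3"], "VM2"),
]


def winner_by_priority(mips: dict[str, int], passed: list[str]) -> str:
    """Return the passed VM with the highest MIPS (highest priority)."""
    return max(passed, key=lambda name: mips[name])


results = [winner_by_priority(mips, passed) for mips, passed, _ in cases]
print(results)  # ['VM3', 'VM3', 'VM3', 'VM2']
```

Under this assumption the rule reproduces the reported winner in all four configurations, including the switch to VM2 once its MIPS exceeds that of VM3.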

Case 3: Now, only the MIPS value of VM3 is changed; the rest are kept the same.

MIPS of VM1:200, VM2:300, VM3:215


In this case (Figure 6), all the cloudlets (1 to 6) run on the three virtual machines. Cloudlets 4, 5 and 6 fail, as they are unable to complete their tasks within the prescribed time. They are therefore passed on to the elasticity calculation module, where they are given the chance to become successful. However, as they are not eligible for the extra CPU cycles, i.e. they need more than 15 cycles to complete their tasks, they are declared as failed. Because of their failure, the reliability of the corresponding virtual machines decreases. In the end, only one virtual machine has reliability equal to or greater than the SRL, so it is declared the most reliable machine.

MIPS of VM1: 200, VM2: 300, VM3: 450

In this case (Figure 7), 6 cloudlets run on three virtual machines. One cloudlet fails and is hence passed to the EC module.


Figure 7: MIPS of VM3 changed- VM1: 200, VM2: 300, VM3: 450

As it is not eligible for 15 more CPU cycles, i.e. it requires more than 15 CPU cycles to complete its task, it is declared as failed. In the end, two machines have the same reliability. The winning machine is selected according to priority in this situation. As VM3 has the highest MIPS, it is declared the more reliable machine.

VI. DISCUSSIONS AND CONCLUSION

Fault tolerance is the capacity of a system to work normally even in the presence of a fault. Here, a novel model named HAFTRC (High Adaptive Fault Tolerance in Real-time Cloud computing) has been proposed. This model works on the principle of adaptive fault tolerance. The system has two main sets of modules. One set consists of the virtual machines, on which tasks or cloudlets run, along with the acceptance module. The other set is used for elasticity calculation, reliability calculation and decision making.


HAFTRC is a reliable option for providing fault tolerance to real time computing applications. The main advantage of this model is that it continues to function even if some of the cloudlets fail. It stops only when all the tasks fail.

VII. FUTURE WORK

Like any other model, HAFTRC can be enhanced so that it performs better in a cloud computing environment. More features or parameters can be added to this model to make it more fault tolerant. Here, parameters such as MIPS and RAM are used to check the reliability of the virtual machines; in future, further parameters such as the number of PEs, the number of users and the number of hosts can be added. In the decision making module, priority scheduling is used here to select the more reliable machine; in future, some other technique can be used to make the system more fault tolerant. The concept of elasticity in terms of CPU cycles can also be enhanced by increasing it up to a certain level. In addition, checkpointing can be introduced, which would allow the user to keep a record of the virtual machines that fail so that these failed machines can be recovered easily in future.

REFERENCES

[1] Mell Peter, Grance Timothy (2011), The NIST Definition of Cloud Computing, September, p. 7

[2] Harris Torry (2010), CLOUD COMPUTING-An Overview, Torry Harris Business Solutions, January

[3] Velte Anthony T., Velte Toby J., Elsenpeter Robert (2009) Cloud Computing: A Practical Approach, Tata McGraw Hill

[4] http://my.safaribooksonline.com/book/software-engineering-and-development/9788131700693/introduction/section_1.2

[5] W. T. Tsai, Q. Shao, X. Sun, J. Elston, “Real Time Service-Oriented Cloud Computing”, School of Computing, Informatics and Decision System Engineering, Arizona State University, USA

[6] J. Coenen, J. Hooman, “A Formal Approach to Fault Tolerance in Distributed Real-Time Systems”, Department of Mathematics and Computing Science, Eindhoven University of Technology, Netherlands

[7] Sheheryar Malik and Fabrice Huet, (2011) “Adaptive Fault tolerance in Real Time Cloud Computing”, 2011 IEEE World Congress on services, (pp. 28-287)

[8] Anjali D. Meshram, A.S. Sambare and S.D.Zade, (2013) “Fault Tolerance Model for Reliable Cloud Computing”, International Journal on Recent and Innovation Trends in Computing and Communication Volume:1 Issue:7, (pp. 600-603)

[9] Sheheryar Malik and M.J. Rehman,(2005) “Time Stamped Fault Tolerance in Real Time systems”, 9th International Multitopic Conference, IEEE INMIC 2005, (pp. 1-5)

[10] M. Young, The Technical Writer's Handbook. Mill Valley, CA: University Science, 1989.

[11] Anjandeep Kaur Rai, Parveen Kumar, Pradheep Manisekaran, (2014) ”High Adaptive Fault Tolerance in Cloud Computing”, IOSR Journal of Engineering (IOSRJEN) Vol. 04, Issue 03 (March. 2014), (pp. 24-27)

[12] https://code.google.com/p/cloudsim/downloads/detail?name=cloudsim-3.0.2.zip&can=2&q=

[13] Raj Gaurav, Munish Katoch, "Security Implementation through PCRE Signature over Cloud Network", Advanced Computing: An International journal, May 2012, Vol. 3 No. 3 ISSN: 2229 - 6727[Online]; 2229 - 726X [Print],pg no. 119-127.

[14] Raj Gaurav, Nitika, shaveta,"Comparative Analysis of Load Balancing Algorithms in Cloud Computing", International Journal of Advanced Research in Computer Engineering & Technology, Vol. 1 No. 3 (2012)(ISSN:2278-1323), pg no. 120 -124.

[15] Raj Gaurav, Ankit Nischal, "Efficient Resource Allocation in Resource Provisioning Policies Over Resource Cloud Communication Paradigm", International Journal on Cloud Computing: Services and Architecture, June 2012, Vol. 2, No. 3, ISSN: 2231 - 5853[Online]; 2231 - 6663 [Print], pg no. 11 - 18.

[16] Raj Gaurav, Kamaljeet Kaur, "Secure Cloud Communication for Effective Cost Management System Through MSBE", International Journal on Cloud Computing: Services and Architecture, June 2012, Vol. 2, No. 3, ISSN: 2231 - 5853[Online]; 2231 - 6663 [Print], pg. no. 19 - 30.