Big Data analytics using Hadoop over cloud
Govinda K.1, M. Srikanth Deelip2
1,2SCOPE, VIT University, Vellore, India
[email protected], [email protected]
Abstract. Innovation in technology and using technology to improve the success rate are two of the most important factors for an organization or business. Cloud computing, a new technology, is known for storing and processing huge amounts of data at low cost. In this work, a MapReduce-based Apriori algorithm is implemented on a Eucalyptus cloud. The cloud stores huge amounts of data, and processing the stored data is expensive because cloud computing offers multiple services on a pay-as-you-go model. Analyzing huge amounts of data in cloud computing requires data mining techniques. In this paper, Hadoop with a multi-cluster model is deployed on virtual machines in a Eucalyptus private cloud infrastructure. The performance of the proposed algorithm is tested with multiple nodes in the cluster on multiple datasets.
Keywords: Cloud Computing, Hadoop, Apriori algorithm, Big data, Analytics.
1 Introduction
The age of massive amounts of data brings opportunities to discover and use knowledge from that data in the business domain to increase return on investment. A data mining tool is a representative instrument that analyzes data from various angles and locates significant knowledge within it. Data mining also helps in the classification of data, prediction of values and organization of data, and it discovers patterns and correlations in datasets. Handling vast amounts of data leads to technical issues such as data storage and data exchange. Organizing the data flow and data resources between compute and storage resources is becoming difficult. Analyzing, visualizing and disseminating large datasets is the key task. Recently, data-intensive computing has come to be treated as the fourth paradigm of discovery, after experimental, theoretical and computational science.
Cloud computing is a new model with a pool of resources comprising a large number of computers. It satisfies the demand for advances in data storage and collection. Work is distributed among the pool of resources so that applications obtain the software they need on demand. It also provides enormous storage and computation power, which helps us mine data. Hadoop is the software framework for writing applications that require fast parallel processing of large data on cluster nodes. It has a distributed file system and analyzes and transforms datasets with the help of the map-reduce paradigm. A crucial characteristic of Hadoop is that it partitions data and computation across nodes and executes them in parallel. A Hadoop cluster scales storage capacity, computation capacity and I/O bandwidth with the addition of simple commodity servers.
Eucalyptus is a cloud framework that uses the same API as Amazon Web Services. It offers the tools provided by Amazon with the additional advantage of being free and open source. It can be used as a private, public or hybrid cloud given adequate hardware. Eucalyptus Machine Images (EMIs), which are created by the user or downloaded as a bundle, run in Eucalyptus. An EMI may contain Windows, Linux or CentOS. Eucalyptus resides on the host operating system; it cannot run on Windows or Apple OS, as it depends on Linux OS components. Eucalyptus communicates with its various components to control the configuration and setup of the system. These components are configured with configuration files. Each component has distinct functionalities and responsibilities in forming the framework: handling the storage environment, dynamic creation of virtual instances, and user access control.
2 Related Work
In the 1990s the Web began to grow, and recently there has been significant improvement in network infrastructure and bandwidth. Cloud computing [1] can be described as a framework containing different modules that integrate the essential parts of enterprise computing, from storage to control, into a single point of service. It gives access to large pools of data storage, network bandwidth and system resources that can be distributed across a number of machines acting as one standalone machine.
In paper [2], P. Radha Krishna et al. discussed cloud computing technology and the services offered over it. The authors also explained how analytics services can be made affordable. The issues and challenges in cloud analytics are addressed as well. In cloud analytics two kinds of services are provided: Analytics as a Service (AaaS) and Model as a Service (MaaS).
In paper [3], Project Daytona, Roger S. Barga et al. proposed a new approach for data analytics as a cloud service. They addressed the limitations of spreadsheets and client applications, which do not provide scalable computation for analyzing enormous amounts of data. They therefore developed a cloud data analytics service based on Daytona, using an iterative MapReduce runtime for data analysis.
In paper [4], Towards Cloud-based Analytics as a Service for Big Data Analytics in the Cloud, Farhana Zulkernine et al. addressed several challenges in cloud computing, such as the availability of analytics software, resource estimation, job workflow, and managing data in the cloud. The authors proposed a conceptual architecture of cloud-based analytics as a service, a cloud-based AaaS platform.
In paper [5], Parallel Visualization on Large Clusters using MapReduce, Jonathan Bronson et al. used the capabilities of cloud computing for parallel data processing and for the visualization of large-scale data. In their work, they evaluate how suitable the MapReduce framework is for implementing visualization techniques. The authors implemented programs for visualization tasks using the MapReduce framework and evaluated the outcomes by applying these algorithms to datasets.
In paper [6], Shen Wang et al. presented the PARABLE (PArallel RAndom-partition Based hierarchicaL clustEring) algorithm using the MapReduce framework. The mapper and reducer functions are used to apply local hierarchical clustering on nodes, and a dendrogram alignment technique is used to merge the results, so the results scale through the parallel hierarchical clustering algorithm. Parallel algorithms and frameworks are very helpful for achieving scalability and also improve the performance of data analysis.
In paper [7], MapReduce for Data Intensive Scientific Analyses, Jaliya Ekanayake et al. used the MapReduce framework and applied it to the K-means clustering algorithm and to high-energy physics data analysis. The authors also present CGL-MapReduce and compared its results with Hadoop [8].
In paper [9], Cura is presented, a new cloud service model using the MapReduce technique for data analysis in the cloud. The authors raised issues about existing cloud services for MapReduce, which are inadequate for production workloads. To overcome this, Cura uses MapReduce to create cluster configurations for jobs and also delays some jobs, so that the cloud service can optimize resource allocations and thereby reduce cost.
In [10], Data Mining with MapReduce: Graph and Tensor Algorithms with Applications, Charalampos E. Tsourakakis proposed a novel algorithm called DOULION for counting triangles in a graph. The author used Hadoop, the open-source implementation of MapReduce, for the parallel implementation of the algorithm.
3 Proposed Work
3.1 System Design
Server-1 and Server-2 are the two servers in our framework. Server-1 acts as the manager for the user: distributed storage and clustering are controlled by Server-1, which clients contact to access the system. Provisioning of virtual machines is done by Server-2. Each virtual machine on Server-2 acts as a Hadoop master node or slave. Server-2 acts as a receiver, accepting client requests forwarded by Server-1. When a request is received by Server-2, the master node distributes tasks to the slave nodes and assembles the output from them. Tasks received by the slave nodes are processed and their results are returned to the master node. As soon as the master node receives the results, it consolidates the outputs from the slave nodes. Once all the results are collected by the master node, the results are sent to Server-1, through which the client gets the required output for the request.
Fig. 1. System Architecture
3.2 Methodology
Map Reduce
The output of the modified Apriori algorithm can be seen through MapReduce. The Hadoop Distributed File System (HDFS) holds the data, and HDFS honors record boundaries in order to carry out the processing correctly. Initially, the data is divided into several splits and sent to the mappers. A mapper emits its output as key-value pairs. The output of the mappers is sent to a combiner, which aggregates the counts for a particular key from the different mappers. This output is then sent to the reducer, which finishes the counting. If the counted support of an itemset exceeds the threshold, it is written to the output; otherwise it is discarded. This process represents the generation of frequent itemset-1.
Frequent itemset-1 feeds the mappers that compose candidate itemset-2, where itemset-2 contains patterns with an associated count. The count is accumulated and assigned to each itemset by the reducer, subject to the minimum count. The procedure stops when no candidate sets are generated or the previous iteration produced no frequent sets. Generation of association rules is done after the frequent itemsets are found; for generating association rules, the itemset counts are required for the confidence check. Items within a frequent itemset are tab-separated in the output file, and when a frequent itemset in the output folder reaches the 100% confidence level, it produces a rule.
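The pass just described is, in effect, a distributed count followed by a threshold filter. Below is a minimal Java sketch of that first pass against the Hadoop MapReduce API, assuming one transaction per line of input; the class names and the apriori.min.supp parameter are illustrative assumptions rather than the authors' code. Note that the combiner only pre-sums partial counts and must not apply the support threshold, because it never sees the global count for a key.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Pass 1: mappers emit (item, 1), a combiner pre-sums counts per split,
// and the reducer applies the minimum support threshold.
public class FrequentItemsetPass1 {

    public static class ItemMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text item = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // One transaction per line, items separated by whitespace.
            StringTokenizer tok = new StringTokenizer(line.toString());
            while (tok.hasMoreTokens()) {
                item.set(tok.nextToken());
                ctx.write(item, ONE); // emit (item, 1)
            }
        }
    }

    // Sums partial counts only; applying the threshold here would be wrong.
    public static class SumCombiner
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text item, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(item, new IntWritable(sum));
        }
    }

    public static class SupportReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text item, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            // "apriori.min.supp" is an assumed job parameter for the threshold.
            if (sum >= ctx.getConfiguration().getInt("apriori.min.supp", 1)) {
                ctx.write(item, new IntWritable(sum)); // keep frequent itemset-1
            }
        }
    }
}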
Fig. 2. Diagram for MapReduce
Data set splitting
The technique called "input format" divides the data into splits, which are then provided to the mappers. Initially, the file input format divides the data into blocks with a capacity of 64 MB with the help of the "JobConf" object and submits them to the mappers. AES provides security for the data passing between the mappers and reducers, which execute different tasks at a time. If the document holds massive data, those tasks are given to different nodes in the cluster, and the data resists swapping from one node to another.
In our configuration the number of logical splits takes 5 as its value. While dividing the data, a record must not be broken between two different splits. If that does happen, the dividing process resumes from the point where the break took place, keeping a pointer to the remainder. An overview of the map jobs is given by the input format: each split appears as a single map task. Moreover, since the data file exists physically on particular nodes, tasks are allotted to the nodes that hold the data. A sketch of this configuration follows.
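As a sketch of how this splitting is expressed in code, the fragment below sets the input path and caps the split size at 64 MB using the newer org.apache.hadoop.mapreduce API (the paper names the older JobConf interface, which carries equivalent settings); the class and job names are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Configures how the input file is carved into splits for the mappers.
public class SplitSetup {
    public static Job configure(Configuration conf, String inputDir) throws Exception {
        Job job = Job.getInstance(conf, "apriori-split-demo");
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(inputDir));
        // Splits default to the HDFS block size (64 MB in older Hadoop);
        // records are never cut in half: a line spanning a split boundary
        // is read by the split holding its first byte, and the next split
        // skips ahead to the first full line.
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
        return job;
    }
}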
Apriori Algorithm using Map-Reduce
Initially the split is given to the mapper. The mapper reads a single line at a time, parses it, and derives a key for each item. Each key is emitted with a value of 1 in order to get frequent itemset-1, and the combiner takes these key-value pairs. The intermediate file created by the mapper is placed in HDFS for the combiner. The combiner merges its output with the mappers' output and then passes it to the reducer. Finally the reducer accumulates the values of the keys, and the output is recorded in the output file.
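A later pass builds on this one. The sketch below, again an illustrative assumption rather than the authors' code, shows a mapper for candidate 2-itemsets: it keeps only items that survived pass 1 (the Apriori property) and emits each surviving pair with a count of 1. The frequent-item list is hard-coded where a real job would load the pass-1 output, for example from the distributed cache.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Pass 2: emit candidate 2-itemsets built only from pass-1 survivors.
public class CandidatePairMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Set<String> frequentItems = new HashSet<>();

    @Override
    protected void setup(Context ctx) {
        // Stand-in for loading the pass-1 output (e.g. distributed cache).
        frequentItems.add("a");
        frequentItems.add("b");
        frequentItems.add("c");
    }

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        // Keep only items that are frequent on their own (Apriori property).
        List<String> items = new ArrayList<>();
        for (String s : line.toString().split("\\s+")) {
            if (frequentItems.contains(s)) items.add(s);
        }
        Collections.sort(items); // canonical order, one key per pair
        for (int i = 0; i < items.size(); i++) {
            for (int j = i + 1; j < items.size(); j++) {
                ctx.write(new Text(items.get(i) + "," + items.get(j)), ONE);
            }
        }
    }
}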
HDFS
Massive data is divided into several parts using HDFS, and these parts belong to different nodes in the cluster. In case any node fails, the data on the failed node is served from another node holding a replica, i.e., through replication across the framework. Node failures are common in Hadoop; in that case, the data of the failed node is redistributed to the other nodes. Each input record is moved into Hadoop from the nearest node, and each output file in Hadoop is written back to the adjacent node.
Generating association rules
After successful execution of the iterations, the output folder holds the count values associated with the frequent itemsets. The outputs of two consecutive iterations are read at once. The confidence of an itemset is calculated from the count values of successive frequent itemsets. If the confidence value of an itemset is 100%, then the rule is considered an interesting rule. For instance, if the supp_count of {a, b, c} in iteration 2 is 50 and the supp_count of {b, c} in iteration 1 is 50, the rule is {b, c} => {a}, since supp_count > the threshold min_supp and the confidence is 50/50 = 100%.
The items of a frequent itemset are separated by a comma. The support count is calculated for each frequent itemset, and the support counts of its subsets are obtained from previous iterations. Then the confidence value for a particular rule is calculated. If the confidence is 100%, that is an interesting rule, and it is given more priority in later iterations than earlier ones.
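The rule check reduces to one division over the counts of two consecutive iterations. The short, self-contained Java illustration below reproduces the paper's {b, c} => {a} example; the in-memory maps stand in for the iteration output files.

import java.util.HashMap;
import java.util.Map;

// Confidence of a rule X => Y is supp(X union Y) / supp(X).
public class RuleConfidence {
    public static void main(String[] args) {
        Map<String, Integer> supp = new HashMap<>();
        supp.put("b,c", 50);   // support count from iteration 1 output
        supp.put("a,b,c", 50); // support count from iteration 2 output

        double conf = 100.0 * supp.get("a,b,c") / supp.get("b,c");
        if (conf == 100.0) {
            // Matches the example: {b,c} => {a} with confidence 100%.
            System.out.println("{b,c} => {a}  confidence = " + conf + "%");
        }
    }
}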
Implementation
To check the performance of the proposed Apriori data mining technique, it is deployed in a private cloud built with Eucalyptus software. The client holds the data file and connects to two servers running Ubuntu 12.04, connected through gigabit Ethernet; the OS package manager is used to install Eucalyptus.
The Storage Controller, Cluster Controller, Cloud Controller and Walrus run on Server-1. The Node Controller (NC) is installed on Server-2, and all virtual machine instances run on Server-2. In this work, four virtual instances run on the node controller. Hadoop is installed and configured on each virtual machine: one virtual machine acts as the master node and the other three as slaves.
Eucalyptus Setup
Eucalyptus setup is done on the two servers.
1. Add the Eucalyptus repository by creating a list file under /etc/apt/sources.list.d with the repository line for http://downloads.eucalyptus.com/software/eucalyptus/3.2/ubuntu precise main.
2. Add the Eucalyptus tools repository similarly, pointing at http://downloads.eucalyptus.com/software/tools/2.2/ubuntu precise main.
3. Run the update command (apt-get update) on all machines.
4. On Server-1, install the cloud controller, cluster controller, storage controller and Walrus: apt-get install eucalyptus-cloud eucalyptus-cc eucalyptus-sc eucalyptus-walrus.
5. On Server-2, install the node controller: apt-get install eucalyptus-nc.
Network Configuration
The /etc/eucalyptus/eucalyptus.conf file has options for the network section. These include setting the virtual network (VNET) mode the servers run in. Update the mode to VNET_MODE="MANAGED", the mode that manages VLAN-tagged virtual networks.
Controller Configuration
The responsibility of the controller is to communicate with the hypervisor in Eucalyptus through virsh, the libvirt command-line binary on Linux. The network interface card is bridged to join the virtual network and the physical interface; add these lines to /etc/network/interfaces:

auto eth0
iface eth0 inet dhcp

auto lo
iface lo inet loopback

auto br0
iface br0 inet dhcp
    bridge_ports eth0
Restart the network daemon with the command '/etc/init.d/networking restart' to apply the network changes.
Configuration of Cloud Manager
To configure the cloud manager, the eucalyptus.conf file is edited to make the following changes:

DISABLE_DNS="N"
SCHEDPOLICY="ROUNDROBIN"
VNET_DHCPDAEMON="/usr/sbin/dhcpd3"
VNET_DHCPUSER="dhcpd"
VNET_SUBNET="10.10.0.0"
VNET_NETMASK="255.255.0.0"
VNET_ADDRSPERNET="32"
VNET_PUBLICIPS="120.239.48.193-120.239.48.22"

Then restart the service to reflect the changes.
Obtain the credentials
After the boot process completes, the Cloud Controller is running and users need to obtain their credentials. This can be done either through a web browser or at the command line.
1. Use 'admin' as both username and password for the first login.
2. Follow the on-screen instructions to change the password.
3. To manage credentials per user, click on 'credentials', located in the top left corner.
4. The credentials archive can be downloaded by clicking on the tab.
5. Save it in ~/eucalyptus and extract it using 'unzip -d ~/eucalyptus credentials.zip'.
Instances Running
1. Create a secure key pair through ssh for logging in to the instance as root.
2. Open port number 22 for the instance.
3. Access to the registered image instance is granted through:
   euca-authorize -P tcp -p 22 -s 1.1.1.1/1 default
4. Exit the ssh session.
Hadoop Configuration
1. Download Hadoop 2.2.1 and copy it to all nodes in the cluster.
2. Create an hduser account under the Hadoop installation on each node; this user will run Hadoop.
3. Extract the Hadoop archive and store it in /home/hduser/Downloads/.
4. Make the following changes on all nodes:
4.1 Open conf/core-site.xml and add the following property:
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:01345</value>
</property>
4.2 Open conf/hdfs-site.xml and add:
<property>
  <name>dfs.replication</name>
  <value>4</value>
</property>
4.3 Open conf/mapred-site.xml and add:
<property>
  <name>mapred.job.tracker</name>
  <value>master:11345</value>
</property>
5. Changes on the master node:
5.1 Open conf/masters and add the master host.
5.2 Open conf/slaves and list the nodes:
master
slave1
slave2
slave3
6. Apply the corresponding changes on the slaves.
7. Propagate the changes so they take effect on all nodes.
Programming the Map- Reduce Functions
1. Use the command 'hadoop namenode -format' to format the name node.
2. Use 'start-all.sh' on the master node to start all daemons.
3. Create the input directory using 'hadoop dfs -mkdir <directory-name>'.
4. Copy the data file into this directory.
5. Build a jar file containing all the classes to run the Apriori algorithm on the input dataset.
6. The output is stored in the Hadoop file system and read with 'hadoop dfs -cat /user/hadoop-user/output/part1'.
7. To stop, run the 'stop-all.sh' script.
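Steps 5 and 6 amount to packaging a driver class into the jar and reading the job output afterwards. A minimal driver consistent with the earlier sketches might look as follows; the class names and the apriori.min.supp parameter remain illustrative assumptions, not the authors' code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver for the first Apriori pass; packaged into the jar from step 5.
public class AprioriDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("apriori.min.supp", 2); // assumed threshold parameter

        Job job = Job.getInstance(conf, "apriori-pass1");
        job.setJarByClass(AprioriDriver.class);
        job.setMapperClass(FrequentItemsetPass1.ItemMapper.class);
        job.setCombinerClass(FrequentItemsetPass1.SumCombiner.class);
        job.setReducerClass(FrequentItemsetPass1.SupportReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir from step 3
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir read in step 6
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}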
4 Results and Analysis
The performance on different sets of data is evaluated in the Hadoop environment by changing the number of nodes (one, two and four nodes) and varying the minimum support percentage. The quality of service measured includes the time required for generating frequent itemset-2, itemset-3 and the association rules. The following describes how the performance of the proposed algorithm is evaluated on the T10I4D100K, retail and mushrooms datasets.
On Single Node

Table 1. Execution Time (ms) on Single Node

Dataset       Minimum Support Percentage
              0.86      0.74      0.64      0.4
T10I4D100K    81046     111628    113684    155074
RETAIL        64733     57810     70136     72725
MUSHROOMS     118067    130825    172761    101212

Fig. 3. Execution Time (ms) on Single Node
On Two Nodes

Table 2. Execution Time (ms) on Two Nodes

Dataset       Minimum Support Percentage
              0.95      0.85      0.75      0.5
T10I4D100K    63622     92106     98028     112075
RETAIL        58841     62014     64222     77715
MUSHROOMS     96674     102863    101411    126213

Fig. 4. Execution Time (ms) on Two Nodes
On Four Nodes

Table 3. Execution Time (ms) on Four Nodes

Dataset       Minimum Support Percentage
              0.96      0.74      0.65      0.4
T10I4D100K    63611     71822     78811     90129
RETAIL        40112     42725     48262     53128
MUSHROOMS     71132     86363     97323     10184
Fig. 5. Execution Time (ms) on Four Nodes
The execution time on the three datasets T10I4D100K, retail and mushroom after applying the Apriori algorithm is shown in the graphs above. The Apriori algorithm was executed with varying minimum support percentages, and the results demonstrate that the execution time increases as the minimum support percentage decreases.
Figs. 3, 4 and 5 show the execution time decreasing as the Apriori algorithm is run on more nodes in the Hadoop cluster. The Hadoop cluster shows better execution times across datasets of different sizes; by increasing the number of nodes in the Hadoop environment, better performance in terms of execution time can be achieved.
5 Conclusion
Hadoop is able to support all sorts of server environments. In this paper, Hadoop MapReduce was implemented on Eucalyptus. The VMs running in Eucalyptus provide enough resources for users, up to the physical machine limit of the host. With the help of instances in Hadoop clusters, frequent items can be mined efficiently. The data is distributed in the Hadoop cluster environment by splitting the entire dataset into small parts, and failures at storage nodes in the cluster are handled by Hadoop itself. Small datasets consistently show better performance compared with large datasets. The proposed algorithm extracts the data items that satisfy the minimum support by using association rules.
References
1. R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proc. 1994 Int. Conf. Very Large Data Bases, pages 487-499, Santiago, Chile, September 1994.
2. Huy T. Vo, Jonathan Bronson, Brian Summa, João L. D. Comba, Juliana Freire, "Parallel Visualization on Large Clusters using MapReduce".
3. Shen Wang, Haimonti Dutta, "PARABLE: A PArallel RAndom-partition Based HierarchicaL ClustEring Algorithm for the MapReduce Framework," Center for Computational Learning Systems (CCLS), Columbia University.
4. Jaliya Ekanayake, Shrideep Pallickara, Geoffrey Fox, "MapReduce for Data Intensive Scientific Analyses".
5. Balaji Palanisamy, Aameek Singh, Ling Liu, Bryan Langston, "Cura: A Cost-optimized Model for MapReduce in a Cloud," 2013 IEEE 27th International Symposium on Parallel & Distributed Processing.
6. Charalampos E. Tsourakakis, "Data Mining with MAPREDUCE: Graph and Tensor Algorithms with Applications," March 2010.
7. Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, Michael Stonebraker, "A Comparison of Approaches to Large-Scale Data Analysis," SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pages 165-178.
8. J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," CACM, 51(1):107-113, 2008.
9. Eucalyptus Systems, Inc., Eucalyptus 3.3.1, Eucalyptus Administration Guide (2.0), 2010.
10. Eucalyptus Systems, Inc., Eucalyptus Administration Guide (2.0), 2010.
11. Ananthanarayanan R., Gupta K., Pandey P., Himabindu P., Sarkar P., Shah M., and Tewari R., "Cloud analytics: Do we really need to reinvent the storage stack?".
12. Armbrust M., Fox A., "A view of cloud computing," Communications of the ACM, Volume 53, Issue 4, April 2010, pages 50-58.
13. Amazon. Amazon Simple Storage Service (S3). http://aws.amazon.com/s3, 2011. Retrieved 2011-02-15.
14. J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in OSDI'04: Sixth Symposium on Operating System Design and Implementation. Google Inc., 2004.
15. J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," in Proc. of the 6th USENIX Symp. on Operating Systems Design & Implementation (OSDI), 2004.
16. P. Radha Krishna, Kishore Indukuri Varma, "Cloud Analytics: A Path towards next generation affordable BI," Infosys.
17. Roger S. Barga, Jaliya Ekanayake, Wei Lu, "Project Daytona: Data Analytics as a Cloud Service," 2012 IEEE 28th International Conference on Data Engineering.
18. Farhana Zulkernine, Michael Bauer, Ashraf Aboulnaga, "Towards Cloud-based Analytics as a Service for Big Data Analytics in the Cloud," 2013 IEEE International Congress on Big Data.
19. UCI repository.
20. A. T. Velte, T. J. Velte, and R. Elsenpeter, Cloud Computing - A Practical Approach. The McGraw-Hill Companies, 2010.
21. T. Mather, S. Kumaraswamy, and S. Latif, Cloud Security and Privacy. O'Reilly, 2009.
22. Eucalyptus Systems, Inc., Eucalyptus Administration Guide (3.1), 2013.
23. http://miles.cnuce.cnr.it/~palmeri/datam/DCI/datasets.php
24. http://en.wikipedia.org/
25. http://hadoop.apache.org/
26. en.wikipedia.org/wiki/MapReduce/
27. http://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial/ht_overview.html
28. http://en.wikipedia.org/wiki/SOAP
29. http://soapclient.com/soapsecurity.html
30. https://en.wikipedia.org/wiki/Big_data