
Performance of Scalable Off-The-Shelf Hardware for Data-intensive Parallel Processing using MapReduce

Ahmad Firdaus Ahmad Fadzil, Noor Elaiza Abdul Khalid, Mazani Manaf Faculty of Computer and Mathematical Sciences

Universiti Teknologi MARA (UiTM) Shah Alam, Malaysia

[email protected], [email protected], [email protected]

Abstract—Processing large volumes of data and information requires high processing power that usually involves supercomputers, which are costly. The MapReduce parallel framework introduces an automated way of distributing these large processes across many computers. This paper proposes a preliminary study on scalability using MapReduce as an automated parallel processing framework running on low-cost off-the-shelf hardware. The system architecture is built from a collection of off-the-shelf hardware, and the scalability test is conducted by adding off-the-shelf machines to the architecture one at a time. The MapReduce tool is used as a parallel framework to automatically distribute tasks according to the available resources. Performance is evaluated based on the improvement in speedup. It is found that MapReduce is able to accommodate the scalability of off-the-shelf hardware resources by automatically distributing tasks regardless of the number of machines added to the architecture.

Keywords-MapReduce; Parallel processing; Off-the-shelf hardware; scalability

I. INTRODUCTION

The evolution of technology is responsible for generating tremendous amounts of data and information. Internet powerhouses such as Google [1] and Facebook [2] deal with terabytes of data on a daily basis. Processing and handling data and information at this magnitude requires massive computation resources, including high-end mainframes and supercomputers. These facilities are very costly and not readily available [11,12]. Present personal computers (PCs) are equipped with considerable computing resources; however, these resources often go untapped. Computers nowadays usually belong to networks such as a LAN (Local Area Network) or WAN (Wide Area Network), hence the idea of sharing computational resources among interconnected computing nodes was born [7]. Distributed computing is a form of parallel processing that communicates through such networks, with the computing nodes cooperating to complete a common goal [12].

Since most computers are usually underutilized, these resources could be virtually clustered to simulate a supercomputer. The possibility of using these resources to process vast amounts of data and information is intriguing [1]. MapReduce is a parallel processing framework introduced in 2004 by Google to provide an efficient solution for handling massive data and information by employing collections of off-the-shelf hardware [1]. The framework has gained huge popularity due to its ability to hide the complexity of parallelization [3], thus providing an automated parallelization facility to the user [1]. The enthusiasm towards MapReduce has led to the development of various MapReduce frameworks such as Hadoop [27], Qizmt [15,28], and Dryad [4]. Research areas such as data mining [5,23], data warehousing [2,24,25], and bioinformatics [6,20,21,22] have acknowledged the capability of this framework.

The purpose of this paper is to perform a preliminary study of scalability in a parallel processing architecture. Resource handling and task management are automated by employing the MapReduce parallel framework. The hardware architecture is based on a cluster of off-the-shelf hardware to ensure a low-cost implementation.

This paper is organized as follows. Section 2 discusses large data parallel processing using the MapReduce framework, addressing crucial factors such as parallelization complexity and scalability. The methodology employed and the results of the research are discussed in Sections 3 and 4 respectively. Finally, the last section concludes the overall findings and proposes future work for further research.

II. DATA INTENSIVE PARALLEL PROCESSING USING MAPREDUCE

Demand for processing power keeps increasing to support data intensive applications in the current computing era. Distributed computing is a form of parallel processing that is employed as one of the methods to process large data and information [7]. While distributed computing offers huge possibilities and potential due to the use of off-the-shelf hardware, the challenge of effectively utilizing the resources in a distributed environment [12] limits its adoption to big companies and organizations.

Various frameworks to support distributed parallel processing have emerged and been made publicly accessible, notably MPI (Message Passing Interface) [11], but the obstacles of complex parallelization and scalability remain as challenging as ever, especially when dealing with off-the-shelf hardware [12]. Although off-the-shelf hardware allows a cost-effective approach to parallel processing, each machine needs an efficient way to communicate with the others in order to ensure effective resource management [1].

Google opted for parallel processing to handle its behemoth services and operations [13] and encountered a similar resource utilization problem. The problem arises because parallelization in a distributed environment requires the user to manually handle and manage distributed resources. MapReduce was introduced to address this issue [1] and has since become a revelation for distributed computing [14]. This section describes how the MapReduce framework addresses the issues of complex parallelization and scalability, thereby providing the user with an automated parallelization solution.

A. Complex Parallelization Details

The amount of effort required to handle distributed computing resources is very demanding and requires a thorough understanding of how a particular parallelism can be achieved [1]. Load imbalance and ineffective data handling are problems that occur if this matter is not thoroughly addressed [13]. Researchers at Google addressed this problem by creating a whole new method of parallelizing a problem: the MapReduce framework.

The MapReduce framework hides complex parallelization details from the user by providing a functional abstraction for programming and a file system to manage the distributed hardware [1]. The functional abstraction, known as the map and reduce functions, creates a programming model that is expressible on many real world tasks [1]. This argument is supported by the fact that MapReduce has been employed in many areas, such as data mining [5,23], data processing [1,15,18], data warehousing [2,24,25], and bioinformatics [6,20,21,22].
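To make the abstraction concrete, the sketch below is a minimal single-machine model of the map, shuffle, and reduce flow in plain Python (an illustration only, not Hadoop's actual API). It is applied here to building an inverted index, another task expressible in this model:

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy single-machine model of the MapReduce flow: the user supplies
    only map_fn and reduce_fn; everything else stands in for the framework."""
    # Map phase: every input record yields zero or more (key, value) pairs
    pairs = [kv for record in records for kv in map_fn(record)]
    # Shuffle phase: group all values that share the same key
    pairs.sort(key=itemgetter(0))
    grouped = ((k, [v for _, v in g]) for k, g in groupby(pairs, key=itemgetter(0)))
    # Reduce phase: collapse each key's value list into a final result
    return {k: reduce_fn(k, values) for k, values in grouped}

# Example task: an inverted index mapping each word to the documents containing it
docs = [("d1", "map and reduce"), ("d2", "reduce the data")]
index = run_mapreduce(
    docs,
    map_fn=lambda rec: [(word, rec[0]) for word in rec[1].split()],
    reduce_fn=lambda word, doc_ids: sorted(set(doc_ids)),
)
# index["reduce"] -> ['d1', 'd2']
```

The user never touches the grouping or distribution logic, which is exactly the complexity the framework hides.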

A distributed file system (DFS) is a method developed to counter the resource management problem in distributed environments [8]. A DFS allows effective resource handling by combining distributed storage resources into one centralized file system, creating transparent resource management that allows simpler file manipulation. Google's DFS was initially built independently, even before the introduction of the MapReduce architecture, but has since become a crucial component of the MapReduce framework. Open source MapReduce frameworks such as Hadoop [27], Qizmt [28], and Dryad [4] acknowledge the importance of the DFS, each employing one to complement their respective MapReduce implementations.

B. Scalability Issue

Scalability issues inevitably arise in distributed computing because such systems need to grow in order to support future processing requirements [2]. The use of a distributed environment, especially one involving off-the-shelf hardware, necessitates good resource management to ensure the reliability of a parallel framework. This form of support is rare in parallel frameworks and often requires users to manually handle the extension of their parallel processing system, resulting in load imbalance and unreliable resource management. Previously created distributed parallel frameworks such as MPI, for example, have very poor fault-tolerance mechanisms. Research suggests that MPI processes are not reliable and that keeping track of failures within a large-scale application is a feat compared to MapReduce [17].

The scalability of a system is divided into structural scalability and load scalability [16]. A system with good structural scalability can expand without a great deal of change to its structure, while a system with good load scalability can "perform gracefully as the offered traffic increased" [16]. The MapReduce framework supports both. The Distributed File System (DFS) that serves as a main feature of MapReduce provides the structural scalability of the framework, allowing off-the-shelf hardware such as IDE disks to be connected in a cluster to form a large virtual storage for the framework [1,8]. Implementations of MapReduce usually involve gigabytes to petabytes of data [1,2,5], further demonstrating the framework's structural scalability.

Meanwhile, load scalability in this context is represented by the ability of the MapReduce framework to improve performance as the number of nodes increases. This is supported by research implementing the framework, which found that adding nodes further increases its performance [5,6,13]. The MapReduce framework has been tested running on hundreds to thousands of nodes, but the performance improvement when nodes are added one at a time requires further investigation.

III. METHODOLOGY

The methodology of this paper consists of the course of action in building the MapReduce parallel processing system. The procedure is divided into three parts: hardware design, software setup, and algorithm application.

[Figure 1. Methodology Overview: the MapReduce parallel system is built in three stages: Hardware Design (building the parallel architecture using off-the-shelf hardware), Software Setup (software installation to enable the MapReduce parallel framework), and Algorithm Application (implementing the test program, a word occurrence counter, in the MapReduce environment).]


Figure 1 depicts the methodology overview, presenting the overall procedure for building the MapReduce parallel system. Hardware design covers the blueprint of the parallel system, including the overall layout of all the hardware used. The focus on adopting off-the-shelf hardware serves as the building block of a cost-effective parallel system.

Software setup, on the other hand, covers the installation of the software required to enable the MapReduce parallel framework. Finally, algorithm application covers the implementation of the test program, a word occurrence counter run on large text data, in order to measure the parallel system's performance.

A. Hardware Design

The employment of off-the-shelf hardware is essential to this research, as it allows large data processing to be done inexpensively. This section discusses the hardware design, covering the process of building the parallel architecture from scratch.

The hardware design depicted in figure 2 explains how the hardware is clustered to form a LAN (Local Area Network) architecture. The suggested network topology is a star topology, in which a hub or router acts as the liaison between the computing nodes. Fixed IP addresses are reserved according to each machine's MAC (Media Access Control) address using DHCP (Dynamic Host Configuration Protocol) reservation to enable more effective communication among nodes.

Gigabit Ethernet is utilized over standard High-Speed Ethernet in order to reduce the communication overhead between the nodes. Its speed of 1000 Mbps (megabits per second), or 1 Gbps (gigabit per second), allows a more efficient flow of data compared to the latter's 10/100 Mbps. This is especially essential for a parallel system to ensure less data transfer overhead, given the amount of data interchanged between connected nodes. In this research, four (4) off-the-shelf nodes interconnected via this network are used.
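To see why the faster link matters, a back-of-envelope calculation of idealized transfer times can be sketched as below; the model ignores protocol overhead, latency, and contention, and the 2.31 GB figure is simply the largest test file used later in this paper:

```python
def transfer_seconds(size_gb, link_mbps):
    """Idealized transfer time: file size in bits divided by the link rate.
    Ignores protocol overhead, latency, and contention."""
    bits = size_gb * 8 * 1000**3          # decimal GB, matching decimal link rates
    return bits / (link_mbps * 1000**2)

# Moving the largest test file (2.31 GB) between nodes:
t_gigabit = transfer_seconds(2.31, 1000)  # Gigabit Ethernet: ~18.5 s
t_fast    = transfer_seconds(2.31, 100)   # 100 Mbps Ethernet: ~184.8 s
```

Even in this idealized model, the tenfold difference in link rate translates directly into a tenfold difference in transfer time, which would dwarf the execution times reported in Section IV.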

TABLE I. INDIVIDUAL NODE SPECIFICATION

Component    Specification
Processor    Intel® Core™2 Duo E7500 (3M Cache, 2.93 GHz, 1066 MHz FSB)
Memory       4 GB DDR2-533 RAM
HDD          500 GB SATA
Network      1000 Mbps / 1 Gbps Ethernet adapter

Table I shows the hardware specification of the computing nodes used to build the parallel system. Hardware of this level of specification can be found at most computer stores at an affordable price. The next section discusses the software setup that complements the hardware architecture.

B. Software Setup

A parallel architecture is not complete without the software that enables parallelism between the nodes. This section explains the software installation flow that complements the parallel architecture.

Figure 3 shows the software installation flow for enabling MapReduce parallelism on the architecture. The MapReduce framework, as stated in the earlier section, hides the complexity of parallelism and thus provides a straightforward parallelization effort. An online tutorial written by Michael G. Noll [9] is referred to in order to accomplish this task.

Hadoop is renowned as one of the most popular implementations of a parallel MapReduce system [3]. Prior to installing the Hadoop software, there are dependencies to be considered. Hadoop uses SSH to communicate between nodes, making Unix-based operating systems such as Ubuntu suitable platforms on which to deploy the system. Hadoop's use of the Java programming language necessitates the installation of the Java SDK in order to compile and execute Hadoop MapReduce programs.

The final phase of the software setup involves setting up the Hadoop MapReduce parallel framework. The latest Hadoop release is available from its official website [27]. The parallel framework is then configured to enable MapReduce components such as the DFS and MapReduce tasks. The configuration is changed a number of times, as evaluation needs to be done on one, two, three, and four machines in order to determine the parallel processing system's scalability.

[Figure 2. Hardware Design: four computing nodes (PC 1 to PC 4) connected in a star topology to a Gigabit Ethernet (1 Gbps) router via UTP CAT-6 (Gigabit Ethernet enabled) cables.]

[Figure 3. Software Installation Flow: OS installation (Ubuntu), then installation of the dependencies (Java SDK and SSH server), then setup and configuration of Hadoop MapReduce.]

TABLE II. HADOOP MAPREDUCE CONFIGURATION

# Nodes    # Map Tasks    # Reduce Tasks    DFS Size
1          2              2                 460 GB
2          4              4                 920 GB
3          6              6                 1380 GB
4          8              8                 1840 GB

The Hadoop MapReduce configuration is shown in table II above. Even at this early configuration phase, it is already apparent that the MapReduce parallel processing system is both structurally and load scalable. The combination of off-the-shelf hardware allows the storage to form almost two terabytes (2 TB) of storage resources. The number of map and reduce tasks available corresponds to the number of processor cores per node; since dual-core machines are used, each node provides two map and two reduce tasks. The next section discusses the algorithm application used to evaluate the system.
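The linear scaling rule behind Table II can be stated explicitly. The sketch below is a minimal illustration, assuming two cores per node (Table I) and the 460 GB per-node DFS contribution implied by Table II:

```python
CORES_PER_NODE = 2        # dual-core machines (Table I)
USABLE_DISK_GB = 460      # per-node DFS contribution implied by Table II

def cluster_config(n_nodes):
    """Reproduce the scaling rule behind Table II: slots and DFS size
    grow linearly with the number of nodes."""
    return {
        "map_tasks": n_nodes * CORES_PER_NODE,
        "reduce_tasks": n_nodes * CORES_PER_NODE,
        "dfs_size_gb": n_nodes * USABLE_DISK_GB,
    }

# cluster_config(4) -> {'map_tasks': 8, 'reduce_tasks': 8, 'dfs_size_gb': 1840}
```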

C. Algorithm Application

Algorithm application covers the implementation of a test program employed to measure the performance of the system. A word occurrence counter, or word count program, is chosen for this purpose.

The word occurrence counter is a program that calculates the number of times each word appears within a text file. Although counting the occurrences of words in a text appears fairly simple, the task becomes a feat when the text files to be processed amount to gigabytes (GB) or terabytes (TB) in size. This program is widely used as a preliminary benchmark to test the performance of a parallel MapReduce system [1].

Figure 4 explains how the word count program operates using MapReduce. The text file used as input data is copied into the DFS. Upon execution, the word count program invokes the map tasks to retrieve individual words and map each key based on the word itself. For example, the word "word" in (key, value) pair form is mapped as (word, 1). The integer "1" is used later in the reduce function.

The framework then shuffles and groups the values into their respective key groups. For example, the "word" key group contains a list of the integer "1"; depending on how many times the word occurs in the text file, the number of "1"s differs from one key group to another. The reduce function then simply counts the number of "1"s in each group and returns the result.
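The map, shuffle/group, and reduce steps just described can be sketched in plain Python; this is a single-machine illustration of the flow, not the Hadoop implementation used in the experiments:

```python
from collections import defaultdict

def word_count(text):
    # Map: emit a (word, 1) pair for every word in the input
    mapped = [(word, 1) for word in text.split()]

    # Shuffle and group: collect every "1" under its word's key
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Reduce: sum the list of 1s in each key group
    return {key: sum(values) for key, values in groups.items()}

# word_count("the quick fox jumps over the lazy fox")
# -> {'the': 2, 'quick': 1, 'fox': 2, 'jumps': 1, 'over': 1, 'lazy': 1}
```

In Hadoop, the map and reduce steps run as separate distributed tasks and the shuffle is performed by the framework, but the logic per key is the same.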

The execution time to complete the program is recorded in order to evaluate the performance of the MapReduce parallel system. The test program is executed using one to four nodes and text files ranging from 1.5 MB to 2.31 GB. The next section discusses the results obtained from this performance evaluation.

IV. RESULT

This section discusses the performance of the MapReduce parallel computing system using the test program mentioned earlier. The performance of the system is measured in terms of speedup [10,26].

TABLE III. WORD COUNT ALGORITHM EXECUTION TIME

# Nodes / Size    1.5 MB    12 MB    150 MB    590 MB    1.15 GB    2.31 GB
1                 32 s      37 s     72 s      220 s     348 s      642 s
2                 33 s      34 s     60 s      111 s     175 s      324 s
3                 34 s      36 s     52 s      87 s      137 s      233 s
4                 33 s      35 s     46 s      78 s      111 s      180 s

Table III shows the execution times for different text file sizes and numbers of nodes. Execution time remains roughly level across all node counts when processing the 1.5 MB and 12 MB files. This happens largely because these file sizes are insignificant: the time taken is essentially the time to initialize the MapReduce parallel computing system. The execution times show significant differences when processing larger text files.

[Figure 4. Word Count Algorithm using MapReduce: the text file is copied into the DFS; map tasks emit (key, value) pairs (k1,v1) ... (kn,vn), where key = "word" and value = 1 (int) gives (word, 1); the framework shuffles and groups the pairs by key into (key, list) pairs (k1,l1) ... (kn,ln), where list = a list of 1s; reduce tasks then sum each list.]


Figure 5. Word Count Execution Time of Different Sizes of File in Different Numbers of Nodes

Figure 5 depicts the word count execution time using different numbers of nodes. The time taken to complete the word count program increases as the size of the text file grows, following the customary trend whereby larger files require more processing. The figure also shows that increasing the number of nodes reduces the execution time: the time to execute the test program declines with each increment in the number of nodes. To measure how much faster the parallel execution is compared to its sequential counterpart, equation (1) [10,26] below is used to calculate the speedup.

Speedup = T(1) / T(N)                                (1)

where T(1) is the execution time on a single node (sequential execution) and T(N) is the execution time using N nodes. This equation (1) allows calculation of the number of times parallel processing is faster compared to the sequential execution.
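As a quick check, applying equation (1) to the 2.31 GB column of Table III reproduces the speedup figures discussed below:

```python
def speedup(t_sequential, t_parallel):
    """Equation (1): how many times faster the parallel execution is."""
    return t_sequential / t_parallel

# Execution times for the 2.31 GB file from Table III (1 to 4 nodes)
times = {1: 642, 2: 324, 3: 233, 4: 180}
speedups = {n: round(speedup(times[1], t), 2) for n, t in times.items()}
# -> {1: 1.0, 2: 1.98, 3: 2.76, 4: 3.57}
```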

Figure 6. Speedup in Execution Time using Different Numbers of Nodes and File Sizes

Figure 6 shows the number of times parallel execution is faster than single-node execution. The two-node execution is able to achieve roughly twice the performance, essentially cutting the time needed to complete the operation in half.

Using three-node and four-node execution, the maximum speedup reaches 2.76 and 3.57 times faster performance respectively. The figure also indicates that larger text files are needed in order to attain the maximum speedup. The speedup trend improves each time additional nodes are introduced, supporting the argument that the MapReduce parallel framework offers good scalability.

V. CONCLUSION AND RECOMMENDATION

It is shown that the MapReduce parallel processing system provides good load scalability, with every addition of nodes allowing further performance improvement. The role of the DFS also provides MapReduce with structural scalability by allowing large amounts of data and information to be stored in the system. These factors are especially crucial when a system needs to grow in order to keep up with rapidly increasing data and information.

The performance of the MapReduce parallel computing system allows large data and information processing to be done cheaply. The employment of off-the-shelf hardware means that even the smallest of organizations can own a very capable parallel computing system of their own. The fact that the functional abstraction of MapReduce has proven applicable to many real world applications opens up a tremendous number of possibilities. Many other research areas crave the capability to process large data, especially areas involving parallel-natured algorithms such as emergent and evolutionary algorithms.

ACKNOWLEDGMENT

The authors acknowledge with gratitude the Research Management Institute (RMI), UiTM, and the financial support from the grant Inoculation of Natural Clustering Algorithm Into Parallel Platform In Determining Nosologic Abnormalities In Medical Images, grant no. 600-RMI/ERGS 5/3 (6/2011), from the Ministry of Higher Education (MOHE), Malaysia.

REFERENCES

[1] Dean, J., & Ghemawat, S. (2004). MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004.
[2] Thusoo, A., Sarma, J. S., Jain, N., Shao, Z., Chakka, P., Zhang, N., et al. (2010). Hive – A Petabyte Scale Data Warehouse Using Hadoop. ICDE Conference 2010, 996-1005.
[3] Lee, K.-H., Lee, Y.-J., Choi, H., Chung, Y. D., & Moon, B. (2011). Parallel Data Processing with MapReduce: A Survey. SIGMOD Record, December 2011 (Vol. 40, No. 4), 11-20.
[4] Isard, M., Budiu, M., Yu, Y., Birrell, A., & Fetterly, D. (2007). Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. European Conference on Computer Systems (EuroSys), Lisbon, Portugal, March 21-23, 2007.
[5] Papadimitriou, S., & Sun, J. (2008). DisCo: Distributed Co-clustering with Map-Reduce. 2008 Eighth IEEE International Conference on Data Mining, 512-521.
[6] Menon, R. K., Bhat, G. P., & Schatz, M. C. (2011). Rapid Parallel Genome Indexing with MapReduce. MapReduce'11, June 8, 2011, San Jose, California, USA, 51-58.
[7] Bisciglia, C., Kimball, A., & Michels-Slettvet, S. (2007). Lecture 1: Introduction to Distributed Computing & Systems Background. Retrieved from Distributed Computing Seminar: http://code.google.com/edu/submissions/mapreduce-minilecture/lec1-intro.ppt
[8] Ghemawat, S., Gobioff, H., & Leung, S.-T. (2003). The Google File System. 19th ACM Symposium on Operating Systems Principles, 29-43.
[9] Noll, M. G. (2012, June 14). Running Hadoop On Ubuntu Linux (Multi-Node Cluster). Retrieved from http://www.michael-noll.com/
[10] Khalid, N. E. A., Ahmad, S. A., Noor, N. M., Fadzil, A. F. A., & Taib, M. N. (2011). Parallel Approach of Sobel Edge Detector on Multicore Platform. International Journal of Computers and Communications, Issue 4, Volume 5, 2011, 236-244.
[11] Luo, W.-L., Xie, A.-d., & Ruan, W. (2010). The Construction and Test for a Small Beowulf Parallel Computing System. Third International Symposium on Intelligent Information Technology and Security Informatics, 767-770.
[12] Godfrey, B. (2006). A Primer on Distributed Computing. Retrieved from http://www.bacchae.co.uk/docs/dist.html
[13] Barroso, L. A., Dean, J., & Hölzle, U. (2003). Web Search for a Planet: The Google Cluster Architecture.
[14] Yoo, D., & Sim, K. M. (2011). A Comparative Review of Job Scheduling for MapReduce. Proceedings of IEEE CCIS 2011, 353-358.
[15] Jin, Y., Hu, M., Singh, H., Rule, D., Berlyant, M., & Xie, Z. (2010). MySpace Video Recommendation with Map-Reduce on Qizmt. Semantic Computing (ICSC), 2010 IEEE Fourth International Conference on, 126-133.
[16] Bondi, A. B. (2000). Characteristics of Scalability and Their Impact on Performance. WOSP 2000, Ontario, Canada.
[17] Leu, J.-S., Yee, Y.-S., & Chen, W.-L. (2010). Comparison of Map-Reduce and SQL on Large-scale Data Processing. International Symposium on Parallel and Distributed Processing with Applications, 244-248.
[18] Mackey, G., Sehrish, S., Bent, J., Lopez, J., Habib, S., & Wang, J. (2008). Introducing Map-Reduce to High End Computing.
[19] Zheng, Q. (2010). Improving MapReduce Fault Tolerance in the Cloud.
[20] Ekanayake, J., Gunarathne, T., & Qiu, J. (2011). Cloud Technologies for Bioinformatics Applications. IEEE Transactions on Parallel and Distributed Systems, Vol. 22, No. 6, 998-1011.
[21] Yong, L., Zhen, M., Qi, L., Yuanchun, G. Y., & Jianhu, L. (2010). Screening Data for Phylogenetic Analysis of Land Plants: A Parallel Approach. 2010 First International Conference on Networking and Distributed Computing, 305-308.
[22] Yang, X.-l., Liu, Y.-l., Yuan, C.-f., & Huang, Y.-h. (2011). Parallelization of BLAST with MapReduce for Long Sequence Alignment. 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming, 241-246.
[23] Luo, M., & Liu, G. (2010). Distributed Log Information Processing with Map-Reduce. 1143-1146.
[24] Xu, Y., Kostamaa, P., & Qi, Y. (2011). A Hadoop Based Distributed Loading Approach to Parallel Data Warehouses. SIGMOD'11, June 12-16, 2011, 1091-1099.
[25] Pavlo, A., Paulson, E., Rasin, A., Abadi, D. J., DeWitt, D. J., Madden, S., et al. (2009). A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD'09, June 29-July 2, 2009, Providence, Rhode Island, USA, 165-178.
[26] Khalid, N., Noor, N., Ahmad, S., Rosli, M., & Taib, M. (2011). Parallel Laplacian Edge Detection Performance Analysis on Green Cluster Architecture. Digital Enterprise and Information Systems: International Conference, DEIS 2011, London, UK, July 20-22, 2011, Proceedings, Volume 194, 308.
[27] Apache Hadoop. http://hadoop.apache.org/
[28] MySpace Qizmt. http://qizmt.myspace.com/
