The Impact of Virtualization on High Performance
Computing Clustering in the Cloud
Master Thesis Report
Submitted in
Fall 2013
In partial fulfillment of the requirements for the degree of
Master of Science in Software Engineering at the School of
Science and Engineering of Al Akhawayn University in Ifrane
By
Ouidad ACHAHBAR
Supervised by
Dr. Mohamed Riduan ABID
Ifrane, Morocco
January, 2014
Acknowledgment
I would like to express my deepest and sincere gratitude to ALLAH for giving me guidance
and strength to complete this work, and for having the chance to study and accomplish my
master degree with high support from my family, friends and professors. Thank you ALLAH.
I would also like to deeply thank my supervisor Dr. Abid for trusting me to conduct this
research, providing me with valuable feedback and overseeing my progress on a weekly basis.
Thank you Dr. Abid for your motivation and support.
My gratitude also goes to Dr. Haitouf who provided me with valuable comments and shared
with me his knowledge in cloud computing and distributed systems. Thank you Dr. Haitouf.
I am most thankful to my dear parents, brothers, sisters, nephews and fiancé for their
continuous support, encouragement and love. There are no words to express my gratitude to
all of you.
Many thanks go to my very close friends: Nora El Bakraoui Alaoui, Inssaf El Boukari, Sara
El Alaoui, Aida Tahditi, Jamila Barroug, Wafa Bouya and Chahrazad Touzani. Thank you for
being always by my side; thank you for sharing enjoyable moments with me, and thank you
for being my friends.
Last but not least, special acknowledgements go to all my professors for their support, respect
and encouragement. Thank you Ms. Hanaa Talei, Ms. Asmaa Mourhir, Dr. Naeem Nizar
Sheikh, Mr. Omar Iraqui, Dr. Violetta Cavalli Sforza, Dr. Kevin Smith and Dr. Harroud.
Ouidad Achahbar
Abstract
The ongoing pervasiveness of Internet access is greatly increasing big data production. This,
in turn, increases the demand for compute power to process the massive data, thus rendering
High Performance Computing (HPC) a highly solicited service.
Based on the paradigm of providing computing as a utility, the cloud offers user-friendly
infrastructures for processing this big data, e.g., High Performance Computing as a Service
(HPCaaS). Still, HPCaaS performance is tightly coupled with the underlying virtualization
technique, since the latter controls the creation of the virtual machine instances that carry out
data processing jobs.
In this thesis, we characterize and evaluate the impact of machine virtualization on HPCaaS.
We track HPC performance under different cloud virtualization platforms, namely KVM and
VMware ESXi, and compare it to the performance of a physical computing cluster
infrastructure. The virtualized environment is deployed using Hadoop on top of OpenStack.
The resulting HPCaaS runs MapReduce algorithms on benchmarked big data samples using a
granularity of 8 physical machines per cluster.
We obtained several interesting results when running the selected benchmarks on the
virtualized and physical clusters. Each tested cluster exhibited different performance trends.
Yet, the overall analysis of the research findings shows that the selection of virtualization
technology can lead to significant improvements when running and handling HPCaaS.
ملخص
(English translation of the Arabic abstract:)
The ongoing pervasiveness of Internet use is a major cause of the growing production of big
data. This, in turn, increases the demand for high computational capabilities to process this
data. These trends have made High Performance Computing as a Service a compelling offering.
Based on the paradigm of providing computing as a utility, cloud computing offers flexible,
easy-to-use infrastructures for processing big data, for example High Performance Computing
as a Service (HPCaaS). However, the performance of the latter is tightly coupled with the
virtualization technology, since it controls the creation of the virtual machines that carry out
the data processing jobs.
In this thesis, we characterize and evaluate the impact of virtualization on HPCaaS. We also
track the performance of HPC on different cloud virtualization platforms and on a physical
computing cluster of eight machines. We used OpenStack to build the HPCaaS and Hadoop to
run MapReduce algorithms on big data.
Through the results of this research, we observed significant variation in the performance of
HPC as the data size, the type of computing infrastructure (physical or virtualized), and the
cluster size change. Nevertheless, our conclusion establishes that the virtualization technology
plays an important and considerable role in improving HPC performance.
Table of Contents
Acknowledgment 2
Abstract 3
ملخص 4
Table of Contents 5
List of Figures 7
List of Tables 9
List of Appendices 10
List of Acronyms 11
PART I: THESIS OVERVIEW 12
Chapter 1: Introduction 13
1.1. Background 13
1.2. Motivation 14
1.3. Problem Statement 15
1.4. Research Question 15
1.5. Research Objective 15
1.6. Research Approach 15
1.7. Thesis Organization 16
PART II: THEORETICAL BASELINES 17
Chapter 2: Cloud Computing 18
2.1. Cloud Computing Definition 18
2.2. Cloud Computing Characteristics 19
2.3. Cloud Computing Service Models 20
2.4. Cloud Computing Deployment Models 21
2.5. Cloud Computing Benefits 22
2.6. Cloud Computing Providers 23
Chapter 3: Virtualization 24
3.1. Definition of Virtualization 24
3.2. History of Virtualization 25
3.3. Benefits of Virtualization 25
3.4. Virtualization Approaches 26
3.5. Virtual Machine Manager 28
Chapter 4: Big Data and High Performance Computing as a Service 32
4.1. Big Data 32
4.2. High Performance Computing as a Service (HPCaaS) 33
Chapter 5: Literature Review and Research Contribution 35
5.1. Related Work 35
5.2. Contribution 36
PART III: TECHNOLOGY ENABLERS 37
Chapter 6: Technology Enablers Selection 38
6.1. Cloud Platform Selection 38
6.2. Distributed and Parallel System Selection 40
Chapter 7: Openstack 42
7.1. OpenStack Overview 42
7.2. OpenStack History 42
7.3. OpenStack Components 43
7.4. OpenStack Supported Hypervisors 49
Chapter 8: Hadoop 50
8.1. Hadoop Overview 50
8.2. Hadoop History 50
8.3. Hadoop Architecture 51
8.4. Hadoop Implementation 52
8.5. Hadoop Cluster Connectivity 55
PART IV: RESEARCH CONTRIBUTION 57
Chapter 9: Research Methodology 58
9.1. Research Approach 58
9.2. Research Steps 58
Chapter 10: Experimental Setup 59
10.1. Experimental Hardware 59
10.2. Experimental Software and Network 60
10.3. Clusters Architecture 60
10.4. Experimental Performance Benchmarks 64
10.5. Experimental Datasets Size 65
10.6. Experiment Execution 66
Chapter 11: Experimental Results 67
11.1. Hadoop Physical Cluster Results 67
11.2. Hadoop Virtualized Cluster - KVM Results 72
11.3. Hadoop Virtualized Cluster - VMware ESXi Results 77
11.4. Results Comparison 82
Chapter 12: Discussion 88
12.1. TeraSort 88
12.2. TestDFSIO 90
12.3. Conclusion 91
PART V: CONCLUSION 92
Chapter 13 93
Conclusion and Future Work 93
Bibliography 94
Appendix A: OpenStack with KVM Configuration 100
Appendix B. OpenStack with VMware ESXi Configuration 127
Appendix C: Hadoop Configuration 131
Appendix D: TeraSort and TestDFSIO Execution 145
Appendix E: Data Gathering for TeraSort 147
Appendix F: Data Gathering for TestDFSIO 153
List of Figures
Figure 1: Thesis organization ................................................................................................................ 16
Figure 2: NIST visual model of cloud computing definition ................................................................ 19
Figure 3: services provided in cloud computing environment .............................................................. 21
Figure 4: Full virtualization architecture .............................................................................................. 26
Figure 5: Paravirtualization architecture .............................................................................................. 27
Figure 6: Hardware assisted virtualization architecture ....................................................................... 28
Figure 7: a) Type 1 hypervisor b) Type 2 hypervisor ........................................................................... 29
Figure 8: Xen hypervisor architecture ................................................................................................... 30
Figure 9: KVM hypervisor architecture ................................................................................................ 31
Figure 10: VMware ESXi architecture ................................................................................................. 31
Figure 11: Data growth over 2008 and 2020 ........................................................................................ 32
Figure 12: Active cloud community population .................................................................................... 38
Figure 13: Active distributed systems population ................................................................................. 40
Figure 14: OpenStack conceptual architecture ..................................................................................... 44
Figure 15: Nova subcomponents ........................................................................................................... 44
Figure 16: Glance subcomponents ........................................................................................................ 46
Figure 17: Keystone subcomponents ..................................................................................................... 46
Figure 18: Swift subcomponents ........................................................................................................... 47
Figure 19: Cinder subcomponents ......................................................................................................... 48
Figure 20: Quantum subcomponents ..................................................................................................... 48
Figure 21: Apache Hadoop subprojects ............................................................................................... 51
Figure 22: Hadoop Architecture ............................................................................................................ 52
Figure 23: HDFS and MapReduce representation ................................................................................. 53
Figure 24: Word count MapReduce example ....................................................................................... 55
Figure 25 : Research steps ..................................................................................................................... 58
Figure 26 : Hadoop Physical Cluster ..................................................................................................... 61
Figure 27: Hadoop Physical Cluster architecture .................................................................................. 61
Figure 28: Hadoop virtualized cluster - KVM ...................................................................................... 62
Figure 29: Hadoop virtualized cluster – VMware ESXi (a) .................................................................. 63
Figure 30 : Hadoop virtualized cluster – VMware ESXi (b) ................................................................. 64
Figure 31 : Experimental execution ...................................................................................................... 66
Figure 32: TeraSort performance on Hadoop Physical Cluster ............................................................ 67
Figure 33: TeraSort performance for 100 MB on Hadoop Physical Cluster ........................................ 68
Figure 34 : TeraSort performance for 1 GB on Hadoop Physical Cluster............................................. 68
Figure 35: TeraSort performance for 10 GB on Hadoop Physical Cluster........................................... 68
Figure 36 : TeraSort performance for 30 GB on Hadoop Physical Cluster........................................... 68
Figure 37: TestDFSIO-Write performance on Hadoop Physical Cluster .............................................. 69
Figure 38: TestDFSIO-Write performance for 1 GB on Hadoop Physical Cluster ............................ 70
Figure 39 : TestDFSIO-Write performance for 100 MB on Hadoop Physical Cluster ......................... 70
Figure 40: TestDFSIO-Write performance for 10 GB on Hadoop Physical Cluster ........................... 70
Figure 41 : TestDFSIO-Write performance for 100 GB on Hadoop Physical Cluster .......................... 70
Figure 42: TestDFSIO-Read performance on Hadoop Physical Cluster ............................................... 71
Figure 43: TestDFSIO-Read performance for 100 MB on Hadoop Physical Cluster .......................... 71
Figure 44 : TestDFSIO-Read performance for 1 GB on Hadoop Physical Cluster ............................. 71
Figure 45: TestDFSIO-Read performance for 10 GB on Hadoop Physical Cluster ............................. 72
Figure 46 : TestDFSIO-Read performance for 100 GB on Hadoop Physical Cluster .......................... 72
Figure 47: TeraSort performance on Hadoop KVM Cluster ................................................................. 72
Figure 48: TeraSort performance for 100 MB on Hadoop KVM Cluster ............................................ 73
Figure 49 : TeraSort performance for 1 GB on Hadoop KVM Cluster ................................................. 73
Figure 50: TeraSort performance for 10 GB on Hadoop KVM Cluster ................................................ 73
Figure 51 : TeraSort performance for 30 GB on Hadoop KVM Cluster ............................................... 73
Figure 52 : TestDFSIO-Write performance on Hadoop KVM Cluster ................................................. 74
Figure 53: TestDFSIO-Write performance for 100 MB on Hadoop KVM Cluster .............................. 75
Figure 54 : TestDFSIO-Write performance for 1GB on Hadoop KVM Cluster ................................... 75
Figure 55: TestDFSIO-Write performance for 10 GB on Hadoop KVM Cluster ................................. 75
Figure 56 : TestDFSIO-Write performance for 100 GB on Hadoop KVM Cluster .............................. 75
Figure 57: TestDFSIO-Read performance on Hadoop KVM Cluster .................................... 76
Figure 58: TestDFSIO-Read performance for 100 MB on Hadoop KVM Cluster ............................... 76
Figure 59 : TestDFSIO-Read performance for 1GB on Hadoop KVM Cluster .................................... 76
Figure 60: TestDFSIO-Read performance for 10 GB on Hadoop KVM Cluster ................. 77
Figure 61 : TestDFSIO-Read performance for 100 GB on Hadoop KVM Cluster ................ 77
Figure 62 : TeraSort performance on Hadoop VMware ESXi Cluster ................................................. 77
Figure 63: TeraSort performance for 100 MB on Hadoop VMware ESXi Cluster ............................. 78
Figure 64 : TeraSort performance for 1GB on Hadoop VMware ESXi Cluster ................................... 78
Figure 65: TeraSort performance for 10 GB on Hadoop VMware ESXi Cluster ............................... 78
Figure 66 : TeraSort performance for 30GB on Hadoop VMware ESXi Cluster ................................. 78
Figure 67 : TestDFSIO-Write performance on Hadoop VMware ESXi Cluster ................................... 79
Figure 68: TestDFSIO-Write performance for 100 MB on Hadoop VMware ESXi Cluster .............. 80
Figure 69 : TestDFSIO-Write performance for 1GB on Hadoop VMware ESXi Cluster .................... 80
Figure 70: TestDFSIO-Write performance for 10 GB on Hadoop VMware ESXi Cluster ................ 80
Figure 71 : TestDFSIO-Write performance for 100 GB on Hadoop VMware ESXi Cluster ............... 80
Figure 72 : TestDFSIO-Read performance on Hadoop VMware ESXi Cluster ................................... 81
Figure 73: TestDFSIO- Read performance for 100 MB on Hadoop VMware ESXi Cluster .............. 81
Figure 74 : TestDFSIO-Read performance for 1 GB on Hadoop VMware ESXi Cluster .................... 81
Figure 75: TestDFSIO- Read performance for 10 GB on Hadoop VMware ESXi Cluster ................ 82
Figure 76 : TestDFSIO-Read performance for 100 GB on Hadoop VMware ESXi Cluster ................ 82
Figure 77 : Average time for sorting 100 MB on HPhC, HVC with KVM and VMware ESXi ............ 83
Figure 78 : Average time for sorting 1 GB on HPhC, HVC with KVM and VMware ESXi................ 83
Figure 79 : Average time for sorting 10 GB on HPhC, HVC with KVM and VMware ESXi .............. 84
Figure 80 : Average time for sorting 30 GB on HPhC, HVC with KVM and VMware ESXi .............. 84
Figure 81: Average time for writing 1 GB on HPhC, HVC with KVM and VMware ESXi ................ 85
Figure 82 : Average time for writing 10 GB on HPhC, HVC with KVM and VMware ESXi ............. 85
Figure 83: Average time for writing 100 GB on HPhC, HVC with KVM and VMware ESXi ............ 86
Figure 84: Average time for reading 100 MB on HPhC, HVC with KVM and VMware ESXi ........... 86
Figure 85 : Average time for reading 1 GB on HPhC, HVC with KVM and HVC VMware ESXi ....... 86
Figure 86: Average time for reading 10 GB on HPhC, HVC with KVM and HVC VMware ESXi .... 87
Figure 87 : Average time for reading 100 GB on HPhC, HVC with KVM and HVC VMware ESXi 87
Figure 88: Memory overhead when running 30 GB (started at 4.55PM) on 8 VMware ESXi VMs .... 89
Figure 89 : System latency reaches its peak (at 12.28PM) when running 30 GB on 8 VMware ESXi VMs ... 89
Figure 90: OpenStack warning statistics about system resource usage .............................................. 90
List of Tables
Table 1 : A Comparison of cloud deployment models ......................................................................... 22
Table 2 : Cloud IaaS selection ............................................................................................................... 39
Table 3 : Parallel and distributed platform selection ............................................................................. 41
Table 4 : OpenStack releases ................................................................................................................ 43
Table 5 : OpenStack projects ................................................................................................................. 43
Table 6: Apache Hadoop subprojects .................................................................................................... 51
Table 7 : Dell OptiPlex 755 computer features (used for Hadoop physical cluster) ............................. 59
Table 8 : Dell PowerEdge server used for building OpenStack & Hadoop virtualized cluster ............. 60
Table 9 : OpenStack virtual machines’ features .................................................................................... 60
Table 10 : Experimental performance metrics ...................................................................................... 64
Table 11 : Datasets size used for Hadoop benchmarks ......................................................................... 65
Table 12: Average time (in seconds) of running TeraSort on different dataset sizes and different
number of nodes- Hadoop Physical Cluster .......................................................................................... 67
Table 13: Average time (in seconds) of running TestDFSIO-Write on different dataset sizes and
different number of nodes- Hadoop Physical Cluster ........................................................................... 69
Table 14: Average time (in seconds) of running TestDFSIO-Read on different dataset sizes and
different number of nodes- Hadoop Physical Cluster ........................................................................... 71
Table 15: Average time (in seconds) of running TeraSort on different dataset sizes and different
number of nodes- Hadoop KVM Cluster .............................................................................................. 72
Table 16: Average time (in seconds) of running TestDFSIO-Write on different dataset sizes and
different number of nodes- Hadoop KVM Cluster ................................................................................ 74
Table 17 : Average time (in seconds) of running TestDFSIO-Read on different dataset sizes and
different number of nodes- Hadoop KVM Cluster ................................................................ 76
Table 18 : Average time (in seconds) of running TeraSort on different dataset sizes and
different number of nodes- Hadoop VMware ESXi Cluster ................................................................. 77
Table 19 : Average time (in seconds) of running TestDFSIO-Write on different dataset sizes and
different number of nodes- Hadoop VMware ESXi Cluster ................................................................. 79
Table 20 : Average time (in seconds) of running TestDFSIO-Read on different dataset sizes and
different number of nodes- Hadoop VMware ESXi Cluster ................................................................. 81
List of Appendices
Appendix A : OpenStack with KVM Configuration……………………………………………...….100
Appendix B : OpenStack with VMware ESXi Configuration……………………………………….127
Appendix C: Hadoop Configuration………………………………………………….....……………131
Appendix D: TeraSort and TestDFSIO Execution…………………………………….… ………….145
Appendix E: Data Gathering for TeraSort……………………………………………..……………..147
Appendix F: Data Gathering for TestDFSIO…………………………………………………………153
List of Acronyms
HPC High Performance Computing
HPCaaS High Performance Computing as a Service
VM Virtual Machine
VMM Virtual Machine Manager
EMC EMC Corporation (American multinational data storage company)
DCI Digital Communications Inc.
GFS Google File System
HDFS Hadoop Distributed File System
NDFS Nutch Distributed File System
DOE Department of Energy National Laboratories
NIST National Institute of Standards and Technology
SaaS Software as a Service
PaaS Platform as a Service
IaaS Infrastructure as a Service
NoSQL Not Only Structured Query Language
SNIA Storage Networking Industry Association
ACID Atomicity, Consistency, Isolation and Durability
AWS Amazon Web Services
HPhC Hadoop Physical Cluster
HVC Hadoop Virtualized Cluster
SSH Secure Shell
JSON JavaScript Object Notation
XML Extensible Markup Language
API Application Programming Interface
Amazon EC2 Amazon Elastic Compute Cloud
Amazon S3 Amazon Simple Storage Service
VLAN Virtual Local Area Network
DHCP Dynamic Host Configuration Protocol
Part I: Thesis Overview
This part introduces the key points needed to understand the purpose of the present research. It
provides an introduction to the research, covering its background, motivation, problem
statement, research question, objective and research approach.
Chapter 1: Introduction
In this chapter, we first present the background of this research, and then describe the
motivation and the problem behind conducting this study. After that, the questions, objectives,
and methodology of the research are stated. Finally, an outline of the thesis is given at the
end of this chapter.
1.1. Background
During the last decades, the demand for computing power has steadily increased as data
generated from social networks, web pages, sensors, online transactions, etc. is continuously
growing. A study done in 2012 by American Multinational Corporation (EMC), has estimated
that from 2005 to 2020, data will grow by a factor of 300 (from 130 exabytes to 40,000
exabytes), and therefore, digital data will be doubled every two years [1]. The growth of data
constitutes the “Big Data” phenomenon.
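As a quick sanity check on these figures (our own arithmetic, not part of the EMC study), a
factor-of-300 growth over the fifteen years from 2005 to 2020 does imply a roughly two-year
doubling time:
\[
\frac{40{,}000\ \mathrm{EB}}{130\ \mathrm{EB}} \approx 308 \approx 2^{8.3},
\qquad
T_{\mathrm{double}} \approx \frac{15\ \text{years}}{8.3} \approx 1.8\ \text{years}.
\]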
As Big Data grows in terms of volume, velocity and value, current technologies for storing,
processing and analyzing data become inefficient and insufficient. A Gartner survey found
that data growth is considered the largest challenge for organizations [2]. To address this
issue, High Performance Computing (HPC) has become widely integrated in managing and
handling Big Data. In this context, HPC is used to process and analyze Big Data arising in
scientific, engineering and business problems that require high computation capabilities,
high bandwidth, and low latency networks [3].
However, HPC still lacks toolsets that fit the current growth of data. Consequently, new
paradigms and storage tools have been integrated with HPC to deal with the current challenges
of data management. These technologies include providing computing as a utility (cloud
computing) and new parallel and distributed paradigms.
Cloud computing plays an important role as it provides organizations with the ability to
analyze and store data economically and efficiently. Performing HPC in the cloud was
introduced as data started to be migrated to and managed in the cloud. Digital
Communications Inc. (DCI) stated that by 2020, a significant portion of digital data will be
managed in the cloud, and even if a byte in the digital universe is not stored in the cloud, it
will pass, at some point, through the cloud [4]. Performing HPC in the cloud is known as
High Performance Computing as a Service (HPCaaS). In short, HPCaaS offers a high-
performance, on-demand, and scalable HPC environment that can handle the complexity and
challenges related to Big Data [5].
One of the best-known and most widely adopted parallel and distributed systems is the
MapReduce model, which was developed by Google to meet the growing demands of its web
search indexing process [6]. MapReduce computations are performed with the support of a
data storage system known as the Google File System (GFS). The success of both the Google
File System and MapReduce inspired the development of Hadoop, a distributed and parallel
system that implements MapReduce and the Hadoop Distributed File System (HDFS) [7].
Nowadays, Hadoop is widely adopted by big players in the market because of its scalability,
reliability and low cost of implementation. Accordingly, Hadoop has also been proposed for
integration with HPC as an underlying technology that distributes work across an HPC
cluster [8, 9]; a minimal sketch of the programming model is given below.
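To make the programming model concrete, the following minimal word-count sketch
illustrates MapReduce in the style of Hadoop Streaming. It is our own illustration, not code
from the thesis: the mapper emits (word, 1) pairs, the reducer sums the counts per word, and a
local sort stands in for Hadoop's shuffle phase.

    # Minimal MapReduce word-count sketch (illustrative, Hadoop Streaming style).
    import sys
    from itertools import groupby

    def mapper(lines):
        # Map phase: emit a (word, 1) pair for every word in the input.
        for line in lines:
            for word in line.split():
                yield word, 1

    def reducer(pairs):
        # Reduce phase: pairs arrive grouped by key; sum the counts per word.
        for word, group in groupby(pairs, key=lambda kv: kv[0]):
            yield word, sum(count for _, count in group)

    if __name__ == "__main__":
        # In Hadoop, the framework sorts and shuffles mapper output across the
        # cluster; here a local sort plays that role.
        pairs = sorted(mapper(sys.stdin))
        for word, total in reducer(pairs):
            print(f"{word}\t{total}")

In a real Hadoop job, many mapper and reducer instances of this kind run in parallel across
the cluster, with HDFS supplying the input splits.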
1.2. Motivation
Many solutions have been proposed and developed to improve the computation performance
of Big Data. Some of them aim to improve algorithm efficiency, provide new distributed
paradigms, or develop powerful clustering environments. However, few of these solutions
address the whole picture of integrating HPC with the current emerging technologies in terms
of storage and processing.
As stated before, some of the most popular technologies currently used in hosting and
processing Big Data are cloud computing, HDFS and Hadoop MapReduce [10]. At present,
the use of HPC in cloud computing is still limited. A first step towards this research was
taken by the Department of Energy National Laboratories (DOE), which started exploring the
use of cloud services for scientific computing [11]. Besides, in 2009, Yahoo Inc. launched a
partnership with top universities in the United States to conduct more research on cloud
computing, distributed systems and high performance computing applications.
HPCaaS still needs more investigation to determine appropriate environments that can fit high
computing requirements. One aspect of HPCaaS that has not yet been investigated is the impact
of different virtualization technologies on HPC in the cloud. Therefore, the motivation of this
research is the need to evaluate HPCaaS performance using MapReduce under different
virtualization techniques. This motivation is reinforced by a strong rationale: the free
availability of open-source MapReduce and cloud computing implementations.
1.3. Problem Statement
Cloud computing offers a set of services for processing Big Data; one of these services is
HPCaaS. Still, HPCaaS performance is highly affected by the underlying virtualization
techniques, which are considered the heart of cloud computing. Accordingly, the problem
addressed in this research is formulated as follows: “HPCaaS still faces poor performance
and still does not fit Big Data requirements”.
1.4. Research Question
Addressing the problem statement, this thesis aims at bringing answers to the following
research questions:
1. What is the performance of HPC on Hadoop Physical Cluster (HPhC)?
2. Is it worth moving HPC to the cloud?
3. How do virtualization techniques affect HPCaaS performance?
4. Is there an optimal virtualization technique that can ensure good performance?
1.5. Research Objective
The purpose of the present research is to find solutions to the issues and questions raised in
the previous sections. Hence, this research introduces a new architecture that can handle
HPC complexity and increase its performance. The proposed architecture consists of building
a Hadoop Virtualized Cluster (HVC) in a private cloud using OpenStack. The first goal
of this research is therefore to investigate the added value of adopting a virtualized cluster, and
the second goal is to evaluate the impact of virtualization techniques on HPCaaS.
1.6. Research Approach
To evaluate HPCaaS over different virtualization technologies, we followed both qualitative
and quantitative research methodologies. The qualitative approach was adopted to select
appropriate technology enablers for building an architecture that addresses the issues raised
in this study. The quantitative approach, on the other hand, was adopted to conduct
experiments on three different clusters: Hadoop Physical Cluster (HPhC), Hadoop
Virtualized Cluster using KVM (HVC-KVM) [12] and Hadoop Virtualized Cluster using
VMware ESXi (HVC-VMware ESXi) [13]. Each experiment measures the performance
of HPC.
1.7. Thesis Organization
The rest of this thesis is structured as follows (Figure 1):
Part I covers chapter 1 (current chapter) which introduces the present research.
Part II covers chapters 2, 3, 4 and 5. Chapter 2 provides a basic understanding of cloud
computing; chapter 3 introduces virtualization; chapter 4 presents the concepts of Big
Data and HPCaaS, and chapter 5 lists related work and clearly states our
contribution.
Part III covers chapters 6, 7 and 8. Chapter 6 explains the steps we followed in selecting
the technology enablers of this research, and chapters 7 and 8 present OpenStack and
Hadoop in detail, respectively.
Part IV covers chapters 9, 10, 11 and 12. Chapter 9 presents the methodology adopted in
conducting this research; chapter 10 describes the environment prepared to run the
needed experiments; chapter 11 introduces the results, and chapter 12 discusses the
research findings.
Part V covers chapter 13, which concludes the research findings and proposes some future
work; this part also includes the bibliography and appendices of this study.
Figure 1: Thesis organization
Part II: Theoretical Baselines
The objective of this part is to elaborate and shed light on the scientific concepts, theories
and topics that serve as a foundation for understanding the whole picture of the present
research. Hence, this part is structured as follows: chapter 2 gives basic background on cloud
computing; chapter 3 introduces a key cloud-related technology, namely virtualization;
chapter 4 presents Big Data and HPCaaS, and chapter 5 situates this research by reviewing
previous work done on evaluating HPC.
Chapter 2: Cloud Computing
Cloud computing has become the innovative and emerging trend in delivering IT services,
attracting interest from both academia and industry. Using advanced technologies, cloud
computing provides end users with a variety of services, from hardware-level services to the
application level. Cloud computing is understood as utility computing over the Internet. That
is, computing services have moved from local data centers to hosted services offered over the
Internet and paid for on a pay-per-use basis [14]. This chapter provides an overview of the
cloud computing concept. It gives a distinct definition of what cloud computing is, defines
cloud computing characteristics, describes cloud service and deployment models, discusses
some cloud computing benefits, and finally lists some cloud computing providers.
2.1. Cloud Computing Definition
In the late 1960s, John McCarthy brought a new concept into the computer science field,
predicting that technology would not be provided only as tangible products [14]; rather,
computing resources would be provided as a service, like water and electricity. The concept
was known as utility computing, and nowadays it is known as cloud computing.
Cloud computing is defined by NIST (National Institute of Standards and Technology) [15] in
2009 as:
“Cloud computing is a model for enabling ubiquitous,
convenient, on-demand network access to a shared
pool of configurable computing resources (e.g.,
networks, servers, storage, applications, and services)
that can be rapidly provisioned and released with
minimal management effort or service provider
interaction. This cloud model is composed of five
essential characteristics, three service models, and
four deployment models. ”
The NIST definition of the cloud sheds light on the effective use of cloud computing in terms
of requiring minimal management effort for shared resources. It sets out five characteristics
that define cloud computing: on-demand self-service, broad network access, resource pooling,
rapid elasticity and measured service. Concerning the deployment models, NIST has
classified them into: private, public, community and hybrid cloud. More details about cloud
characteristics, delivery and deployment models are provided in the upcoming subsections.
The NIST definition of cloud is summarized in Figure 2 which encapsulates cloud computing
characteristics, service models, and deployment models.
Figure 2: NIST visual model of cloud computing definition [14]
2.2. Cloud Computing Characteristics
NIST has listed five main characteristics that precisely describe cloud computing, which are
[15]:
On-demand self-service: end users can use and change computing capabilities as desired
without the need for human interaction with each service provider.
Broad network access: resources are accessed over the network using standard mechanisms.
Resource pooling: the provider’s computing resources are pooled to serve multiple
consumers; these resources are dynamically assigned and reassigned according to
consumer demand. Examples of resources include storage, processing, memory, and
network bandwidth.
Rapid elasticity: cloud providers can elastically scale in and scale out resources
depending on current end users’ demand. Therefore, resources can be available for
provisioning in any quantity at any time.
Measured service: resources usage can be monitored, controlled and measured;
therefore, these features enable end users to pay using the pay as you go model.
Other characteristics, investigated in [16], are listed as follows:
Reliability: ensured by implementing and providing multiple redundant sites. With this
feature, cloud computing is considered an ideal solution for disaster recovery and
business-critical tasks.
Customization: cloud computing allows customization of infrastructure and applications
based on end users' demands.
Efficient resource utilization: resources are delivered only as long as they are needed.
2.3. Cloud Computing Service Models
Based on the NIST definition of cloud computing, cloud service models are classified as
follows:
Software as a Service (SaaS)
Software as a Service (SaaS) encompasses application software together with the operating
system and underlying computing resources. End users view the SaaS model as a web-based
application interface where services and complete software applications are delivered over the
Internet. Some examples of SaaS applications are: Google Docs, Microsoft Office Live,
Salesforce Customer Relationship Management, etc.
Platform as a Service (PaaS)
This service allows end users to create and deploy applications on the provider's cloud
infrastructure. In this case, end users do not manage or control the underlying cloud
infrastructure such as the network, servers, operating systems, or storage. However, they do
have control over the deployed applications, being allowed to design, model, develop and test
them. Examples of PaaS are: Google App Engine, Microsoft Azure, Salesforce, etc.
Infrastructure as a Service (IaaS)
This service consists of a set of virtualized computing resources such as network bandwidth,
storage capacity, memory, and processing power. These resources can be used to deploy and
run arbitrary software, which can include operating systems and applications. Examples of
IaaS providers are Dropbox, Amazon Web Services, etc.
Cloud services are summarized in Figure 3.
Figure 3: services provided in cloud computing environment [16]
2.4. Cloud Computing Deployment Models
Private Cloud
A private cloud is provisioned for exclusive use by a single organization. The cloud in this
case is owned, managed and operated by the organization, a third party, or both. The
advantage of a private cloud is high security, since the cloud is accessed only by trusted
entities within the organization [15].
Public Cloud
The cloud infrastructure is provisioned for use by the general public. It may be owned,
managed, and operated by a cloud service provider who offers services on a pay-per-use
model. In contrast to a private cloud, a public cloud is considered an untrusted environment [15].
Community Cloud
The cloud infrastructure is provisioned for exclusive use by a specific community of
consumers from different organizations that share some goals (e.g., mission, security
requirements, policy, and compliance considerations). In this case, the cloud may be owned,
managed, and operated by one or more organizations in the community, a third party, or
a combination of them [15].
Hybrid Cloud
This cloud is a combination of both private and public cloud environments. A hybrid cloud
provides high flexibility and choice for organizations; for instance, the critical core activities
of an organization can be run under the control of the private part of the hybrid cloud, while
other tasks are outsourced to the public part [17].
Table 1 summarizes cloud deployment models discussed above [17].
Table 1 : A Comparison of cloud deployment models [17]
2.5. Cloud Computing Benefits
Nowadays, the cloud is widely used because of the benefits it provides to end users. Some of
the key benefits offered by the cloud include [17, 18]:
Initial Cost Savings
Organizations and individuals can avoid the large initial investment required to launch new
hardware, products and services; the cloud platform offers the needed resources in terms of
infrastructure, platforms and applications.
Scalability
Cloud computing ensures high scalability by scaling resources up as needed. Therefore, when
usage increases, resources increase accordingly to meet end users'
demand.
Availability
Cloud providers have the infrastructure and bandwidth to accommodate business
requirements for high speed access, storage and systems.
Reliability
Cloud computing implements redundant paths to support business continuity and disaster
recovery.
Maintenance
End users are not concerned with resource maintenance, since it is handled by the cloud
service provider.
2.6. Cloud Computing Providers
There are many providers offering cloud services with different features and pricing. Some of
them are listed as follows [16, 19]:
Amazon Web Services
Amazon Web Services (AWS) [20] offers a number of cloud services for businesses of all
sizes. AWS implements advanced data privacy techniques to protect users' data. For that
reason, AWS has obtained various security certifications and audits such as ISO 27001,
FISMA Moderate and SAS 70 Type II. Some AWS services are: Elastic Compute Cloud,
Simple Storage Service, SimpleDB (a non-relational data storage service that stores, processes
and queries data sets in the cloud), etc.
Google
Google [21] offers high accessibility and usability in its cloud services. Some Google services
include: Google App Engine, Gmail, Google Docs, Google Analytics, Picasa (a tool used to
exhibit products and upload their images in the cloud), etc.
Microsoft
Microsoft [22] offers a famous cloud platform called Windows Azure which runs Windows
applications. Some other services include: SQL Azure, Windows Azure Marketplace (an
online market to buy and sell applications and data), etc.
OpenStack
OpenStack [23] is an open source platform for public and private cloud computing that aims
at ensuring scalability and flexibility. It was founded by Rackspace hosting and NASA.
Some other organizations that invest in the cloud are: Dell, IBM, Oracle, HP, Salesforce, etc.
[16].
Chapter 3: Virtualization
Cloud providers rely on many different technologies and practices; among them are Internet
protocols for communication, virtual private cloud provisioning, load balancing and
scalability, distributed processing, high performance computing technologies and
virtualization [24]. This chapter focuses on virtualization technology, as it is considered the
core of cloud computing. It describes in detail the history, benefits, types and abstraction
layer of virtualization.
3.1. Definition of Virtualization
Virtualization is a widely used term; it has been established for many years as a powerful
technology in computer science. The definition of virtualization can change depending on the
computer system component to which it is applied. Broadly, however, it is defined as an
abstraction layer between physical resources and their logical representation [25]. NIST has
defined virtualization as [26]:
“The simulation of the software and/or hardware upon which other
software runs. This simulated environment is called a virtual machine
(VM). There are many forms of virtualization, distinguished primarily by
computing architecture layer. For example, application virtualization
provides a virtual implementation of the application programming
interface (API) that a running application expects to use, allowing
applications developed for one platform to run on another without
modifying the application itself. The Java Virtual Machine (JVM) is an
example of application virtualization; it acts as an intermediary between
the Java application code and the operating system (OS). Another form of
virtualization, known as operating system virtualization, provides a
virtual implementation of the OS interface that can be used to run
applications written for the same OS as the host, with each application in
a separate VM container”.
Furthermore, virtualization is defined by SNIA (Storage Networking Industry Association) as
follows [27]:
“The act of abstracting, hiding, or isolating the internal functions of a
storage (sub) system or service from applications, host computers, or general
network resources, for the purpose of enabling application and network-
independent management of storage or data”.
From both definitions, we can say that virtualization is a methodology for dividing a physical
machine into multiple execution environments that allow multiple tasks to run
simultaneously. This is done by providing a software abstraction layer called the Virtual
Machine Manager (VMM) or hypervisor. The VMM is designed to hide the physical
resources from the operating system. In this way, the VMM allows the creation of multiple
guest Operating Systems (OS), each guest running in a software unit called a Virtual Machine
(VM) [28].
3.2. History of Virtualization
The roots of virtualization go back to the first virtualized IBM mainframes, designed in the
1960s, which allowed the company to run multiple applications and processes
simultaneously. In fact, the main drivers behind introducing virtualization were the high cost
of hardware and the need to run and isolate applications on the same hardware. During the
1970s, the adoption of virtualization technology increased sharply in many organizations
because of its cost effectiveness. However, in the 1980s and 1990s, hardware prices dropped
and multitasking operating systems emerged. With these developments, there was no longer a
need to ensure high CPU utilization, and therefore no need for virtualization technology. Yet,
in the late 1990s, virtualization was brought back to the market with the founding of VMware
Inc. by researchers from Stanford University. Nowadays, virtualization is widely used to
reduce management costs by replacing a bunch of under-utilized servers with a single server [29].
3.3. Benefits of Virtualization
There are several reasons that push organizations toward virtualization technology; some of
them are listed in [24, 29, 30] as follows:
Server Consolidation
It condenses multiple servers into one physical server hosting many virtual machines. This
allows the physical server to run at a high utilization rate, while reducing hardware
maintenance, power and cooling costs.
Application Consolidation
Legacy applications may conflict with newer hardware and/or operating systems. In this case,
virtualization can be used to provide the environment these applications require.
Sandboxing
Virtualization can provide a secure and isolated environment by running foreign or
less-trusted applications inside dedicated virtual machines.
Multiple Simultaneous OS
Virtualization makes it possible to run multiple operating systems simultaneously, each able
to run different types of applications.
Reducing Cost
Virtualization reduces cost deployment and configuration by ensuring less hardware, less
space and less staffing. Furthermore, virtualization reduces the cost of networking by
requiring less wirings, switches and hubs.
3.4. Virtualization Approaches
Virtualization can take different forms depending on the computer system component to
which it is applied [31]. In this section, we shed light on three well-known virtualization
techniques: full virtualization, paravirtualization, and hardware assisted virtualization.
3.4.1. Full Virtualization
In full virtualization, the guest OS is fully abstracted from the hardware by an added
virtualization layer: the VMM or hypervisor. In this case, the guest OS is not aware that it is
being virtualized, and it requires no modifications. This approach provides each VM with all
services of the physical system, including a virtual BIOS, virtual devices and virtualized
memory management. To manage the communication between the different layers, full
virtualization combines binary translation with direct execution (Figure 4). Binary translation
is used to convert privileged guest OS instructions into safe host instructions, while
application (user-level) instructions are executed directly on the processor to ensure high
performance [32]. Microsoft Virtual Server is an example of full virtualization; a toy sketch
of the binary-translation idea is given after Figure 4.
Figure 4: Full virtualization architecture [32]
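The following toy model, our own illustration with an invented instruction set, sketches the
binary-translation idea described above: a dispatcher scans the guest's instruction stream,
executes unprivileged instructions directly, and rewrites privileged ones into safe emulation
routines that touch only virtual CPU state.

    # Toy sketch of binary translation (illustrative; the instruction set is invented).

    def emulate_privileged(instr, cpu):
        # The VMM intercepts privileged instructions and emulates their effect
        # on virtual state instead of letting them touch real hardware.
        if instr == "CLI":            # guest tries to disable interrupts
            cpu["virtual_if"] = 0     # only the *virtual* interrupt flag changes
        elif instr == "HLT":          # guest tries to halt the CPU
            cpu["halted"] = True

    def run_guest(instructions):
        cpu = {"acc": 0, "virtual_if": 1, "halted": False}
        for instr in instructions:
            if cpu["halted"]:
                break
            if instr in ("CLI", "HLT"):        # privileged: translate and emulate
                emulate_privileged(instr, cpu)
            elif instr.startswith("ADD "):     # unprivileged: execute directly
                cpu["acc"] += int(instr.split()[1])
        return cpu

    print(run_guest(["ADD 5", "CLI", "ADD 2", "HLT"]))
    # -> {'acc': 7, 'virtual_if': 0, 'halted': True}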
3.4.2. Paravirtualization
The fundamental issue with full virtualization is the emulation of devices within the
hypervisor. This issue was addressed by the paravirtualization technique, which makes the
guest OS aware that it is being virtualized and gives it a more direct path to the underlying
hardware. In paravirtualization, the actual guest code is modified to use a different interface
that accesses the hardware directly or through virtual resources controlled by the hypervisor [32].
In more detail, paravirtualization changes the OS kernel to replace non-virtualizable
instructions with hypercalls that communicate directly with the hypervisor. Thus, when a
privileged operation is to be executed on the guest OS, it is delivered to the hypervisor
(instead of the OS) using a hypercall; the hypervisor receives this hypercall and accesses the
hardware to return the needed result (Figure 5). Xen is one of the systems that adopt
paravirtualization technology.
Figure 5: Paravirtualization architecture [32]
The downside of paravirtualization is that the guest must be modified to integrate hypervisor
awareness. This is a limitation, as some operating systems do not allow such modifications
(e.g., Windows 2000/XP), and even those that can be modified may need additional resources
for maintenance and troubleshooting [32].
3.4.3. Hardware Assisted Virtualization
Hardware assisted virtualization allows the VMM to run directly on the hardware. In this
case, the VMM controls the access of the guest OS to hardware resources. As depicted in
Figure 6, privileged and sensitive calls are trapped directly to the hypervisor, removing the
need for binary translation and paravirtualization. VMware ESX Server is one of the main
competing VMMs that use this approach [29].
Figure 6: Hardware assisted virtualization architecture [32]
3.5. Virtual Machine Manager
As defined before, the hypervisor or VMM is the layer between a host operating system and
the guest operating systems, or the layer between the hardware and the guest operating
systems. In [25], the author sets out three main properties that a VMM must maintain. The
first is that the VMM has to provide an environment essentially identical to the original
machine being virtualized. The second is that programs running on a VM should show the
same performance as on the original machine, or at most a minor decrease. Finally, the last
property states that the VMM must control all system resources provided to the VMs.
3.5.1. Hypervisor Types
Hypervisors are classified into Type 1 and Type 2. Type 1 hypervisors run directly on the
system hardware; they monitor the guest operating systems and allocate all needed resources,
including disk, memory, CPU and I/O peripherals. Having no intermediary between a Type 1
hypervisor and the physical layer leads to efficient performance in terms of hardware access
and security level (Figure 7-a). On the other hand, a Type 2 hypervisor runs on a host
operating system that provides virtualization services such as I/O and memory management
(Figure 7-b). Having an intermediary layer between the hypervisor and the hardware makes
the installation process easier than for a Type 1 hypervisor, since the operating system is in
charge of hardware configuration such as networking and storage [33].
Figure 7: a) Type 1 hypervisor b) Type 2 hypervisor [33]
The differences between Type 1 and Type 2 hypervisors can lead to different performance
results. The layer between the hardware and the hypervisor makes Type 2 performance less
efficient than Type 1. A sample scenario that illustrates this difference is when a virtual
machine requires hardware interaction (e.g., reading from disk); in this case, a Type 2
hypervisor needs first to pass the request to the operating system and then to the hardware
layer. Besides performance efficiency, the reliability of a Type 1 hypervisor is higher than
that of Type 2. For instance, a failure in the host operating system directly affects the hosted
guests under a Type 2 hypervisor; therefore, the availability of a Type 2 hypervisor is tightly
coupled to the operating system's availability. However, Type 2 hypervisors have some
advantages, such as fewer hardware/driver issues, since the host operating system is
responsible for interfacing with the hardware [34].
3.5.2. Examples of Hypervisors
a) Xen Hypervisor
The Xen hypervisor is a Type 1 (bare metal) hypervisor that is widely used for
paravirtualization [35]. It is managed by a specific privileged guest (privileged VM) called
Domain-0 (Dom0). Dom0 runs on the hypervisor and is responsible for managing all aspects
of the other, unprivileged virtual machines, known as DomainU (DomU). Furthermore, Dom0
has direct access to the physical resources, which is not the case for DomU guests [36]. The
overall architecture of the Xen hypervisor is shown in Figure 8.
Figure 8: Xen hypervisor architecture
Xen supports paravirtualization as well as full virtualization. In paravirtualization, DomU
guests are referred to as DomU PV Guests, and they can be modified Linux operating
systems, Solaris, FreeBSD, and other UNIX operating systems [37]. DomU PV Guests are
aware that they are running in a virtualized environment, and they do not have direct access
to the hardware resources. In this case, the guest operating system is modified to make special
calls (hypercalls) to the hypervisor for privileged operations, instead of the regular system
calls of a traditional unmodified operating system. In full virtualization, on the other hand,
DomU guests are referred to as DomU HVM Guests and run any standard, unchanged
operating system [37]. A DomU HVM guest is not aware that it is sharing processing time on
the hardware, nor is it aware of the presence of other virtual machines. In this case, DomU
HVM requires processors that specifically support hardware virtualization extensions (Intel
VT or AMD-V). These virtualization extensions allow many of the privileged kernel
instructions (which in PV were converted to hypercalls) to be handled by the hardware using
the trap-and-emulate technique.
b) KVM Hypervisor
The KVM hypervisor provides a full virtualization solution based on the Linux operating
system. It works by reusing the hardware assisted virtualization extensions that were already
developed; thus, KVM requires the presence of Intel VT or AMD-V extensions on the host
system. When the KVM module is loaded, it converts the Linux kernel into a bare metal
hypervisor. As a result, it takes full advantage of many components already present within the
kernel, such as memory management and scheduling [38]. KVM is implemented using two
main components: the first is the KVM loadable module which, when installed in the Linux
kernel, manages the virtualization hardware (Figure 9); the second provides PC platform
emulation and is offered by a modified version of QEMU. QEMU is executed as a user-space
process, coordinating with the kernel for guest operating system requests [39]. A short sketch
showing how such a KVM/QEMU stack is driven through the libvirt management API is
given after Figure 9.
Figure 9: KVM hypervisor architecture
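As a concrete illustration of how such a KVM/QEMU stack is driven in practice, the sketch
below uses the libvirt Python bindings, a management API commonly layered on top of
KVM. It assumes a host with the libvirt daemon and its Python bindings installed; it is our
example, not part of the thesis setup.

    # Enumerate KVM guests through libvirt (illustrative sketch).
    import libvirt

    # Connect to the local KVM/QEMU hypervisor exposed by the libvirt daemon.
    conn = libvirt.open("qemu:///system")
    try:
        print("Host:", conn.getHostname())
        for dom in conn.listAllDomains():
            state = "running" if dom.isActive() else "shut off"
            print(f"guest {dom.name()}: {state}")
    finally:
        conn.close()

Notably, OpenStack's compute service (Nova) drives KVM through this same libvirt
interface, which is why the hypervisor choice is largely transparent to the cloud layer.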
c) VMware ESXi Hypervisor
VMware was an early leader in virtualization technology. One of its virtualization products is
VMware ESXi, which is installed directly on top of the physical machine [40]. VMware ESXi
was introduced in 2007 to provide high levels of reliability and performance to companies of
all sizes. The overall architecture of VMware ESXi is illustrated in Figure 10. The main
component is the vmkernel, which contains all the processes necessary to manage VMs. It
provides functionality similar to that found in other operating systems, such as process
creation and control, signals, a file system, and process threads. The vmkernel thus supports
running multiple virtual machines and provides core functionality such as resource
scheduling, I/O stacks and device drivers [24].
Figure 10: VMware ESXi architecture [40]
Chapter 4: Big Data and High Performance
Computing as a Service
As big companies such as Google, Amazon, Facebook, LinkedIn and Twitter grow in terms of
users and data generated, the capacity and computing power of current data tools become
insufficient for processing, analyzing, managing and storing that data efficiently. IBM
estimates that every day 2.5 quintillion bytes of data are created, and that 90% of the data in
the world today has been created in the last two years [41]. Besides, Oracle estimated that 2.5
zettabytes of data were generated in 2012, and that this amount will grow significantly every
year (Figure 11) [42]. The increase in data size to many terabytes and petabytes is known as
Big Data. To handle the complexity of Big Data, HPC is adopted to provide high computation
capabilities, high bandwidth, and low latency networks. This chapter provides an overview of
the Big Data phenomenon and the HPCaaS concept.
Figure 11: Data growth between 2008 and 2020 [54]
4.1. Big Data
4.1.1. Big Data Definition
Big Data is defined as large and complex datasets that are generated from different sources
including social media, online transactions, sensors, smart meters and administrative services
[43]. Given all these sources, the size of Big Data goes beyond the ability of typical tools to
store, analyze and process data. The literature on Big Data divides the concept into four
dimensions: Volume, Velocity, Variety and Value [43].
Volume: the size of data generated is very large, ranging from terabytes to petabytes.
Velocity: data grows continuously at an exponential rate.
Variety: data are generated in different forms: structured, semi-structured and unstructured data. These forms require new techniques that can handle data heterogeneity.
Value: the challenge in Big Data is to identify what is valuable, so as to be able to capture, transform and extract data for analysis.
4.1.2. Big Data Technologies
With the Big Data phenomenon, there is an increasing demand for new technologies that can support the volume, velocity, variety and value of data. Among these new technologies are NoSQL databases, parallel and distributed paradigms, and new cloud computing trends that can support the four dimensions of Big Data.
NoSQL (Not Only Structured Query Language) marks the transition from relational to non-relational databases [44]. It is characterized by the ability to scale horizontally, the ability to replicate and partition data over many servers, and the ability to provide high-performance operations. However, moving from relational to NoSQL systems has eliminated some of the ACID transactional properties (Atomicity, Consistency, Isolation and Durability) [45]. In this context, NoSQL properties are framed by the CAP theorem [46], which states that developers must make trade-offs among consistency, availability and partition tolerance. Some examples of NoSQL tools are Cassandra [47], HBase [48], MongoDB [49] and CouchDB [50].
Other supporting technologies for Big Data are parallel and distributed paradigms (e.g. Hadoop) and cloud computing services (e.g. OpenStack). These technologies are discussed in the upcoming chapters (Part III, Chapters 7 and 8).
4.2. High Performance Computing as a Service (HPCaaS)
4.2.1. HPCaaS Overview
High Performance Computing (HPC) is used to process and analyze large and complex problems, including scientific, engineering and business problems that require high computation capabilities, high bandwidth, and a low-latency network [3]. HPC meets these requirements by implementing large physical clusters. However, traditional HPC faces a set of challenges that consist in peak demand, high capital outlay, and the high expertise needed to acquire and operate the HPC infrastructure [51]. To deal with these issues, HPC experts have leveraged the benefits of new technology trends, including cloud technologies, parallel processing paradigms and large storage infrastructures. Merging HPC with these new technologies has given rise to a new HPC model, called HPC as a Service (HPCaaS).
HPCaaS is an emerging computing model in which end users have on-demand access to pre-existing needed technologies that provide a high-performance and scalable HPC computing environment [52]. HPCaaS provides substantial benefits because of the better quality of service provided by cloud technologies, and the better parallel processing and storage provided by, for example, the Hadoop Distributed File System and the MapReduce paradigm. Some HPCaaS benefits are stated in [51] as follows:
High Scalability: resources scale up to provide the essential resources that fit users' demand for processing large and complex datasets.
Low Cost: end users can eliminate the initial capital outlay, time and complexity needed to procure HPC.
Low Latency: by implementing the placement group concept, which ensures the execution and processing of data in the same rack or on the same server.
4.2.2. HPCaaS Providers
There are many HPCaaS providers in the market. One example is Penguin Computing [53], which has been a leader in designing and implementing high performance environments for over a decade. Nowadays, it provides HPCaaS with different options: on-demand services, private HPCaaS services and hybrid HPCaaS services. Amazon Web Services (AWS) [3] is also an active HPCaaS provider in the market; it provides simplified tools to perform HPC over the cloud. AWS allows end users to benefit from HPCaaS features with different pricing models: On-Demand, Reserved [54] or Spot Instances [55]. HPCaaS on AWS is currently used for computer-aided engineering, molecular modeling, genome analysis, and numerical modeling across many industries, including oil and gas, financial services and manufacturing [3]. Other HPCaaS leaders in the market are Microsoft (Windows Azure HPC) [56] and Google (Google Compute Engine) [57].
Chapter 5: Literature Review and Research Contribution
In order to bridge the gap between the present research and previous studies, a review was conducted on the current state of HPC and virtualization. This chapter thus situates the research in relation to previous research publications and clearly states the research contribution.
5.1. Related Work
There have been several studies that evaluated the performance of high performance computing in the cloud. Most of these studies used Amazon EC2 [20] as the cloud environment [58-63]. Besides, only a few studies have evaluated the performance of high performance computing using the combination of both new emerging distributed paradigms and a cloud environment [64].
In [58], the authors evaluated HPC on three different cloud providers: Amazon EC2, GoGrid Cloud and IBM Cloud. For each cloud platform, they ran HPC on Linux virtual machines (VMs), and they came to the conclusion that the tested public clouds do not seem to be optimized for running HPC applications. This was explained by the fact that public cloud platforms have slow network connections between virtual machines. Furthermore, the authors in [13] evaluated the performance of HPC applications in today's cloud environments (Amazon EC2) to understand the tradeoffs in migrating to the cloud. Overall results indicated that running HPC on the EC2 cloud platform limits performance and causes significant variability. Besides Amazon EC2, research done in [63] evaluated the performance-cost tradeoffs of running HPC applications on three different platforms. The first and second platforms are physical clusters (the Taub and Open Cirrus clusters), and the third platform is a Eucalyptus cloud. Running HPC on these platforms led the authors to conclude that the cloud is more cost-effective for applications with low communication intensity.
In order to understand the performance implications of running HPC on virtualized resources and distributed paradigms, the authors in [64] performed an extensive analysis using Eucalyptus (16 nodes) and other technologies such as Hadoop [7], Dryad and DryadLINQ [65], and MapReduce [6]. The conclusion of this research suggested that most parallel applications can be handled in a fairly easy manner when using cloud technologies (Hadoop, MapReduce, and Dryad); however, scientific applications, which require complex communication patterns, still require more efficient runtime support.
HPC has also been evaluated independently of new cloud technologies, using different virtualization technologies [66, 67, 68, 69]. In [66], the authors performed an analysis of virtualization techniques including VMware, Xen, and OpenVZ. Their findings showed that none of the techniques matches the performance of the base system perfectly; yet, OpenVZ demonstrates high performance in both file system performance and industry-standard benchmarks. In [67], the authors compared the performance of KVM and VMware. Overall findings showed that VMware performs better than KVM; still, in a few cases KVM gave better results than VMware. In [68], the authors conducted a quantitative analysis of two leading open-source hypervisors, Xen and KVM. Their study evaluated the performance isolation, overall performance and scalability of virtual machines for each virtualization technology. In short, their findings showed that KVM has substantial problems with guests crashing (as the number of guests increases); however, KVM still has better performance isolation than Xen. Finally, in [69] the authors extensively compared four hypervisors: Hyper-V, KVM, VMware, and Xen. Their results demonstrated that there is no perfect hypervisor.
5.2.Contribution
So far, there are only a few studies that have compared different virtualization techniques and their impact on HPC in the cloud. The only study we found was done in [70], where the authors compared the performance of adopting Xen, KVM and VirtualBox. Each virtualization technology was compared with bare metal using a set of high performance benchmarking tools. The results of this research demonstrated that KVM is the best choice for HPC in the cloud because of its rich features and near-native performance.
The present research fills this literature gap by examining the impact of virtualization techniques on HPCaaS using OpenStack as the cloud platform and Hadoop as the distributed and parallel system.
Part III: Technology Enablers
This part explains the use of OpenStack and Hadoop as the underlying technologies for this research. It starts by providing a qualitative study for selecting an appropriate cloud platform and distributed system; the second chapter of this part introduces the OpenStack components in detail, and the third chapter presents Hadoop and its main aspects.
Chapter 6: Technology Enablers Selection
The architecture we adopted to evaluate the impact of virtualization on HPCaaS was built after conducting a qualitative study of the tools available in the market. We mainly targeted open-source tools to select an appropriate cloud computing platform and distributed system. Hence, this chapter presents the analysis we followed in selecting the cloud platform and the distributed system.
6.1.Cloud Platform Selection
To compare the available open-source cloud platforms, we focused on the most popular ones. The selection of competing platforms was based on a study that compared the popularity of OpenStack, OpenNebula, Eucalyptus and CloudStack in 2013 [71]. As depicted in Figure 12, the study showed that OpenStack has the largest total population index, followed by Eucalyptus, CloudStack, and OpenNebula.
Figure 12: Active cloud community population [71]
Based on Figure 12, we selected OpenStack, OpenNebula and Eucalyptus for comparison and study. To adopt one of these open-source clouds, we relied on other studies that compare their performance and quality [72-75].
In [72], the authors compared some open-source and commercial cloud platforms. Concerning open platforms, they compared OpenNebula and Eucalyptus. To perform the comparison, they adopted a set of criteria, including storage, virtualization, network, management, security and vendor support. The results of the research showed that open-source and commercial solutions can have comparable features, and that OpenNebula is the most feature-complete cloud platform when compared with Eucalyptus.
[73] and [74] provide comparison studies of OpenStack and OpenNebula. In [73], the authors compared the performance of both cloud platforms by measuring the time when the cloud starts instantiating VMs and the time when the VMs are ready to accept SSH connections. The findings of the research demonstrate that OpenStack is slightly better than OpenNebula due to its smaller instantiation time. Moreover, the results showed that OpenStack is more suitable for high performance computing due to its faster instantiation of a large number of VMs. In [74], the authors used qualitative and quantitative analyses to compare OpenStack and OpenNebula. For the qualitative analysis, they adopted criteria such as security, supported virtualization, access, image support, resource selection, storage support, high-availability support and API support. Based on the results of the qualitative study, the authors concluded that OpenStack is preferable when auto-scaling is needed, while OpenNebula is preferable when persistent storage support is needed. For the quantitative analysis, the authors measured the deployment time, network overhead and clean-up time of VMs. The results of the quantitative analysis showed that either platform can be used, depending on user requirements and specifications.
In [75], the authors provided a comparative study of four solutions: Eucalyptus, OpenStack, OpenNebula and CloudStack. To perform the comparison, the authors adopted the following criteria: storage, network, security, hypervisor, scalability and code openness of the installation. In short, the results of this study [75] showed that OpenStack is the preferred open-source cloud. Table 2 summarizes the preferred cloud IaaS in [72-75]. Based on this table, we decided to go for OpenStack, as it is known for its flexibility and total openness.
Table 2 : Cloud IaaS selection
6.2.Distributed and Parallel System Selection
To compare the distributed and parallel systems available in the market, we again opted for the popularity index of those systems. The selection of competing systems was based on a study done in [76]. The study is summarized in Figure 13, which compares the popularity index of Hadoop, MongoDB, Cassandra, CouchDB, Redis, VoltDB, Neo4j, Riak and Infinispan. The study was done in 2012, and it reports the total downloads between January 2011 and March 2012. Figure 13 shows that Hadoop is the most popular distributed system, followed by MongoDB and Cassandra.
Figure 13: Active distributed systems population [76]
Based on Figure 13, we performed a qualitative analysis of both Hadoop and MongoDB in
order to end up with one selected system for the present research.
MongoDB is a document-oriented database that stores data as collections of JSON-like documents, using a binary form of JSON called BSON (Binary JSON), rather than in tables with columns and rows. To provide high redundancy and make data highly available, MongoDB offers replication across multiple servers. While data is synchronized between servers using replication, MongoDB also facilitates scaling out by supporting sharding, which partitions a collection and stores the different portions on different machines. MongoDB can be combined with MapReduce so as to process data in parallel at each shard [62]. On the other hand, Hadoop is an open-source distributed system that supports processing, analyzing and storing large data sets across large clusters using the MapReduce paradigm and HDFS [7]. More details about Hadoop are included in Chapter 8.
A study done in [77] compares the MongoDB and Hadoop systems. The study came to three main conclusions: first, it is not appropriate to use MongoDB as an analytics platform; second, using Hadoop for MapReduce jobs is several times faster than using the built-in MongoDB MapReduce capability; and third, MongoDB is much slower than HDFS. Besides, a study done in [78] compared the MapReduce performance of Hadoop and MongoDB. In short, the study showed that MongoDB is roughly four times slower than Hadoop in fully-distributed mode.
Table 3 summarizes the distributed systems selected in [77] and [78]. Based on this table, we decided to go for Hadoop as the analytics and storage tool for the present research.
Table 3 : Parallel and distributed platform selection
Chapter 7: OpenStack
OpenStack is an open-source platform for public and private cloud computing that aims at ensuring scalability and flexibility. It was developed by a wide range of developers and contributors using mainly Python (68%), XML (16%) and JavaScript (5%) [79]. This chapter provides a detailed description of OpenStack, including a brief history, its components, the corresponding architecture, and finally some supported hypervisors.
7.1.OpenStack Overview
The formal definition of OpenStack was stated in [80], which defines OpenStack as "a cloud operating system that controls large pools of compute, storage, and networking resources throughout a datacenter, all managed through a dashboard that gives administrators control while empowering their users to provision resources through a web interface". From this definition, OpenStack is considered an Infrastructure as a Service (IaaS).
An important feature of OpenStack is that it provides a web interface, called the dashboard, and APIs that make its services available through Amazon EC2- and S3-compatible APIs. This feature ensures that all existing tools that work with Amazon's cloud platform can also work with the OpenStack platform [81].
7.2.OpenStack History
OpenStack began as a collaboration project between Rackspace Hosting and NASA. Both organizations had planned to release internal cloud projects for object storage and compute. Rackspace contributed its Cloud Files platform to support the storage part of OpenStack, while NASA contributed its Nebula platform to support the compute part [82]. In July 2010, both organizations released the first version of OpenStack under the Apache 2.0 License. In September 2012, the OpenStack Foundation was established as an independent entity with the mission of protecting, empowering, and promoting the OpenStack software. The OpenStack project is currently supported by more than 150 companies, including AMD, Intel, Canonical, Red Hat, Cisco, Dell, HP, IBM and Yahoo! [83].
7.3.OpenStack Releases
OpenStack issues successive releases with new improvements and contributions. All OpenStack releases since 2010 are listed in Table 4 [79].
Table 4 : OpenStack releases [79]
7.4.OpenStack Components
The core components of the OpenStack software are the OpenStack Compute Infrastructure (Nova), the OpenStack Object Storage Infrastructure (Swift) and the OpenStack Image Service Infrastructure (Glance). Besides these components, OpenStack includes the Identity Service (Keystone), Network Service (Quantum), Dashboard Service (Horizon) and Block Storage (Cinder). Table 5 summarizes the main components of OpenStack and the corresponding code names.
Table 5 : OpenStack projects
Taking into consideration the previously mentioned OpenStack components, a conceptual architecture of OpenStack is provided in Figure 14, which shows how the OpenStack components are interconnected [79].
Figure 14: OpenStack conceptual architecture [79]
7.4.1. OpenStack Compute (Nova)
Nova provides flexible management for virtual machines by allowing users to create, update,
and terminate virtual machines. The overall architecture of Nova (Figure 15) is composed of
the following sub-components: nova-api, nova-scheduler, nova-compute, nova-volume, queue
and database [82].
Figure 15: Nova subcomponents
Nova-api is responsible for accepting and fulfilling API requests. A request consists of actions to be performed by the nova subcomponents. Nova-api provides an endpoint for all API queries and enforces some policies. If the request concerns managing virtual machines, nova-compute is put in charge of creating or terminating virtual machine instances. Normally, nova-compute receives requests from the queue subcomponent. In order to manage virtual machine instances, nova-compute uses different drivers, such as the libvirt software package, the Xen API and the vSphere API, to support the various virtualization technologies. To specify where to send a request, nova-scheduler retrieves the request from the queue and determines which compute server host it should run on. When persistent storage is needed, nova-volume handles the creation, attachment, and detachment of persistent volumes to virtual machine instances [82].
Nova also provides network management through its subcomponent nova-network. The latter accepts networking tasks from the queue and then performs system commands to manipulate the network. Nova-network defines two types of IP addresses: fixed IPs and floating IPs. A fixed IP is a private IP that is assigned to an instance for its life cycle. A floating IP, on the other hand, is a public IP that is used for external connectivity. The network itself, as defined in nova-network, can be classified into three categories: Flat, FlatDHCP and VLAN [82].
Flat assigns a fixed IP address to an instance and attaches that IP to a common bridge (created by the administrator).
FlatDHCP builds upon the Flat manager by providing DHCP services to handle instance addressing and the creation of bridges.
VLAN provides a subnet and a separate bridge for each project. The range of IPs of a given project is only accessible within the VLAN.
The last subcomponents of Nova are the queue and the database. The queue is responsible for passing messages between the nova subcomponents to facilitate communication between them; it is implemented using RabbitMQ. The nova database stores most of the configuration and run-time state of the cloud infrastructure; it contains a set of tables covering instance types, instances in use, available networks, fixed IPs, projects and virtual interfaces [82].
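To illustrate this queue-based message passing, the sketch below publishes a request to a RabbitMQ queue in the same spirit as Nova's internals. It assumes a local RabbitMQ broker and the pika client library; the queue name and message format are illustrative and do not reproduce Nova's actual RPC protocol:

    import json
    import pika  # RabbitMQ client library

    # Connect to the broker (Nova uses RabbitMQ in this same role).
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="compute")  # illustrative queue name

    # A scheduler-like component enqueues a "run instance" request, which a
    # worker such as nova-compute would later consume from the same queue.
    request = {"method": "run_instance", "args": {"instance_id": "vm-001"}}
    channel.basic_publish(exchange="", routing_key="compute", body=json.dumps(request))
    connection.close()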
7.4.2. OpenStack Image Service (Glance)
Glance manages virtual disk images. It consists of three main subcomponents: glance-api, glance-registry and the glance database (Figure 16). Glance-api accepts incoming API requests and then communicates them to the other components (glance-registry and the image store); the glance database stores all information about images, and glance-registry is responsible for retrieving and storing image metadata [82].
Figure 16: Glance subcomponents
7.4.3. OpenStack Identity Service (Keystone)
Keystone authorizes users' access to the OpenStack components. It supports multiple forms of authentication, including standard username/password credentials and token-based systems. The Keystone architecture comprises the following subcomponents (Figure 17): the token backend, catalog backend, policy backend and identity backend [82].
Figure 17: Keystone subcomponents
7.4.4. OpenStack Object Store (Swift)
Swift is the oldest project within OpenStack, and it is the underlying technology that powers Rackspace's Cloud Files service [82]. Swift provides a massively scalable and redundant object store by writing multiple copies of each object to multiple, separate storage servers so as to handle failures efficiently. The Swift component consists of the Proxy Server, Account Server, Container Server, and Object Server (Figure 18).
Figure 18: Swift subcomponents
Swift-proxy accepts incoming requests, such as uploading files, modifying metadata and creating containers. Requests are served by the account server, the container server or the object server: object servers manage pre-existing objects or files in the storage; the account server manages the accounts defined in the object storage service, and the container server manages the mapping of containers (folders) within the object store service [82].
7.4.5. OpenStack Block Storage Service (Cinder)
Cinder allows block devices to be connected to virtual machine instances for better performance. It consists of the following subcomponents: cinder-api, cinder-volume, the cinder database and cinder-scheduler (Figure 19).
Cinder-api accepts incoming requests and directs them to cinder-volume, which performs reads and writes to the cinder database to maintain state and interacts with other processes. Cinder-scheduler is responsible for selecting the optimal block storage node on which to create a volume. In order to maintain communication between the Cinder components, a message queue is used.
Figure 19: Cinder subcomponents
7.4.6. OpenStack Network Service (Quantum)
Quantum allows users to create their own networks and then attach interfaces to them. It consists of quantum-server, quantum-agent, quantum-plugin and the quantum database (Figure 20). Quantum-server accepts incoming API requests and then directs them to the correct quantum-plugin. Plugins and agents perform specific actions such as plugging and unplugging ports, creating networks and subnets, and IP addressing. Finally, the quantum database stores the networking state for particular plugins.
Figure 20: Quantum subcomponents
7.5.OpenStack Supported Hypervisors
The abstraction provided by OpenStack Compute allows it to support various existing hypervisors, including KVM, LXC, QEMU, UML, VMware ESX/ESXi, Xen, PowerVM and Hyper-V [79]. However, KVM is still the most widely used hypervisor in OpenStack deployments. Besides KVM, other existing deployments run Xen, LXC, VMware and Hyper-V, but each of these hypervisors either lacks support for some features or is not well documented for use with OpenStack.
Chapter 8: Hadoop
Hadoop has been adopted by big players in the market such as Google, Yahoo!, LinkedIn, Facebook, The New York Times and IBM [84]. This chapter provides a detailed overview of Hadoop, starting with a brief history of this open-source project, followed by the corresponding architecture, implementation and some related features.
8.1.Hadoop Overview
Hadoop is an open-source Apache project, written in Java, that supports processing, analyzing and storing large data sets across large clusters using the MapReduce paradigm and HDFS [85]. Hadoop has been designed to be reliable, fault-tolerant and scalable, able to scale from a single machine to thousands of machines.
8.2.Hadoop History
Hadoop traces back to 2002, when Doug Cutting created an open-source project for web crawling and indexing, first named Nutch. Nutch was developed to handle search problems, but it faced a scalability problem, as it would not scale up to billions of web pages. To deal with this issue, the Nutch team took inspiration from Google's distributed filesystem (GFS). By adopting the GFS architecture, the Nutch team delivered in 2004 an open-source filesystem called the Nutch Distributed Filesystem (NDFS) [86].
When Google published its paper on the MapReduce algorithm, the Nutch team took advantage of that work by introducing MapReduce into its NDFS system. Implementing both NDFS and MapReduce made Nutch a powerful system for web crawling and indexing. This success pushed the Nutch team to spin off an independent project in 2006, named the Hadoop project. By this time, Doug Cutting had joined Yahoo!, which provided enough resources to improve Hadoop's performance. Even though Yahoo! developed and contributed about 80% of the Hadoop project, Hadoop was made its own top-level project at Apache in January 2008 [87]. Besides implementing MapReduce and HDFS, the Hadoop project includes other subprojects, which are listed in Table 6 [85].
Table 6: Apache Hadoop subprojects
The Hadoop subprojects are grouped together under the name Hadoop Ecosystem. The overall picture of the Hadoop Ecosystem is illustrated in Figure 21.
Figure 21: Apache Hadoop subprojects [85]
8.3.Hadoop Architecture
Hadoop implements a master/slave architecture, where the master is named the NameNode and the slaves are named DataNodes. The NameNode manages the file system namespace, which consists of a hierarchy of files and directories used for data storage. When a file is created by a client application, it is divided into blocks; each block is replicated and stored on DataNodes. The information about the replication factor (the number of block copies) and the mapping of blocks to replicas is stored in the NameNode. On the other hand, each DataNode is in charge of managing the storage attached to the node on which it runs. Furthermore, each DataNode handles the read and write operations, as well as the block creation, deletion, and replication instructions that come from the NameNode [86].
Besides the NameNode and DataNodes, a Hadoop cluster consists of a Secondary NameNode (a backup node for the NameNode), a JobTracker and TaskTrackers. The JobTracker is located on the master node, and it is responsible for distributing MapReduce tasks to the other nodes in the cluster. Each TaskTracker locally runs the tasks distributed by the JobTracker; each slave in the cluster contains one TaskTracker, and a TaskTracker can also run on the master node [86].
The overall architecture of Hadoop is illustrated in Figure 22.
Figure 22: Hadoop Architecture
8.4.Hadoop Implementation
Hadoop is mainly implemented using HDFS and the MapReduce paradigm. HDFS is used to store large data sets, while MapReduce is used to analyze and process data across the Hadoop cluster. With respect to the architecture provided in Figure 22, the HDFS concept is represented by the NameNode, Secondary NameNode and DataNodes, while MapReduce is represented by the JobTracker and TaskTracker (Figure 23).
Figure 23: HDFS and MapReduce representation
8.4.1. HDFS Overview
HDFS is designed as a hierarchy of files and directories. Each file is divided into blocks that are stored on different DataNodes. The NameNode stores only the metadata, which includes information about the blocks' locations and the number of copies of each block. Furthermore, HDFS allows the NameNode to perform namespace operations such as opening, closing and renaming files and directories. As stated before, HDFS performs data replication to ensure fault tolerance. The replication factor is set when a file is created, and it can be modified later [85].
The read, write and creation operations illustrate the HDFS process. During the read operation, the HDFS client requests from the NameNode the list of DataNodes that host replicas of the blocks of a given file. The list is sorted by network topology distance from the client. After deciding on the DataNode from which to fetch data, the HDFS client contacts that DataNode directly and requests the desired block. During the write operation, on the other hand, the HDFS client asks the NameNode to choose the DataNodes that will store replicas of the first block of the file, then of the second block, and so on. For each block, the client organizes a node-to-node pipeline and sends the data. When the first block is filled, the client requests new DataNodes to be chosen to host replicas of the next block. Concerning the creation operation, when there is a request to create a file, the HDFS client first caches the file data in a temporary local file. When the latter accumulates data up to the HDFS block size, the client contacts the NameNode to insert the file name into the file system namespace and to allocate a data block for it. The NameNode then selects the DataNodes that will host the data blocks. At this stage, the client moves the block of data from the local temporary file to the specified DataNode [85].
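The read path can be condensed into a few lines. This is only an illustrative sketch: the namenode and DataNode objects below are hypothetical stubs standing in for Hadoop's actual RPC interfaces, not a real client API:

    def hdfs_read(namenode, path, client_location):
        # Illustrative HDFS read path (hypothetical client/NameNode stubs).
        data = []
        # 1. Ask the NameNode which DataNodes host replicas of each block.
        for block in namenode.get_block_locations(path):
            # 2. The list comes back sorted by network-topology distance
            #    from the client; pick the closest replica.
            replicas = sorted(block.replicas,
                              key=lambda dn: dn.distance_to(client_location))
            closest = replicas[0]
            # 3. Contact that DataNode directly and fetch the block data.
            data.append(closest.read_block(block.block_id))
        return b"".join(data)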
8.4.2. MapReduce Overview
Hadoop MapReduce is a programming paradigm that processes very large data sets in a parallel manner on large clusters. It was first introduced by Google in 2004 [6]. The core idea of MapReduce is splitting the input data set into chunks that are processed by map tasks in parallel. The output of each map task is sorted and then directed as input to the reduce tasks. From this definition, MapReduce can be divided into two steps: the map step and the reduce step [88].
The map task is itself divided into five phases: read, map, collect, spill and merge. The read phase consists of reading a data chunk from HDFS and creating the input key-value pairs. The map phase executes the user-defined map function to generate the map-output data. The collect phase gathers the intermediate (map-output) data into a buffer before spilling. The spill phase sorts the data, performs compression if specified, and writes it to local disk to create file spills. The last phase of the map task is the merge phase, which merges all file spills into one single map output file [88].
The reduce task is likewise divided into four phases: shuffle, merge, reduce and write. The shuffle phase transfers the intermediate data (the map output) from the mapper slaves to a reducer's node, decompressing it if needed. The merge phase merges the sorted outputs that come from the different mappers, to be directed as input to the reduce phase. The reduce phase executes the user-defined reduce function to produce the final output data. Finally, the write phase compresses the final output, if needed, and writes it to HDFS [88].
A popular example that illustrates MapReduce execution is the word count example, which counts the number of occurrences of each individual word in a given file (Figure 24) [89].
Figure 24: Word count MapReduce example [89]
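To make the example concrete, below is a minimal word count written for Hadoop Streaming in Python; the canonical version in [89] is a Java MapReduce job, so this streaming variant is only an equivalent sketch. The mapper emits a (word, 1) pair per word, and the reducer sums the counts for each word, relying on the framework to sort the map output by key:

    #!/usr/bin/env python
    # mapper.py -- emits a (word, 1) pair for every word read from stdin
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # reducer.py -- sums the counts per word; input arrives sorted by key
    import sys

    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            count += int(value)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, count))
            current_word, count = word, int(value)
    if current_word is not None:
        print("%s\t%d" % (current_word, count))

The two scripts can then be submitted with the streaming jar shipped with Hadoop 1.2.1 (the jar path is its usual location, and the HDFS paths are illustrative): hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.2.1.jar -file mapper.py -mapper mapper.py -file reducer.py -reducer reducer.py -input /wordcount/input -output /wordcount/output.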
8.5.Hadoop Cluster Connectivity
When a Hadoop cluster starts up, each DataNode performs a handshake with the NameNode. The purpose of this operation is to verify the namespace ID and the software version of the DataNode. The namespace ID is assigned to the filesystem instance when it is formatted, and it is stored on all nodes of the cluster. Nodes with a different namespace ID will not be able to join the cluster. If the namespace ID matches, the handshake between the DataNode and the NameNode completes successfully. At this point, each DataNode registers with its unique storage ID, which is an internal identifier of the DataNode. The main purpose of this ID is to keep the DataNode recognizable even if it is restarted with a different IP address or port [87].
During normal operation, DataNodes send heartbeats to the NameNode to confirm that the DataNode is operating and that the block replicas it hosts are available. The default heartbeat interval is three seconds. If the NameNode does not receive a heartbeat from a DataNode within ten minutes, the NameNode considers the DataNode dead and creates new replicas, on other DataNodes, of the blocks that were hosted on the dead node. In fact, heartbeats are not only used for ensuring NameNode-DataNode connectivity; they are also used to send statistical information such as the total storage capacity and the fraction of storage in use. Another role of heartbeats is to carry instructions from the NameNode back to the DataNodes. Those instructions include commands to replicate blocks to other DataNodes, remove local block replicas, re-register and send an immediate block report, or shut down the node. These commands are important for maintaining the overall system integrity, and therefore it is critical to keep heartbeats frequent even on big clusters. The NameNode can process thousands of heartbeats per second without affecting its other operations [87].
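The NameNode's bookkeeping around heartbeats can be sketched as follows; this is a simplified illustration of the logic described above, and the class and method names are ours rather than Hadoop's:

    import time

    HEARTBEAT_INTERVAL = 3        # seconds (HDFS default)
    DEAD_NODE_TIMEOUT = 10 * 60   # ten minutes without a heartbeat

    class NameNodeMonitor:
        # Illustrative sketch of the NameNode's heartbeat bookkeeping.
        def __init__(self):
            self.last_heartbeat = {}  # storage ID -> time of last heartbeat

        def on_heartbeat(self, storage_id, stats):
            # Heartbeats confirm liveness and carry statistics such as the
            # total capacity and the fraction of storage in use.
            self.last_heartbeat[storage_id] = time.time()
            # The reply piggybacks instructions (replicate, remove replicas,
            # re-register, send a block report, shut down).
            return self.pending_commands(storage_id)

        def dead_nodes(self):
            # A DataNode silent for more than ten minutes is considered dead;
            # its blocks must then be re-replicated on other DataNodes.
            now = time.time()
            return [sid for sid, ts in self.last_heartbeat.items()
                    if now - ts > DEAD_NODE_TIMEOUT]

        def pending_commands(self, storage_id):
            return []  # placeholder in this sketch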
Part IV: Research Contribution
To clarify the steps we followed in this study, we divided this part into four chapters: 9, 10, 11 and 12. Chapter 9 defines the research methodology; chapter 10 describes the experimental setup used to measure the performance of HPCaaS; chapter 11 presents the results we got from each experiment, and finally, chapter 12 discusses and analyzes the research findings.
Chapter 9: Research Methodology
The choice of research methodology depends mainly on the nature of the research question. This chapter discusses the methodology followed in conducting the present study: it first explains the choice of the selected methodology, and then gives an overall picture of the research steps.
9.1.Research Approach
The present research was based on a combination of qualitative and quantitative approaches. The qualitative approach was followed to compare and select the appropriate technology enablers for this research (Part III, Chapter 6), whereas the quantitative approach was adopted to provide numeric measurements of HPC on the physical cluster and the virtualized clusters (Part IV, Chapters 10, 11 and 12).
9.2.Research Steps
Figure 25 summarizes the steps followed in conducting the present research.
Figure 25 : Research steps
Chapter 10: Experimental Setup
In order to investigate the research question, we have conducted three main experiments. The
first experiment evaluates the performance of HPC on Hadoop Physical Cluster (HPhC); the
second experiment evaluates the performance of HPC using Hadoop Virtualized Cluster
(HVC) with KVM, and the last experiment evaluates HPC using Hadoop virtualized cluster
with VMware ESXi virtualization technology.
This chapter describes the experimental setup used in this research: it provides an overall picture of the three adopted clusters; specifies the hardware, software and network configurations; introduces the benchmarks used to evaluate the performance of HPC on each cluster; lists the dataset sizes used in each experiment; and finally explains the experimental execution of the present research.
10.1.Experimental Hardware
In our performance study, we built three different clusters: a Hadoop Physical Cluster, a Hadoop Virtualized Cluster using KVM and a Hadoop Virtualized Cluster using VMware ESXi. Each cluster is composed of eight machines.
For the physical cluster, we used 8 Dell OptiPlex 755 desktop computers with the specifications listed in Table 7. For both Hadoop virtualized clusters (KVM and VMware ESXi), we used a Dell PowerEdge server with the features listed in Table 8. On top of the server, we installed OpenStack to create eight virtual machines, first using the KVM hypervisor and then the VMware ESXi hypervisor. Because of OpenStack's limited flexibility, we could only create VMs with the features described in Table 9.
Table 7 : Dell OptiPlex 755 computer features (used for Hadoop physical cluster)
Table 8 : Dell PowerEdge server used for building OpenStack & Hadoop virtualized cluster
Table 9 : OpenStack virtual machines’ features
10.2.Experimental Software and Network
As stated in chapter 6, we opted for Hadoop to process and store small and large datasets; we chose to install Hadoop version 1.2.1. Concerning OpenStack, the adopted version is the Folsom release, which supports KVM, Xen, VMware and other hypervisors. The network configuration provided a bandwidth of 100 Mbps per port.
10.3.Clusters Architecture
In this section, we will conceptualize each individual cluster in terms of its layers and
components.
10.3.1. Hadoop Physical Cluster
Figures 26 and 27 give an overall picture of the Hadoop Physical Cluster. The configuration was done in the Linux Lab at AUI. The lab is connected to a 1 Gbps switch (providing 100 Mbps per port) that also serves other offices in the building where the lab is located. As both figures depict, the cluster contains eight machines, where one machine was selected to act as both the master and a slave node at the same time. The reason for having the master node also serve as a slave is to increase the cluster performance when processing and storing datasets.
Figure 26 : Hadoop Physical Cluster
Figure 27: Hadoop Physical Cluster architecture
10.3.2. Hadoop Virtualized Cluster – KVM
The second cluster we built in this research is the Hadoop Virtualized Cluster with KVM technology. As Figure 28 shows, the first step in configuring the cluster is to install an operating system on the Dell PowerEdge server; the selected OS is Ubuntu 12.04 LTS (Precise), 64-bit. The next step is to install and configure the KVM packages, which are loaded into the Linux OS as the KVM driver. After preparing the system with the OS and the KVM hypervisor, the next step is to install OpenStack on top of the OS (the OpenStack-with-KVM documentation is provided in Appendix A). Finally, the last step is to configure Hadoop on top of each OpenStack VM instance (the Hadoop documentation is provided in Appendix C).
Figure 28: Hadoop virtualized cluster - KVM
The first OpenStack component to install is Keystone, which manages authentication to OpenStack resources. After downloading and installing the Keystone package, the next step is to create tenants (OpenStack projects) and OpenStack users that are associated with one or more tenants. Each user can be a member or an admin of a given project; roles therefore need to be created in order to assign rights and privileges to each user. After creating users, tenants and roles, the next step is to create the OpenStack services (nova, keystone and glance) that provide one or more endpoints (URLs) through which users can access OpenStack resources. The second component to install is Glance, which allows creating and managing different image formats (Ubuntu, Fedora, Windows, etc.). The Glance packages include glance-api, which accepts incoming API requests; glance-database, which stores all information about images, and glance-registry, which is responsible for retrieving and storing image metadata. The third component to deploy is the Nova package, which includes nova-compute, nova-scheduler, nova-network, nova-objectstore, nova-api, rabbitmq-server, novnc and nova-consoleauth. All these components collaborate and communicate with each other to create and manage instances, networks and, if needed, volumes. Finally, to get access to instances, a user-friendly interface can be installed by configuring the OpenStack dashboard (Horizon). After logging in to the dashboard, the user can launch instances, with the possibility of specifying the number of CPUs, disk space, total RAM per VM, etc.
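For instance, the tenant/user/role step above can be scripted. The sketch below assumes the Folsom-era keystone command-line client with admin credentials already exported in the environment; all names are illustrative, and the exact flags vary across keystone client versions:

    import subprocess

    def run(cmd):
        # Echo and execute one CLI command, failing fast on errors.
        print("+ " + " ".join(cmd))
        subprocess.check_call(cmd)

    run(["keystone", "tenant-create", "--name", "hpcaas"])    # an OpenStack project
    run(["keystone", "user-create", "--name", "hadoop", "--pass", "secret"])
    run(["keystone", "role-create", "--name", "member"])
    # A keystone user-role-add call then binds the user to the tenant with the
    # created role, after which the services and endpoints are registered.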
After creating the VM instances (with the requirements listed in Table 9), we installed Hadoop 1.2.1 on each VM. The Hadoop configuration starts by identifying the master node and the slave nodes. For the master node, there are six files that need to be configured: core-site, hadoop-env, hdfs-site, mapred-site, masters and slaves. Concerning the slave nodes, the only files that need to be configured are hadoop-env, core-site, hdfs-site and mapred-site. Once the nodes are connected, the cluster's file namespace needs to be formatted. After formatting HDFS, the cluster can be started to run jobs.
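As an illustration, the key properties of these files can be generated with a few lines of Python. The property names (fs.default.name, mapred.job.tracker, dfs.replication) are the standard Hadoop 1.x keys, but the master hostname, ports and replication factor below are illustrative values that must match the actual cluster:

    # Minimal property sets for a Hadoop 1.x cluster (illustrative values).
    CONFIGS = {
        "core-site.xml":   {"fs.default.name": "hdfs://master:54310"},
        "mapred-site.xml": {"mapred.job.tracker": "master:54311"},
        "hdfs-site.xml":   {"dfs.replication": "3"},
    }

    TEMPLATE = "<?xml version=\"1.0\"?>\n<configuration>\n%s\n</configuration>\n"

    def render(props):
        # Each property becomes a <property><name>..</name><value>..</value> entry.
        return TEMPLATE % "\n".join(
            "  <property><name>%s</name><value>%s</value></property>" % (k, v)
            for k, v in props.items())

    for filename, props in CONFIGS.items():
        with open(filename, "w") as f:
            f.write(render(props))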
10.3.3. Hadoop Virtualized Cluster – VMware ESXi
The third cluster built in this research is the Hadoop Virtualized Cluster using VMware ESXi technology (Figure 29). The first step in configuring this cluster is to install VMware ESXi directly on the Dell PowerEdge server. Then, OpenStack is configured on top of the hypervisor (the OpenStack-with-VMware-ESXi documentation is provided in Appendix B). After configuring OpenStack, instances can be created to build the Hadoop cluster.
Figure 29: Hadoop virtualized cluster – VMware ESXi (a)
In fact, when installing OpenStack with VMware ESXi, OpenStack itself is installed as a VM on top of the VMware ESXi hypervisor. Then, through the OpenStack dashboard, instances can be created as VMs on top of the VMware ESXi hypervisor (Figure 30).
Figure 30 : Hadoop virtualized cluster – VMware ESXi (b)
10.4.Experimental Performance Benchmarks
To evaluate the impact of machine virtualization on HPCaaS, we adopted two well-known benchmarks: TeraSort and TestDFSIO [90]. The TeraSort performance metric is the average time to sort given datasets, while the TestDFSIO performance metrics are the execution times to write and read datasets. Table 10 summarizes the performance metrics used in evaluating HPCaaS.
Table 10 : Experimental performance metrics
10.4.1. TeraSort Description
TeraSort was developed by Owen O'Malley and Arun Murthy at Yahoo! Inc. [90]. It won the annual general-purpose terabyte sort benchmark in 2008 and 2009. It performs considerable computation, networking, and storage I/O, and is often considered representative of real Hadoop workloads [90]. TeraSort is divided into three main steps: TeraGen, TeraSort and TeraValidate.
TeraGen generates the random data that will be sorted by TeraSort. It writes the generated data as a file of n rows, where each row is 100 bytes. Each row is formatted as follows: a 10-byte key, a 10-byte rowid and a 78-byte filler, where the keys are random characters from the set ' ' .. '~', the rowid is an integer that specifies the row id, and the filler consists of runs of 10 characters from 'A' to 'Z'. Once the data is generated, TeraSort sorts it using the quicksort algorithm. The latter is integrated with the map/reduce tasks so as to use a sorted list of n-1 sampled keys that define the key range for each reduce [9]. Finally, TeraValidate ensures that the output data of TeraSort is sorted. It creates one map task per file in TeraSort's output directory; each map task ensures that each key is less than or equal to the previous one. Furthermore, the map task generates records with the first and last keys of the file; the reduce tasks then ensure that the first key of file i is greater than the last key of file i-1. If there are any unordered keys, TeraValidate reports them as output of the reduce task [90]. (The TeraSort benchmark is documented in Appendix D.)
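To make the record layout concrete, the following standalone sketch builds rows in the format just described. The real TeraGen is a MapReduce job, so this is only an illustration of the 100-byte record, with the trailing carriage return and newline completing the row:

    import random

    def teragen_row(rowid):
        # One 100-byte row: 10-byte key, 10-byte rowid, 78-byte filler, "\r\n".
        key = "".join(chr(random.randint(0x20, 0x7e)) for _ in range(10))  # ' '..'~'
        rid = "%010d" % rowid                      # right-justified row id
        runs = [chr(ord("A") + (rowid + i) % 26) * 10 for i in range(8)]
        filler = "".join(runs)[:78]                # runs of 10 chars, cut to 78 bytes
        return key + rid + filler + "\r\n"         # 10 + 10 + 78 + 2 = 100 bytes

    rows = [teragen_row(i) for i in range(3)]
    assert all(len(r) == 100 for r in rows)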
10.4.2. TestDFSIO Description
The TestDFSIO benchmark is used to check the I/O rate of a Hadoop cluster with write and read operations. Such a benchmark is helpful for testing HDFS, checking network performance, and validating the hardware, OS and Hadoop setup [90]. TestDFSIO is written in Java, and its source code can be found in [91]. TestDFSIO is composed of TestDFSIO-Write and TestDFSIO-Read. Both operations are performed by specifying the number of files and the size of each file in megabytes [90]. (The TestDFSIO benchmark is documented in Appendix D.)
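A typical invocation looks like the sketch below, which assumes the hadoop binary is on the PATH and uses the hadoop-test-1.2.1.jar bundled with Hadoop 1.2.1; the file counts and sizes are illustrative:

    import subprocess

    HADOOP_TEST_JAR = "hadoop-test-1.2.1.jar"  # shipped with Hadoop 1.2.1

    def test_dfsio(mode, nr_files, file_size_mb):
        # TestDFSIO takes the number of files and the size of each file in MB.
        subprocess.check_call(
            ["hadoop", "jar", HADOOP_TEST_JAR, "TestDFSIO",
             "-" + mode, "-nrFiles", str(nr_files), "-fileSize", str(file_size_mb)])

    test_dfsio("write", 10, 1000)  # write 10 files x 1000 MB (about 10 GB in total)
    test_dfsio("read", 10, 1000)   # read the same dataset back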
10.5 Experimental Dataset Sizes
In each experiment, we measured the performance of the Hadoop cluster using different dataset sizes. For TeraSort, we used 100 MB, 1 GB, 10 GB and 30 GB datasets, and for TestDFSIO, we used 100 MB, 1 GB, 10 GB and 100 GB datasets. Table 11 summarizes the dataset sizes used in this research.
Table 11 : Datasets size used for Hadoop benchmarks
10.6 Experiment Execution
We conducted each experiment by scaling the cluster from three machines up to eight machines. In other words, we tested each benchmark on three machines, then four machines, and so on until we reached eight machines. Furthermore, for each individual benchmark, we performed three runs on 100 MB, 1 GB, 10 GB and 30 GB (TeraSort) and on 100 MB, 1 GB, 10 GB and 100 GB (TestDFSIO), and then calculated the mean in order to smooth out outliers and provide more accurate results. Figure 31 summarizes the steps of running experiment 1 on the HPhC using the TeraSort benchmark.
Figure 31 : Experimental execution
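All reported numbers are therefore means of three runs; the aggregation itself is trivial (the timings below are illustrative):

    def mean_of_runs(times):
        # Average the execution times of the repeated benchmark runs.
        return sum(times) / float(len(times))

    runs = [21.1, 21.6, 21.3]                   # three TeraSort timings in seconds
    print("mean: %.2f s" % mean_of_runs(runs))  # -> mean: 21.33 s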
Chapter 11: Experimental Results
This chapter presents the findings obtained from each experiment. It presents the results of running HPC on the HPhC, then on the HVC with KVM, and then on the HVC with VMware ESXi. The last section compares the results of the three experiments. (The raw results are listed in Appendices E and F.)
11.1.Hadoop Physical Cluster Results
11.1.1. TeraSort Performance on HPhC
Running the TeraSort benchmark showed that sorting large datasets such as 10 GB and 30 GB takes considerable time. Yet, scaling the cluster to more nodes led to a significant reduction in sorting time. The results obtained from running this benchmark on the Hadoop Physical Cluster are listed in Table 12 and plotted in Figure 32.
Table 12: Average time (in seconds) of running TeraSort on different dataset sizes and different
number of nodes- Hadoop Physical Cluster
Figure 32: TeraSort performance on Hadoop Physical Cluster
Figures 33 and 34 clearly illustrate the benefit of scaling the cluster. For instance, sorting 100 MB with 3 nodes takes around 21.33 seconds, while with 8 nodes it takes 19.97 seconds (a reduction of about 6%). In the case of 1 GB, the average time was reduced by 4% when scaling from 3 to 8 nodes.
Figure 33: TeraSort performance for 100 MB on Hadoop Physical Cluster
Figure 34: TeraSort performance for 1 GB on Hadoop Physical Cluster
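The percentages quoted in these comparisons are simple relative reductions of the mean times; for example, for the 100 MB case above:

    def pct_reduction(before, after):
        # Percent reduction in average time between two cluster sizes.
        return 100.0 * (before - after) / before

    print("%.1f%%" % pct_reduction(21.33, 19.97))  # 3 -> 8 nodes: 6.4%, i.e. ~6%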
Concerning 10 GB, the results were somewhat different (Figure 35). The time to sort 10 GB was reduced by 18.55% when scaling from 3 to 6 machines, yet increasing the number of machines to 8 nodes led to a significant drop in sorting performance. This can be explained by the impact of network bottlenecks, to which Hadoop is highly sensitive. Furthermore, scaling to 8 nodes had a notable impact when running large datasets like 30 GB (Figure 36): in this case, the average time to sort the dataset was reduced by 24.77% (a difference of 42 minutes) when increasing the number of nodes from 3 to 8.
Figure 35: TeraSort performance for 10 GB on Hadoop Physical Cluster
Figure 36: TeraSort performance for 30 GB on Hadoop Physical Cluster
11.1.2. TestDFSIO- Write Performance on HPhC
Running TestDFSIO-Write on the Hadoop physical cluster generally follows one pattern: as the number of nodes increases, the average time to write the different dataset sizes decreases. Table 13 and Figure 37 list and illustrate the results of running TestDFSIO-Write on the HPhC.
Table 13: Average time (in seconds) of running TestDFSIO-Write on different dataset sizes and
different number of nodes- Hadoop Physical Cluster
Figure 37: TestDFSIO-Write performance on Hadoop Physical Cluster
Zooming in on TestDFSIO-Write for the 100 MB dataset (Figure 38), the average time decreased as the number of slaves increased. In this case, scaling the cluster from 3 machines (including the master) to 8 machines led to a reduction of 11.25% in the overall average writing time. The same observation applies when running TestDFSIO-Write on the 1 GB dataset (Figure 39), where the average time was reduced by 46.5% when scaling from 3 to 8 slaves.
Figure 38: TestDFSIO-Write performance for 100 MB on Hadoop Physical Cluster
Figure 39: TestDFSIO-Write performance for 1 GB on Hadoop Physical Cluster
Figure 40: TestDFSIO-Write performance for 10 GB on Hadoop Physical Cluster
Figure 41: TestDFSIO-Write performance for 100 GB on Hadoop Physical Cluster
When writing 100 GB (Figure 41), we observe a sharp reduction in TestDFSIO-Write time when scaling from 3 to 8 slaves, quantified at 12.53%. However, an unexpected increase in average time occurred when scaling from 4 to 5 machines; again, this unexpected result can be explained by the overall network performance.
11.1.3. TestDFSIO- Read Performance on HPhC
Running TestDFSIO-Read also led to significant performance improvements when the physical cluster was scaled up to 8 machines (Table 14 and Figure 42). In general, this observation applies to all dataset sizes.
Table 14: Average time (in seconds) of running TestDFSIO-Read on different dataset sizes and
different number of nodes- Hadoop Physical Cluster
Figure 42: TestDFSIO-Read performance on Hadoop Physical Cluster
When the cluster was scaled from 3 to 7 nodes, the average time was reduced by 4.36% for reading 100 MB (Figure 43) and by 2.46% for reading 1 GB (Figure 44). However, when scaling the cluster from 7 to 8 machines, the average time suddenly increased for both 100 MB and 1 GB. The same observation was made when reading 10 GB and 100 GB (Figures 45 and 46).
Figure 43: TestDFSIO-Read performance for 100 MB on Hadoop Physical Cluster
Figure 44: TestDFSIO-Read performance for 1 GB on Hadoop Physical Cluster
Figure 45: TestDFSIO-Read performance for 10 GB on Hadoop Physical Cluster
Figure 46: TestDFSIO-Read performance for 100 GB on Hadoop Physical Cluster
11.2.Hadoop Virtualized Cluster- KVM Results
11.2.1. TeraSort Performance on HVC-KVM
Running TeraSort on the Hadoop KVM cluster showed an important improvement in sorting the various dataset sizes. Yet, this observation applies only when scaling the KVM cluster from 3 to 5 VMs. The results obtained from running this benchmark on the Hadoop KVM cluster are listed in Table 15 and plotted in Figure 47.
Table 15: Average time (in seconds) of running TeraSort on different dataset sizes and different
number of nodes- Hadoop KVM Cluster
Figure 47: TeraSort performance on Hadoop KVM Cluster
From Figure 48, sorting 100 MB on 3 VMs takes around 15 seconds, and the time decreases by 2.2% and 5.5% when sorting the dataset on 4 and 5 VMs, respectively.
Figure 48: TeraSort performance for 100 MB on Hadoop KVM Cluster
Figure 49: TeraSort performance for 1 GB on Hadoop KVM Cluster
When sorting 1 GB, 10 GB and 30 GB (Figures 49, 50 and 51), the performance improved slightly as the number of VMs increased. For example, the sorting time for 10 GB decreased by 0.3%, and the sorting time for 30 GB decreased by 5%, when scaling from 3 to 4 nodes. However, when the cluster was scaled to 5, 6, 7 and 8 nodes, the overall performance of sorting 1 GB, 10 GB and 30 GB dropped sharply.
Figure 50: TeraSort performance for 10 GB on Hadoop KVM Cluster
Figure 51: TeraSort performance for 30 GB on Hadoop KVM Cluster
11.2.2. TestDFSIO-Write Performance on HVC-KVM
The performance of TestDFSIO-Write on the Hadoop KVM cluster improved slightly as the number of VMs increased. The results of running TestDFSIO-Write are listed in Table 16 and illustrated in Figure 52.
Table 16: Average time (in seconds) of running TestDFSIO-Write on different dataset sizes and
different number of nodes- Hadoop KVM Cluster
Figure 52 : TestDFSIO-Write performance on Hadoop KVM Cluster
For all dataset sizes (Figures 53, 54, 55 and 56), the overall performance improved slightly as the number of VMs increased from 3 to 4 and 5. For instance, writing 10 GB improved by 1.6% when scaling from 3 to 5 VMs. Furthermore, when trying to write 100 GB, the system crashed because of the overall system overhead (Figure 56).
Figure 53: TestDFSIO-Write performance for 100 MB on Hadoop KVM Cluster
Figure 54: TestDFSIO-Write performance for 1 GB on Hadoop KVM Cluster
Figure 55: TestDFSIO-Write performance for 10 GB on Hadoop KVM Cluster
Figure 56: TestDFSIO-Write performance for 100 GB on Hadoop KVM Cluster
11.2.3. TestDFSIO- Read Performance on HVC-KVM
TestDFSIO-Read shows the same behavior as TestDFSIO-Write: the performance of reading the different dataset sizes increases as the number of VMs increases from 3 to 5. The results obtained from running TestDFSIO-Read are given in Table 17 and Figure 57.
Table 17 : Average time (in seconds) of running TestDFSIO-Read on different dataset sizes and different number of nodes- Hadoop KVM Cluster
Figure 57: TestDFSIO-Read performance on Hadoop KVM Cluster
As Figures 58, 59, 60 and 61 depict, the overall performance of reading the different dataset sizes increases as the number of VMs increases from 3 to 5. For example, the average time for reading 100 GB decreased slightly, by 3%, when scaling from 3 to 5 VMs.
Figure 58: TestDFSIO-Read performance for 100 MB on Hadoop KVM Cluster
Figure 59: TestDFSIO-Read performance for 1 GB on Hadoop KVM Cluster
Figure 60: TestDFSIO-Read performance for 10 GB on Hadoop KVM Cluster
Figure 61: TestDFSIO-Read performance for 100 GB on Hadoop KVM Cluster
11.3.Hadoop Virtualized Cluster- VMware ESXi Results
11.3.1. TeraSort Performance on HVC-VMware ESXi
Table 18 and Figure 62 present the performance of running TeraSort on the Hadoop VMware ESXi cluster; the overall observation shows a significant improvement in sorting the various dataset sizes. In contrast to the KVM cluster, VMware ESXi keeps decreasing the average sorting time as the number of VMs increases from 3 to 6 (for large datasets).
Table 18 : Average time (in seconds) of running TeraSort on different dataset sizes and different number of nodes- Hadoop VMware ESXi Cluster
Figure 62 : TeraSort performance on Hadoop VMware ESXi Cluster
As Figure 64 depicts, sorting 1 GB became 23% faster when scaling the cluster from 3 to 6 VMs. Yet, performance starts degrading as the number of VMs increases from 6 to 7 and 8.
Figure 63: TeraSort performance for 100 MB on Hadoop VMware ESXi Cluster
Figure 64: TeraSort performance for 1 GB on Hadoop VMware ESXi Cluster
Significantly higher performance was observed when sorting 30 GB (Figure 66): performance improved by 34% when scaling from 3 to 6 VMs, by 25% from 3 to 7 VMs, and by 3% from 3 to 8 VMs.
Figure 65: TeraSort performance for 10 GB on Hadoop VMware ESXi Cluster
Figure 66: TeraSort performance for 30 GB on Hadoop VMware ESXi Cluster
11.3.2. TestDFSIO-Write Performance on HVC-VMware ESXi
The performance of TestDFSIO-Write on Hadoop VMware ESXi improved as the number of VMs increased up to 7. The results of running TestDFSIO-Write are listed in Table 19 and illustrated in Figure 67.
Table 19 : Average time (in seconds) of running TestDFSIO-Write on different dataset sizes and different number of nodes- Hadoop VMware ESXi Cluster
Figure 67 : TestDFSIO-Write performance on Hadoop VMware ESXi Cluster
For all dataset sizes (Figures 68, 69, 70 and 71), the overall performance improved as the number of VMs increased from 3 to 7. For instance, writing 100 MB improved by 37% when scaling from 3 to 7 VMs. Furthermore, when writing a large dataset like 10 GB, the overall performance increased by 12% when scaling from 3 to 7 VMs. However, in the case of 100 GB, performance started degrading when scaling from 6 to 7 and 8 VMs.
Figure 68: TestDFSIO-Write performance for 100 MB on Hadoop VMware ESXi Cluster
Figure 69: TestDFSIO-Write performance for 1 GB on Hadoop VMware ESXi Cluster
Figure 70: TestDFSIO-Write performance for 10 GB on Hadoop VMware ESXi Cluster
Figure 71: TestDFSIO-Write performance for 100 GB on Hadoop VMware ESXi Cluster
11.3.3. TestDFSIO- Read Performance on HVC-VMware ESXi
TestDFSIO-Read behaves like TestDFSIO-Write in that the performance of reading the different dataset sizes increases as the number of VMs increases from 3 to 7. However, the average time for reading the different datasets was less than half that of the write operation. The results obtained from running TestDFSIO-Read on VMware ESXi are listed in Table 20 and plotted in Figure 72.
Table 20: Average time (in seconds) of running TestDFSIO-Read on different dataset sizes and different numbers of nodes - Hadoop VMware ESXi Cluster
Figure 72 : TestDFSIO-Read performance on Hadoop VMware ESXi Cluster
Figures 73, 74, 75 and 76 show the performance of running TestDFSIO-Read on each individual dataset. For most dataset sizes, the performance improved as the number of VMs increased up to 7. For instance, the performance of reading 100GB improved by 36% when scaling from 3 to 7 VMs. However, reading 1GB behaved differently, as the corresponding performance started to decline at 6 VMs.
Figure 73: TestDFSIO-Read performance for 100 MB on Hadoop VMware ESXi Cluster
Figure 74: TestDFSIO-Read performance for 1 GB on Hadoop VMware ESXi Cluster
Figure 75: TestDFSIO-Read performance for 10 GB on Hadoop VMware ESXi Cluster
Figure 76: TestDFSIO-Read performance for 100 GB on Hadoop VMware ESXi Cluster
11.4. Results Comparison
11.4.1. TeraSort Performance
The overall performance of the 3 clusters varies depending on the dataset size and the number of nodes involved in each cluster. Yet, the Hadoop VMware ESXi cluster performed much better than the other clusters when running the TeraSort benchmark on large datasets.
Starting with 100MB (Figure 77), TeraSort showed high performance when virtualized with VMware ESXi and KVM. The two clusters were 25% (VMware ESXi) and 30% (KVM) faster than the physical cluster (in the case of 3 nodes). Further, significant performance was maintained when scaling the cluster to 4, 5 and 6 nodes; in these cases too, both KVM and VMware ESXi were faster than the physical cluster.
After increasing the number of nodes to 7 and 8, VMware ESXi performance decreased by 33% and became slower than the physical cluster by 18% (when scaling from 3 to 8 nodes). On the other hand, the average time for sorting the 100MB dataset on the KVM cluster declined as the number of nodes increased to 7 and 8; the sorting time improved from 15 to 14 seconds. Further, the virtualized cluster (KVM) performed better than the physical cluster by 29.5% and 27% for 7 and 8 nodes respectively.
Figure 77 : Average time for sorting 100 MB on HPhC, HVC with KVM and
VMware ESXi
When increasing the dataset size, the performance changes in each scenario (dataset size and number of nodes). In the case of 1GB (Figure 78), the virtualized clusters kept the best performance compared with the physical cluster. When the cluster was composed of 3-5 nodes, the virtualized clusters sorted the 1GB dataset in 87-90 seconds, while the physical cluster sorted the same dataset in 182-187 seconds. When increasing the number of nodes from 5 to 8, VMware ESXi was faster than the other clusters; however, KVM's performance declined compared both with the KVM cluster of 3-4 nodes and with the physical cluster. For instance, in the case of 8 machines, the physical cluster was faster than the KVM cluster by 89%.
Figure 78 : Average time for sorting 1 GB on HPhC, HVC with KVM and VMware ESXi
The same observation made for 1GB applies when sorting the 10GB dataset (Figure 79). Yet, in this case, the performance of the virtualized clusters was much higher than that of the physical cluster. For instance, in the case of 5 VMs, the VMware ESXi cluster was faster than the physical cluster by 60%, and KVM was faster than the physical cluster by 51%.
Figure 79 : Average time for sorting 10 GB on HPhC, HVC with KVM and
VMware ESXi
When moving to larger datasets, the VMware ESXi cluster demonstrated significant performance in sorting the 30 GB dataset (Figure 80). For instance, in the case of 4 nodes, VMware ESXi was faster than the KVM cluster by 28% and faster than the physical cluster by 61%. Moreover, KVM performed better than the physical cluster when the cluster was composed of 3, 4, 5 and 6 nodes. Afterward, when increasing the cluster size to 7 and 8 nodes, the KVM cluster's performance decreased and became slower than the physical cluster.
Figure 80: Average time for sorting 30 GB on HPhC, HVC with KVM and VMware ESXi
The last observation concerns VMware ESXi performance on the 8-node cluster. For all datasets, we observed that VMware performance degraded; for example, for 10 GB, the performance decreased by 51%. Even so, VMware ESXi kept performing better than the other clusters.
11.4.2. TestDFSIO-Write Performance
The results obtained from TestDFSIO differ from those of the TeraSort benchmark. The overall observation of Figures 81 and 82 shows that virtualization still performs better than the physical cluster.
In the case of the 3-5 node clusters, we can observe that the KVM cluster's performance is much better than that of VMware ESXi and the physical cluster. For instance, when writing 100 MB using 5 nodes, the KVM cluster was 11% faster than the physical cluster and 24% faster than the VMware ESXi cluster (Figure 81). However, we observed that the physical cluster performed better than VMware ESXi, with a difference of about 48% (100 MB using 5 nodes).
When scaling the cluster from 5 to 8 nodes, the KVM cluster exhibited a sharp performance degradation. Again, this is due to system overhead. In this case, the physical cluster showed better results than the virtualized clusters.
Figure 81: Average time for writing 100 MB on HPhC, HVC with KVM and VMware ESXi
Figure 82: Average time for writing 1 GB on HPhC, HVC with KVM and VMware ESXi
The same observation applies when writing 100 GB (Figure 84). The only difference is that the KVM cluster with 8 nodes was unable to write the 100 GB dataset.
Figure 83: Average time for writing 10 GB on HPhC, HVC with KVM and VMware ESXi
Figure 84: Average time for writing 100 GB on HPhC, HVC with KVM and VMware ESXi
11.4.3. TestDFSIO-Read Performance
As illustrated in Figures 85 and 86, reading small datasets (100MB and 1GB) showed that the virtualized cluster is faster than the physical cluster. Yet, this applied to the KVM cluster only when it was composed of 3-5 nodes. Afterwards, when the KVM cluster scaled to 6, 7 and 8 nodes, the performance of reading all datasets degraded. On the other hand, the physical cluster performed better than VMware ESXi in all cases (100MB and 1GB on different numbers of nodes).
Figure 85: Average time for reading 100 MB on HPhC, HVC with KVM and HVC VMware ESXi
Figure 86: Average time for reading 1 GB on HPhC, HVC with KVM and HVC VMware ESXi
When increasing the dataset size to 10 GB and 100GB (Figures 87 and 88), we can see different performance trends. When the cluster is composed of 3-5 nodes, the KVM cluster kept better performance than the other clusters. For instance, for 100 GB and 3 nodes, the KVM cluster was faster than VMware ESXi by 12% and faster than the physical cluster by 44%. However, as with the other benchmarks (TeraSort and TestDFSIO-Write), the KVM cluster showed a sharp degradation in reading 100GB when the cluster scaled to 6, 7 and 8 nodes. When reading 10GB and 100 GB, in contrast to the TestDFSIO-Write results, the VMware ESXi cluster was faster than the physical cluster in all scenarios (numbers of nodes); for instance, it was faster than the physical cluster by 36% and 55.5% in the case of 7 and 8 nodes respectively.
An important observation is that the KVM cluster with 8 VMs was able neither to write nor to read the 100GB dataset (Figure 88).
Figure 87: Average time for reading 10 GB on HPhC, HVC with KVM and HVC VMware ESXi
Figure 88: Average time for reading 100 GB on HPhC, HVC with KVM and HVC VMware ESXi
Chapter 12: Discussion
The results obtained in this research demonstrated significant improvements when virtualizing HPC, especially when the latter was tested with the TeraSort benchmark; in this case, we found that both virtualized clusters (KVM and VMware ESXi) achieve better performance than the physical cluster.
12.1. TeraSort Performance
When running the TeraSort benchmark, the VMware ESXi cluster proved fastest at sorting large datasets (1GB, 10 GB and 30 GB). For instance, sorting 30GB using a cluster of 4 nodes showed that VMware ESXi is faster than KVM by 64% and faster than the physical cluster by 84% (Figure 80). The KVM cluster also proved to be faster than the physical cluster. However, when the number of nodes increased in the virtualized clusters, TeraSort performance degraded significantly.
In the case of the KVM cluster, when the number of nodes increased to 6, 7 and 8, the overall performance of running TeraSort became slower. The reason behind this degradation is system overhead, especially disk overhead. A study in [92] analyzed KVM scalability on the OpenStack platform and stated that KVM is not recommended when many virtual hard disks are accessed at the same time. Therefore, since TeraSort comprises both computational and I/O jobs, the KVM VMs affected the overall performance when they were scaled to 6, 7 and 8. Moreover, another study [93] states that KVM has substantial problems with guests crashing once a certain number of VMs is reached (4 in that study); hence, scalability is considered a source of system overhead when using KVM virtualization.
In the case of the VMware ESXi cluster, the performance of running TeraSort declined when the cluster was scaled to 8 nodes. As with KVM, the reason is system overhead. However, the overhead is not related to a scalability issue, because VMware ESXi is known to be scalable [94]. To verify the cause of the system overhead, we tracked the performance of sorting the 30GB dataset on 8 VMware ESXi VMs (using VMware vSphere Client), and we found that, at some point, the memory required to sort the dataset exceeds the available memory offered by the cluster. This can be observed in Figure 89, which illustrates that active memory (in red, the memory currently consumed by the VMs) is higher than the granted memory (in grey, the memory provided by the hosting hardware) in the 5:05 to 5:10 PM range. Another indication of the system overhead is the latency rate; we tracked the latency of running the 30 GB job on 8 VMs and observed that system latency reaches its peak (Figure 90) when sorting this dataset. Thus, latency impacts the overall performance when the number of VMs increases to 8. A final indication was reported by the OpenStack Dashboard (Figure 91), which showed a warning state of resource usage after creating 8 VMware ESXi instances. In short, VMware ESXi cluster performance declines at 8 VMs because of resource shortage.
Figure 89: Memory overhead when running 30 GB (started at 4.55PM) on 8 VMware ESXi VMs
Figure 90 : System latency reaches its peak (at 12.28PM) when running 30 GB on 8 VMware ESXi
VMs
Figure 91: OpenStack warning statistics about system resource usage
In short, even though TeraSort performance decreases when the number of VMware ESXi VMs increases to 8, the results we got still confirm that the Hadoop VMware ESXi cluster performs better than the Hadoop KVM Cluster and the Hadoop Physical Cluster.
12.2. TestDFSIO Performance
The performance behavior of each cluster changed when running the TestDFSIO benchmark. For all dataset sizes, the KVM cluster proved to have higher performance than the other clusters in both TestDFSIO-Write and TestDFSIO-Read (Figures 81-88). On the other hand, VMware ESXi showed the lowest performance compared to KVM and the physical cluster.
In fact, the good results obtained from running TestDFSIO on KVM are explained by the virtio API. The latter is integrated into the KVM hypervisor to provide an efficient abstraction for I/O operations [95]. Virtio was studied in [96], which showed that it enhances KVM performance for I/O operations; the authors tested the performance of KVM (with the virtio API) on I/O operations and compared it with that of VMware vSphere 5.1. They concluded that KVM with the virtio API achieves I/O rates that are 49% higher than VMware vSphere 5.1.
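To illustrate what virtio means at the VM level, the excerpt below shows a libvirt disk definition attached over the paravirtualized virtio bus instead of an emulated IDE or SCSI controller; this is only a sketch (the image path is hypothetical), since Nova generates such XML itself for KVM guests:
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <!-- hypothetical image path -->
  <source file='/var/lib/nova/instances/instance-00000001/disk'/>
  <!-- bus='virtio' makes the guest use the paravirtualized block driver -->
  <target dev='vda' bus='virtio'/>
</disk>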
When running TestDFSIO, we observed again that the performance of both virtualized
clusters decreases as the number of VMs goes beyond 6 (KVM) and 7 (VMware ESXi).
12.3. Conclusion
In brief, the overall performance of TeraSort and TestDFSIO showed, first, that virtualization can deliver better performance than a physical cluster and, second, that the selection of the underlying virtualization technology can lead to significant improvements when performing HPCaaS. In this research, VMware ESXi proved to have the best performance, especially when running computational jobs (TeraSort).
To deal with the issue of system overhead in virtualized clusters, HPCaaS needs to be run in a cloud environment with a balanced number of VMs. For this research, the reasonable number that provided high performance was 7 VMs for the VMware ESXi cluster and 5 VMs for the KVM cluster.
Part IV: Conclusion
This part summarizes the research objectives and findings and suggests related future work. The bibliography of this report is listed after the conclusion and, finally, a set of appendices (OpenStack Documentation, Hadoop Documentation, Benchmarks Execution and Data Gathering) is provided at the end of the report.
Chapter 13
Conclusion and Future Work
This project aimed at demonstrating the impact of running HPCaaS on different virtualization technologies, namely KVM and VMware ESXi.
For that, we built three main Hadoop clusters: a Hadoop Physical Cluster, a Hadoop Virtualized Cluster with KVM and a Hadoop Virtualized Cluster with VMware ESXi. For the virtualized clusters, we proposed to build the Hadoop cluster on top of the OpenStack platform. On each cluster, we ran two well-known benchmarks: TeraSort and TestDFSIO. Each benchmark was tested on different dataset sizes and on different numbers of machines (from 3 to 8 machines). To ensure the credibility and reliability of the research, we performed three tests for each scenario; for instance, we tested TeraSort for 30GB on each cluster three times and then took the mean to avoid outliers.
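As a sketch of this procedure (the jar and HDFS paths are illustrative, and the timing here uses wall-clock seconds rather than the job time reported by the JobTracker):
for i in 1 2 3; do
  start=$(date +%s)
  hadoop jar ~/hadoop-1.2.1/hadoop-examples-1.2.1.jar terasort /user/hduser/tera-in-30gb /user/hduser/tera-out-$i
  end=$(date +%s)
  echo "run $i: $((end - start)) s" >> terasort-30gb.times  # three runs, then take the mean
done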
The findings of this research clearly demonstrate that virtualized clusters can perform much better than a physical cluster when processing and handling HPC, especially when there is little overhead on the virtualized cluster. We found that the Hadoop VMware ESXi cluster performs better at sorting big datasets (more computation), while the Hadoop KVM cluster performs better at I/O operations.
Finally, this report includes detailed installation guides for OpenStack and Hadoop that will save time and facilitate the work of future students who want to work on related research.
As future work, the possibilities for extending this research go in different directions. The first proposed direction is to conduct the experiments using real HPC applications that can show precisely the impact of virtualization on HPCaaS. The second is to conduct this research using other emerging virtualization technologies such as Xen and Hyper-V. The third is to study the impact of the cloud platform itself on HPCaaS; that is, another study could examine whether replacing OpenStack with another cloud infrastructure leads to better results. Finally, since we obtained positive results about the impact of virtualization on HPCaaS, this research can be extended by integrating its findings into other domains such as the Smart Grid.
Bibliography
[1] J. Gantz and D. Reinsel, “The Digital Universe in 2020: Big Data, Bigger Digital
Shadows, and Biggest Growth in the Far East”, IDC IVIEW, pp. 1-16, 2012
[2] Gartner, Inc., “Hunting and Harvesting in a Digital World”, in Gartner CIO Agenda
Report, pp. 1-8, 2013
[3] Amazon Web Services, “High Performance Computing (HPC) on AWS”,
http://aws.amazon.com/hpc-applications/
[4] J. Gantz and D. Reinsel, “The Digital Universe Decade – Are You Ready?”, IDC IVIEW,
pp. 1-15, 2010
[5] C. Vecchiola, S. Pandey, R. Buyya, "High-Performance Cloud Computing: A View of
Scientific Applications”, in the 10th International Symposium on Pervasive Systems,
Algorithms and Networks I-SPAN, IEEE Computer Society, pp. 4-16, 2009
[6] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", in OSDI, pp. 1-12, 2004
[7] Hadoop: http://hadoop.apache.org/
[8] S. Krishnan, M. Tatineni, and C. Baru, "myHadoop – Hadoop-on-Demand on Traditional HPC Resources", in the National Science Foundation's Cluster Exploratory Program, pp. 1-7, 2011
[9] E. Molina-Estolano, M. Gokhale, C. Maltzahn, J. May, J. Bent, S. Brandt, "Mixing
Hadoop and HPC Workloads on Parallel Filesystems”, in the 4th Annual Workshop on
Petascale Data Storage, pp. 1-5, 2009
[10] C. Cranor, M. Polte, and G. Gibson, “HPC Computation on Hadoop Storage with
PLFS”, Parallel Data Laboratory at Carnegie Mellon University, pp. 1-9, 2012
[11] Y. Xiaotao, L. Aili, and Z. Lin, “Research of High Performance Computing with
Clouds”, in the Third International Symposium on Computer Science and Computational
Technology (ISCSCT), pp. 289-293, 2010
[12] KVM:http://www.linux-kvm.org/page/Main_Page
[13] VMware ESXi: http://www.vmware.com/
[14] D. Boulter, “Simplify Your Journey to the Cloud”, Capgemini and SOGETI, pp. 1-
8, 2010.
[15] P. Mell and T. Grance, “The NIST Definition of Cloud Computing”, National Institute of
Standards and Technology, pp. 1-3, 2011
[16] A. E. Youssef, “Exploring Cloud Computing Services and Applications”, Journal of
Emerging Trends in Computing and Information Sciences, vol. 3, no. 6, pp. 838-
847, 2012
[17] T. Korri, "Cloud Computing: Utility Computing over the Internet", Seminar on Internetworking, pp. 1-5, 2009
[18] ISACA, “Cloud Computing: Business Benefits with Security, Governance and Assurance
Perspectives”, pp. 1-10, 2009
[19] A. T. Velte, T. J. Velte, R. C. Elsenpeter, Cloud Computing: A Practical Approach, 1st ed., USA: McGraw-Hill, 2009
[20] Amazon Web Services: http://aws.amazon.com/
[21] Google Cloud Platform: https://cloud.google.com/
[22] Microsoft Cloud Services: http://www.microsoft.com/enterprise/it- trends/cloud-
computing/default.aspx?Search=true#fbid=33S2kMNT99z
[23] Open Source Software for Building Private and Public Clouds:
http://www.openstack.org
[24] I. Menken, and G. Blokdijk, “Cloud Computing Virtualization Specialist Complete
Certification Kit - Study Guide Book and Online Course”, Emereo Pty Ltd, 2009
[25] M. Portnoy, Virtualization Essentials, John Wiley & Sons, 2012
[26] K. Scarfone, M. Souppaya, and P. Hoffman, “Guide to Security for Full Virtualization
Technologies”, National Institute of Standards and Technology, 2011
[27] D. Dale, “Server and Storage Virtualization with IP Storage”, Storage Networking
Industry Association (SNIA), 2008
[28] D. Marinescu and R. Kroger; “State of the Art in Autonomic Computing and
Virtualization”, Wiesbaden University of Applied Sciences, pp. 1-21,2007
[29] K. Koganti, E. Patnala, S. Narasingu, J. Chaitanya, "Virtualization Technology in Cloud Computing Environment", in International Journal of Emerging Technology and Advanced Engineering, vol. 3, no. 3, 2013
[30] N. Susanta and T. Chiueh, “A Survey on Virtualization Technologies”, Department of
Computer Science at Stony Brook, 2006
[31] Virtualization: A Key to Virtualization World: http://isa.unomaha.edu/wp-
content/uploads/2012/08/Virtualization.pdf
[32] “Virtualization Overview”, white paper, VMware, 2006
[33] N. Alam, “Survey on Hypervisors”, School of Informatics and Computing at Indiana
University, 2011
[34] C. D. Graziano, “A Performance Analysis of Xen and KVM Hypervisors for Hosting the
Xen Worlds Project”, Digital Repository at Iowa State University, pp. 12-39, 2011
[35] N. Yaqub, “Comparison of Virtualization Performance: VMWare and KVM”, Master
Thesis, pp. 30-44, 2012
[36] “How Does Xen Work?”, white paper, Xen, 2009
[37] O. Kulkarmi, N. Xinli, and P. K. Swamy, "Cutting-Edge Perspective of Security Analysis for Xen Virtual Machines", International Journal of Engineering Research and Development, vol. 2, no. 3, pp. 40-45, 2012
[38] T. Hirt, “KVM – The Kernel-based Virtual Machine”, Red Hat Inc., 2010
[39] M. T. Jones, “Anatomy of a Linux Hypervisor”, IBM Corporation, 2009
[40] “VMware ESXi 5.0 Operations Guide”, white paper, VMware, 2011
[41] M. K. Kakhani, S. Kakhani, and S. R. Biradar, “Research Issues in Big Data Analytics,”
Vol. 2, No. 8, pp. 228–232, 2013
[42] C. Hagen, “Big Data and the Creative Destruction of Today’s”, ATKearney, 2012
[43] “Oracle : Big Data for the Enterprise”, white paper, Oracle Corp., 2013
[44] “Oracle NoSQL Database”, white paper, Oracle Corp., 2011
[45] S. Yu, “ACID Properties in Distributed Databases”, Advanced eBusiness Transactions
for B2B-Collaborations, 2009
[46] S. Gilbert and N. Lynch, “Brewer’s conjecture and the feasibility of consistent, available,
partition-tolerant web services,” ACM SIGACT News, vol. 33, no. 2, p. 51, 2002
[47] A. Lakshman, P. Malik, “Cassandra - A Decentralized Structured Storage System”, ACM
SIGOPS Operating Systems Review, vol. 44, no.2, pp. 35-40, 2010
[48] L. George, "Introduction," in HBase: The Definitive Guide, USA: O'Reilly Media, 2011
[49] MongoDB: http://www.mongodb.org/
[50] Apache CouchDB™: http://couchdb.apache.org/
[51] J.Bernstein, K. McMahon, “Computing on Demand—HPC as a Service: High
Performance Computing for High Performance Business”, white paper, Penguin Computing
& McMahon Consulting.
[52] Y. Xiaotao, L. Aili, Z. Lin, “Research of High Performance Computing With Clouds,”
International Symposium Computer Science and Computational Technology, pp. 289–
293, 2010
[53] Self-service POD Portal: http://www.penguincomputing.com/services/hpc-
cloud/pod
[54] Amazon Cloud Storage: http://aws.amazon.com/ec2/reserved-instances/
[55] Amazon Cloud Drive: http://aws.amazon.com/ec2/spot-instances/
[56] Microsoft High Performance Computing for Developers:
http://msdn.microsoft.com/en-us/library/ff976568.aspx
[57] Google Cloud Storage: https://cloud.google.com/products/compute-engine
[58] S. Zhou, B. Kobler, D. Duffy, and T. McGlynn, “Case Study for Running HPC
Applications in Public Clouds”, in Science Cloud '10, 2012
[59] K. R. Jackson, "Performance Analysis of High Performance Computing Applications on the Amazon Web Services Cloud", in Cloud Computing Technology and Science (CloudCom), 2010 IEEE Second International Conference on, pp. 159-168, 2010
[60] E. Walker, “Benchmarking Amazon EC2 for High-Performance Scientific Computing”,
Texas Advanced Computing Center at the University of Texas, pp. 18-23, 2008
[61] J. Ekanayake and G. Fox, “High Performance Parallel Computing with Clouds and Cloud
Technologies”, School of Informatics and Computing at Indiana University, pp. 1-
20, 2009.
[62] Y. Gu and R. L. Grossman, “Sector and Sphere: The Design and Implementation of a
High Performance Data Cloud”, National Science Foundation, pp. 1-11, 2008
[63] A. Gupta and D. Milojicic, "Evaluation of HPC Applications on Cloud", Hewlett-Packard Development Company, pp. 1-6, 2011
[64] C. Evangelinos and C. N. Hill. “Cloud Computing for parallel Scientific HPC
Applications: Feasibility of running Coupled Atmosphere-Ocean Climate Models on
Amazon’s EC2”, Department of Earth, Atmospheric and Planetary Sciences at
Massachusetts Institute of Technology, pp. 1-6, 2009
[65] “Dryad and DryadLINQ for Data Intensive Research”:
http://research.microsoft.com/en-us/collaboration/tools/dryad.aspx
[66] C. Fragni, M. Moreira, D. Mattos, L. Costa, and O. Duarte, “Evaluating Xen, VMware,
and OpenVZ Virtualization Platforms for Network Virtualization”, Federal University of
Rio de Janeiro, pp. 1-1, 2010
[67] N. Yaqub, “Comparison of Virtualization Performance: VMWare and KVM”, Master
Thesis, pp. 30-44, 2012
[68] T. Deshane, M. Ben-Yehuda, A. Shah, and B. Rao, “Quantitative Comparison of Xen
and KVM”, in Xen Summit, pp. 1-3, 2008
[69] J. Hwang, S. Wu, and T. Wood, “A Component-Based Performance Comparison of Four
Hypervisors”, George Washington University and IBM T.J. Watson Research Center, pp.
1-8, 2012
[70] A. J. Younge, R. Henschel, J. T. Brown, G. Laszewski, J. Qiu, and G. C. Fox, “Analysis
of Virtualization Technologies for High Performance Computing Environments”,
Pervasive Technology Institute, pp. 1-8, 2012
[71] Q. Jiang, "Open Source IaaS Community Analysis", Eucalyptus Systems Inc., 2012
[72] I. Voras, M. Orlic, and B. Mihaljević, "An Early Comparison of Commercial and Open-Source Cloud Platforms for Scientific Environments", University of Zagreb Faculty of Electrical Engineering and Computing, Zagreb, Croatia, 2012
[73] E. Caron, L. Toch, and J. Rouzaud-Cornabas, “Performance Comparison between
OpenStack and OpenNebula and the multi-Cloud Architecture: Application to
Cosmology”, Research Report N° 8421, 2013
[74] K. Kostantos, A. Kapsalis, D. Kyriazis, M. Themistocleous, and P. Cunha, “Open-Source
IAAS Fit for Purpose: A Comparison between OpenNebula and OpenStack", International
Journal of Electronic Business Management, Vol. 11, No. 3, 2013
[75] O. Sefraoui, M. Aissaoui, and M. Eleuldj, “Comparison of Multiple IaaS Cloud Platform
Solutions”, Mohamed I University, 2012
[76] "Donnie Berkholz's Story of Data": http://redmonk.com/dberkholz/2012/03/26/nosql-database-popularity-according-to-jaspersoft/
[77] E. Dede, M. Govindaraju, D. Gunter, R. Canon, and L. Ramakrishnan, “Performance
Evaluation of a MongoDB and Hadoop Platform for Scientific Data Analysis”, SUNY
Binghamton and Lawrence Berkeley National Lab, 2012
[78] J. H. Lee, “Log Analysis System Using Hadoop and MongoDB”, CUBRID, 2012.
[79] OpenStack: http://www.openstack.org/
[80] “OpenStack Training Guides”, white paper, OpenStack Foundation, 2013
[81] A. Sehgal, “Introduction to OpenStack: Running a Cloud Computing Infrastructure with
Openstack”, in the 6th International Conference on Autonomous Infrastructure,
Management and Security, University of Luxembourg, 2012
[82] K. Pepple, Deploying OpenStack, O'Reilly Media, 2011
[83] OpenStack, “Companies Supporting the OpenStack Foundation”,
http://www.openstack.org/foundation/companies/
[84] G. Sasiniveda and N. Revathi, “Data Analysis using Mapper and Reducer with Optimal
Configuration in Hadoop", International Journal of Computer Trends and Technology,
vol. no. 3, 2013
[85] D. Borthakur, “The Hadoop Distributed File System: Architecture and Design”, The
Apache Software Foundation, 2007
[86] T. White, Hadoop: The Definitive Guide, O'Reilly Media, 2010
[87] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop Distributed File
System”, Sunnyvale, 2010
[88] H. Herodotu, “Hadoop Performance Models”, Computer Science Department at Duke
University, 2011
[89] Blogclub Tworkshops,”Hadoop and MapReduce”,
http://www.alex-hanna.com/tworkshops/lesson-5-hadoop-and-mapreduce/
[90] M. G. Noll, "Benchmarking and Stress Testing an Hadoop Cluster with TeraSort, TestDFSIO & Co.", 2011
[91] Apache Hadoop, “TestDFSIO Apache Hadoop Code Source”,
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-
mapreduce- client-jobclient/0.23.9/org/apache/hadoop/fs/TestDFSIO.java
[92] F. Rahma, T. Adji, Widyawan, "Scalability Analysis of KVM-Based Private Cloud for IaaS", in International Journal of Cloud Computing and Services Science, vol. 2, no. 4, pp. 288-295, 2013
[93] T. Deshane, M. Ben-Yehuda, A. Shah, B. Rao, "Quantitative Comparison of Xen and KVM", in Journal of Physics: Conference Series, 2010
[94] “Virtualizing Resource intensive Applications”, white paper, VMware, 2009
[95] "Scale-up Virtualization with Red Hat Enterprise Linux 5.4 on an HP ProLiant DL785 G6", white paper, Red Hat, 2009
[96] "KVM Virtualized I/O Performance", white paper, IBM & Red Hat, 2013
Appendix A: OpenStack with KVM Configuration
Pre-configuration
1. Update your machine
sudo apt-get update
sudo apt-get upgrade
2. Install bridge-utils
sudo apt-get install bridge-utils
3. NTP Server
3.1. Install the NTP Server
sudo apt-get install ntp
3.2. Open the file /etc/ntp.conf
Add the following lines to make sure that the time on the server stays in sync with an external
server.
server ntp.ubuntu.com
server 127.127.1.0
fudge 127.127.1.0 stratum 10
3.3.Restart NTP Service
sudo service ntp restart
4. Network Configuration
As public IP address changes periodically, you need to set a static IP address that will be used
in OpenStack configuration. In this case, we have two network interfaces eth0 and eth1. Eth0
was chosen as the network management; as a result, this interface was set to static IP address
(in this guide, we used 10.60.62.12 as an IP management).
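On Ubuntu 12.04 this can be done in /etc/network/interfaces; the following is a minimal sketch (the netmask, gateway and DNS values are assumptions that must be adapted to your own network):
auto eth0
iface eth0 inet static
    address 10.60.62.12
    netmask 255.255.255.0
    gateway 10.60.62.1
    dns-nameservers 8.8.8.8
Then apply the change:
sudo /etc/init.d/networking restart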
Hypervisor Configuration
1. KVM Configuration
If you want to install OpenStack with the KVM hypervisor, then you need to follow these steps:
1.1.Check if your machine supports virtualization
ouidad@ouidad:~$ egrep -c '(vmx|svm)' /proc/cpuinfo
8
ouidad@ouidad:~$
If the output is 0, then your machine does not support virtualization; if the output is greater than 0, the machine supports virtualization technology.
1.2. Check if KVM can be supported
ouidad@ouidad:~$ kvm-ok
INFO: /dev/kvm exists
KVM acceleration can be used
ouidad@ouidad:~$
If the output is as shown above, then your machine supports KVM virtualization.
1.3.Install KVM and libvirt
sudo apt-get install kvm libvirt-bin
1.4.KVM configuration
You can check the following website to configure the necessary files for KVM support:
https://help.ubuntu.com/community/KVM/Installation
1.5 Reboot your machine
OpenStack Databases Configuration
1. MySQL
1.1.Install Mysql server and related packages
sudo apt-get install mysql-server python-mysqldb
1.2.Create the root password for MySQL
The password used in this guide is "secret"
1.3.Open /etc/mysql/my.cnf
Change the bind address from bind-address=127.0.0.1 to bind-address = 0.0.0.0
1.4. Restart MySQL server
sudo restart mysql
2. Nova Database
2.1. Create Nova database “nova”
sudo mysql -uroot -psecret -e 'CREATE DATABASE nova;'
2.2.Create nova user named “novadbadmin”
sudo mysql -uroot -psecret -e 'CREATE USER novadbadmin;'
2.3.Grant all privileges for novadbadmin on the database "nova"
sudo mysql -uroot -psecret -e "GRANT ALL PRIVILEGES ON nova.* TO 'novadbadmin'@'%';"
2.4. Create a password for the user "novadbadmin"; the password in this case is "novasecret"
sudo mysql -uroot -psecret -e "SET PASSWORD FOR 'novadbadmin'@'%' = PASSWORD ('novasecret');"
3. Glance Database
3.1.Create glance database named “glance”
sudo mysql -uroot -psecret -e 'CREATE DATABASE glance;'
3.2.Create a user named “glancedbadmin”
sudo mysql -uroot -psecret -e 'CREATE USER glancedbadmin; '
3.3. Grant all privileges for glancedbadmin on the database "glance"
sudo mysql -uroot -psecret -e "GRANT ALL PRIVILEGES ON glance.* TO 'glancedbadmin'@'%';"
3.4. Create a password for the user "glancedbadmin"
sudo mysql -uroot -psecret -e "SET PASSWORD FOR 'glancedbadmin'@'%' = PASSWORD('glancesecret');"
4. Keystone Database
4.1.Create a database named “keystone”
sudo mysql -uroot -psecret -e 'CREATE DATABASE keystone;'
4.2.Create a user named “keystonedbadmin”.
sudo mysql -uroot -psecret -e 'CREATE USER keystonedbadmin;'
4.3. Grant all privileges for keystonedbadmin on the database "keystone".
sudo mysql -uroot -psecret -e "GRANT ALL PRIVILEGES ON keystone.* TO 'keystonedbadmin'@'%';"
4.4.Create a password for the user "keystonedbadmin"
sudo mysql -uroot -psecret -e "SET PASSWORD FOR 'keystonedbadmin'@'%' = PASSWORD('keystonesecret');"
Keystone Configuration
1. Install Keystone
sudo apt-get install keystone python-keystone python-keystoneclient
2. Open /etc/keystone/keystone.conf
Make the following changes:
Change admin_token = ADMIN to admin_token = admin
Change connection = sqlite:////var/lib/keystone/keystone.db
to connection = mysql://keystonedbadmin:[email protected]/keystone
3. Restart keystone
sudo service keystone restart
4. Create the keystone schema in the MySQL database
sudo keystone-manage db_sync
5. Export environment variables
export SERVICE_ENDPOINT="http://localhost:35357/v2.0"
export SERVICE_TOKEN=admin
Note: you can also add these variables to ~/.bashrc so as to avoid exporting them each time.
6. Create tenants
Create admin and service tenants
keystone tenant-create --name admin
keystone tenant-create --name service
7. Create users
Create OpenStack users by executing the following commands. In this case, we are creating
four users - admin, nova, glance and swift
keystone user-create --name admin --pass admin --email [email protected]
keystone user-create --name nova --pass nova --email [email protected]
keystone user-create --name glance --pass glance --email [email protected]
keystone user-create --name swift --pass swift --email [email protected]
8. Create roles
Create the roles by executing the following commands. In this case, we are creating two roles
- admin and Member.
keystone role-create --name admin
keystone role-create --name Member
Sample output:
9. List tenants, users and roles
keystone tenant-list
keystone user-list
keystone role-list
Sample output:
10. Adding roles to users in tenants
10.1. Add the role of "admin" to the user "admin" of the tenant "admin"
keystone user-role-add --user 4e77ea930bf944efadfb79f5fc8789a2 --role 8af19783ac784e0397e0346c7f1ec --tenant_id ee14adbd1ac84445921819cf7a5b7f5f
10.2. Add the role of "admin" to the user "nova" of the tenant "service"
keystone user-role-add --user 5ce6dd40bf2249e5ab35a95da63d7930 --role 8af19783ac784e0397e0346c7f1ec --tenant_id 11824c8169924b098f41dae1fa726c6
10.3. Add the role of "admin" to the user "glance" of the tenant "service"
keystone user-role-add --user 9967843ee4aa421189f3382849700cad --role 8af19783ac784e0397e0346c7f1ec --tenant_id 11824c8169924b098f41dae1fa726c6
10.4. Add the role of "admin" to the user "swift" of the tenant "service"
keystone user-role-add --user 24979d9ac31e4b83a58a89c1ad842ffa --role 8af19783ac784e0397e0346c7f1ec --tenant_id 11824c8169924b098f41dae1fa726c6
10.5. The "Member" role is used by Horizon and Swift, so add the "Member" role accordingly (user: admin, role: Member, tenant: admin)
keystone user-role-add --user 4e77ea930bf944efadfb79f5fc8789a2 --role c2860fd6f3fd4538a07161bdb2691f60 --tenant_id ee14adbd1ac84445921819cf7a5b7f5f
11. Create services
Create the required services with which the users can authenticate; nova-compute, nova-volume, glance, swift, keystone and ec2 are some of the services that we create.
11.1.Nova Compute Service
keystone service-create --name nova --type compute --description 'OpenStack Compute Service'
11.2.Volume Service
keystone service-create --name volume --type volume --description 'OpenStack Volume Service'
11.3.Image Service
keystone service-create --name glance --type image --description 'Openstack Image Service'
11.4. Object Store Service
keystone service-create --name swift --type object_store --description 'Openstack Storage Service'
11.5.Identity Service
keystone service-create --name keystone --type identity --description 'Openstack Identity Service'
11.6.EC2 Service
keystone service-create --name ec2 --type ec2 --description 'EC2 Service'
12. List keystone service list
keystone service-list
Sample output:
13. Create endpoints
Create endpoints for each of the services created above (the service id is displayed using the keystone service-list command).
13.1. Endpoint for the identity service
keystone endpoint-create --region RegionOne --service_id 207bf81ddfe1481aa242148f246d091f --publicurl http://localhost:5000/v2.0 --internalurl http://localhost:5000/v2.0 --adminurl http://localhost:35357/v2.0
13.2. Endpoint for the nova service
keystone endpoint-create --region RegionOne --service_id 72b9d125eaa84aaf9c8ce734027eea21 --publicurl 'http://localhost:8774/v2/%(tenant_id)s' --internalurl 'http://localhost:8774/v2/%(tenant_id)s' --adminurl 'http://localhost:8774/v2/%(tenant_id)s'
13.3. Endpoint for the image service
keystone endpoint-create --region RegionOne --service_id 581f6a8e337642a0a39090ffe6947e2d --publicurl 'http://localhost:9292/v1' --internalurl 'http://localhost:9292/v1' --adminurl 'http://localhost:9292/v1'
13.4. Define the EC2 compatibility service
keystone endpoint-create --region RegionOne --service_id 4b1619d4f9f34cc9aaf473282c2340f0 --publicurl http://localhost:8773/services/Cloud --internalurl http://localhost:8773/services/Cloud --adminurl http://localhost:8773/services/Admin
13.5. Endpoint for the volume service
keystone endpoint-create --region RegionOne --service_id 6afe27a1768b403b9521418a87646ec4 --publicurl 'http://localhost:8776/v1/%(tenant_id)s' --internalurl 'http://localhost:8776/v1/%(tenant_id)s' --adminurl 'http://localhost:8776/v1/%(tenant_id)s'
13.6. Endpoint for the object storage service
keystone endpoint-create --region RegionOne --service_id 2ec242420a114671a4fe15e745b45d3f --publicurl 'http://localhost:8888/v1/AUTH_%(tenant_id)s' --adminurl 'http://localhost:8888/v1' --internalurl 'http://localhost:8888/v1/AUTH_%(tenant_id)s'
Glance Configuration
1. Install Glance packages
sudo apt-get install glance glance-api glance-client glance-common glance-registry python-glance
2. Open /etc/glance/glance-api-paste.ini
Change the following lines:
admin_tenant_name = %SERVICE_TENANT_NAME%
admin_user = %SERVICE_USER%
admin_password = %SERVICE_PASSWORD%
By:
admin_tenant_name = service
admin_user = glance
admin_password = glance
3. Now open /etc/glance/glance-registry-paste.ini
Change the following lines:
admin_tenant_name = %SERVICE_TENANT_NAME%
admin_user = %SERVICE_USER%
admin_password = %SERVICE_PASSWORD%
By:
admin_tenant_name = service
admin_user = glance
admin_password = glance
4. Open the file /etc/glance/glance-registry.conf
Change the line which contains the option "sql_connection =" to this:
sql_connection = mysql://glancedbadmin:[email protected]/glance
Add the following lines at the end of the file as to allow glance to use keystone for
authentication.
[paste_deploy]
flavor = keystone
5. Open /etc/glance/glance-api.conf
Add the following lines at the end of the file.
[paste_deploy]
flavor = keystone
6. Create glance schema in MySQL database
sudo glance-manage version_control 0
sudo glance-manage db_sync
7. Restart glance-api and glance-registry
sudo restart glance-api
sudo restart glance-registry
8. Export the following environment variables.
export SERVICE_TOKEN=admin
export OS_TENANT_NAME=admin
export OS_USERNAME=admin
export OS_PASSWORD=admin
export OS_AUTH_URL="http://localhost:5000/v2.0/"
export SERVICE_ENDPOINT=http://localhost:35357/v2.0
Note: you can add these variables to ~/.bashrc.
9. Check if glance was successfully configured
glance index
The above command displays nothing; if you get an output, check the troubleshooting section.
Nova Configuration
1. Install Nova packages
sudo apt-get install nova-api nova-cert nova-compute nova-compute-kvm nova-doc nova-network nova-objectstore nova-scheduler nova-volume rabbitmq-server novnc nova-consoleauth
2. Edit the /etc/nova/nova.conf file
--dhcpbridge_flagfile=/etc/nova/nova.conf
--dhcpbridge=/usr/bin/nova-dhcpbridge
--logdir=/var/log/nova
--state_path=/var/lib/nova
--lock_path=/run/lock/nova
--allow_admin_api=true
--use_deprecated_auth=false
--auth_strategy=keystone
--scheduler_driver=nova.scheduler.simple.SimpleScheduler
--s3_host=10.60.62.12
--ec2_host=10.60.62.12
--rabbit_host=10.60.62.12
--cc_host=10.60.62.12
--nova_url=http://10.60.62.12:8774/v1.1/
--routing_source_ip=10.60.62.12
--glance_api_servers=10.60.62.12:9292
--image_service=nova.image.glance.GlanceImageService
--iscsi_ip_prefix=192.168.4
--sql_connection=mysql://novadbadmin:[email protected]/nova
--ec2_url=http://10.60.62.12:8773/services/Cloud
--keystone_ec2_url=http://10.60.62.12:5000/v2.0/ec2tokens
--api_paste_config=/etc/nova/api-paste.ini
--libvirt_type=kvm
--libvirt_use_virtio_for_bridges=true
--start_guests_on_host_boot=true
--resume_guests_state_on_host_boot=true
--novnc_enabled=true
--novncproxy_base_url=http://10.60.62.12:6080/vnc_auto.html
--vncserver_proxyclient_address=10.60.62.12
--vncserver_listen=10.60.62.12
--network_manager=nova.network.manager.FlatDHCPManager
--public_interface=eth0
--flat_interface=eth0
--flat_network_bridge=br100
--network_size=32
--flat_injected=False
--force_dhcp_release
--iscsi_helper=tgtadm
--connection_type=libvirt
--root_help
Important Note: "10.60.62.12" has to be replaced by your local machine's public IP address. Moreover, you need to change the "libvirt_type" variable to the hypervisor you are actually using.
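For example, if the machine does not support hardware virtualization (the egrep check above returns 0), plain QEMU emulation can be selected instead; kvm and qemu are both accepted values for this flag:
--libvirt_type=qemu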
3. Change the ownership of the /etc/nova folder and permissions for
/etc/nova/nova.conf
sudo chown -R nova:nova /etc/nova
sudo chmod 644 /etc/nova/nova.conf
4. Open /etc/nova/api-paste.ini
Change the following configuration
admin_tenant_name = %SERVICE_TENANT_NAME%
admin_user = %SERVICE_USER%
admin_password = %SERVICE_PASSWORD%
By:
admin_tenant_name = service
admin_user = nova
admin_password = nova
5. Create nova schema in the MySQL database.
sudo nova-manage db sync
6. Provide a range of IPs to be associated with the instances.
sudo nova-manage network create private --fixed_range_v4=10.60.62.0/27 --bridge=br100 --bridge_interface=eth0 --network_size=32
7. Export the following environment variables.
export OS_TENANT_NAME=admin
export OS_USERNAME=admin
export OS_PASSWORD=admin
export OS_AUTH_URL="http://localhost:5000/v2.0/"
Note: you can add the environment variables at the end of ~/.bashrc file.
8. Manage nova volumes
Create a Physical Volume:
sudo pvcreate /dev/sda3
Create a Volume Group named nova-volumes:
sudo vgcreate nova-volumes /dev/sda3
Note: to create a physical volume, you first need to create a primary partition (in this guide, the partition name is /dev/sda3). In this case, you can follow these steps:
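A minimal sketch of creating such a partition with fdisk (assuming the disk is /dev/sda and partition number 3 is free; adapt both to your machine):
sudo fdisk /dev/sda
# inside fdisk: n (new partition), p (primary), 3 (partition number),
# accept the suggested start/end sectors, then w (write the table and exit)
sudo partprobe /dev/sda    # re-read the partition table without rebooting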
9. Restart nova services
sudo service libvirt-bin restart
sudo service nova-network restart
sudo service nova-compute restart
sudo service nova-api restart
sudo service nova-objectstore restart
sudo service nova-scheduler restart
sudo service nova-volume restart
sudo service nova-consoleauth restart
10. Check if nova services are running
sudo nova-manage service list
Sample output:
Note: if the state of a given service is not :-), then try to run the following commands in separate terminals:
sudo /usr/bin/nova-compute
sudo /usr/bin/nova-network
…
OpenStack Dashboard
1. Install OpenStack Dashboard
sudo apt-get install openstack-dashboard
2. Restart apache service
sudo service apache2 restart
3. Open a browser and enter the IP address of your machine
If you followed this tutorial, then the possible logins are:
Username: admin Password: admin
Username: nova Password: nova
Username: glance Password: glance
Username: swift Password: swift
Figure 1: Dashboard authentication page
Image Configuration
In order to create an image, you can access the following links to download the needed images:
http://smoser.brickies.net/ubuntu/ttylinux-uec/old/
http://uec-images.ubuntu.com/
Example: Ubuntu Precise i386 Image
1. Download Ubuntu Precise Version (12.04 LTS)
Download the Ubuntu Precise version (precise-server-cloudimg-i386.tar.gz) from http://uec-images.ubuntu.com/precise/current/, using the following command:
wget http://uec-images.ubuntu.com/precise/current/precise-server-cloudimg-i386.tar.gz
2. Extract the downloaded package
sudo tar -xvzf precise-server-cloudimg-i386.tar.gz
The extracted files are:
precise-server-cloudimg-i386-vmlinuz-virtual
precise-server-cloudimg-i386-loader
precise-server-cloudimg-i386.img
3. Add the Ubuntu image into glance database
3.1. Add the kernel file
glance add name="precise32-kernel" disk_format=aki container_format=aki < precise-
server-cloudimg-i386-vmlinuz-virtual
3.2. Add the loader file
glance add name="precise32-ramdisk" disk_format=ari container_format=ari < precise-
server-cloudimg-i386-loader
3.3.Add the image file
Get the id of both the kernel and the loader using the glance index command:
glance index
Sample output:
In this case, the id of Ubuntu kernel is 8386c173-cd90-4c7d-8540-da484abd0c1a and the id
of Ubuntu loader is 5e0f8ceb-8fcd-4fc7-9b2b-1bcd3e3d8c9d.
Now, add the image file using the kernel and loader id:
glance add name="Precise32_image" disk_format=ami container_format=ami
kernel_id=8386c173-cd90-4c7d-8540-da484abd0c1a ramdisk_id=5e0f8ceb-8fcd-4fc7-
9b2b-1bcd3e3d8c9d < precise-server-cloudimg-i386.img
4. Using Horizon, you can find the uploaded image (Precise32_image)
Figure 2: List of OpenStack images
Keypair Configuration
1. Generate a key for your local machine
If you did not generate a key for your local machine, then run the following command:
ssh-keygen -t rsa -P ""
2. Create keypair
The following command can be used to either generate a new keypair, or to upload an existing
public key.
cd .ssh
nova keypair-add --pub_key id_rsa.pub mykey
nova keypair-list
3. List keypairs
nova keypair-list
Sample output:
4. Check the created keypair
Confirm that the uploaded keypair matches the local key by checking your key's fingerprint
with the ssh-keygen command:
ssh-keygen -l -f ~/.ssh/id_rsa.pub
Sample output:
Note: You can use OpenStack Dashboard to perform all operations related to keypair
generation.
Security Groups Configuration
1. List default security groups
nova secgroup-list
Sample output:
2. Enable access to TCP port 22
Allow access to port 22 from all IP addresses (specified in CIDR notation as 0.0.0.0/0) with
the following command:
nova secgroup-add-rule default tcp 22 22 0.0.0.0/0
Sample output:
3. Enable pinging to virtual machine instance by allowing ICMP traffic
nova secgroup-add-rule default icmp -1 -1 0.0.0.0/0
Sample output:
Flavors Configuration
1. Flavor overview
Flavors are used to specify the properties of an instance. The following table lists the arguments needed to define a flavor.
ID: A unique numeric id.
Name: A descriptive name. The xx.size_name form is conventional but not required, though some third-party tools may rely on it.
Memory_MB: Virtual machine memory in megabytes.
Disk: Virtual root disk size in gigabytes. This is an ephemeral disk the base image is copied into. When booting from a persistent volume it is not used. The "0" size is a special case which uses the native base image size as the size of the ephemeral root volume.
Ephemeral: Specifies the size of a secondary ephemeral data disk. This is an empty, unformatted disk and exists only for the life of the instance.
Swap: Optional swap space allocation for the instance.
VCPUs: Number of virtual CPUs presented to the instance.
RXTX_Factor: Optional property that allows created servers to have a different bandwidth cap than that defined in the network they are attached to. This factor is multiplied by the rxtx_base property of the network. The default value is 1.0 (that is, the same as the attached network).
Is_Public: Boolean value defining whether the flavor is available to all users or private to the tenant it was created in. Defaults to True.
extra_specs: Additional optional restrictions on which compute nodes the flavor can run on. This is implemented as key/value pairs that must match against the corresponding key/value pairs on compute nodes. Can be used to implement things like special resources (such as flavors that can only run on compute nodes with GPU hardware).
Table 1: Flavor arguments
2. List available flavors
Use nova flavor-list command to view the list of available flavors:
nova flavor-list
3. Create a flavor
Create a flavor with the following suggested specifications:
sudo nova-manage instance_type create --name=m1.cluster --memory=975 --cpu=2 --root_gb=100 --ephemeral_gb=10 --flavor=8
Instances Management
Instances can be created either by using the dashboard interface or using command line.
1. Create instances with no specifications
nova boot --flavor ID --image Image-ID MyInstanceName
2. Create an instance with an associated keypair
To associate a key with an instance on boot, add --key_name Mykey to your command line:
nova boot --image Image-ID --flavor ID --key_name Mykey MyInstanceName
3. Create an instance with a security group
It is also possible to add and remove security groups when an instance is running.
nova add-secgroup MyInstanceName MysecurityGroup
nova remove-secgroup MyInstanceName MysecurityGroup
4. Create an instance with a given keypair and security group
nova boot --flavor ID --image Image-ID --key_name Mykey MyInstanceName
5. Display instance details
nova show MyInstanceName
6. Access an instance
You can connect to an instance console via VNC. The latter can be accessed through the Horizon interface, the command line or other tools such as virt-manager.
Using command line
nova get-vnc-console host_name novnc
Sample output:
The link displayed above can be used to access the instance console.
Using virt-manager
If you cannot connect to the VNC console, you can use virt-manager; in this case, use the following command to download the virt-manager package:
sudo apt-get install virt-manager
To have access to the virt-manager interface, run the following command:
sudo virt-manager
Using local machine terminal
If the instance you created asks for a login name and password, you can access the instance through your local machine. In this case, you need to follow these steps:
ssh-copy-id -i $HOME/.ssh/id_rsa.pub username@Instance_ip_address
For Ubuntu the user name is root or ubuntu.
Example: if you want to access an Ubuntu instance with IP address 10.60.62.8, you can run the following command:
ssh-copy-id -i $HOME/.ssh/id_rsa.pub [email protected]
Sample output:
7. Connecting Instances
The following steps can be followed to connect OpenStack instances (assumption: we need to connect an instance with hostname host1 to another instance with hostname host2):
Generate the keypair on host1 & host2 to run ssh (ssh-keygen -t rsa)
On host2
o Check the "sshd_config" file on that instance (it is located at /etc/ssh/sshd_config)
o Uncomment the following two lines in sshd_config
RSAAuthentication yes
PubkeyAuthentication yes
o Append the contents of the id_rsa.pub file of host1 to the authorized_keys file of host2, as shown below
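The append step can be done in a single command from host1; this is a sketch assuming the same user account exists on both instances and that login to host2 is already possible:
hduser@host1:~$ cat ~/.ssh/id_rsa.pub | ssh hduser@host2 'cat >> ~/.ssh/authorized_keys'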
8. Delete an instance
nova delete MyInstanceName
OpenStack Troubleshooting
Glance Exceptions
1. Exception 1: “glance index “ error
ouidad@ouidad:~$ glance index
Failed to show index. Got error:
There was an error connecting to a server
Details: [Errno 111] Connection refused
Solution
In most cases, the above exception is due to the glance-api service not running. Therefore, you need to check why the glance-api service is not running.
In our case, the error was in glance-api-paste.ini, so you need to open that file to fix it:
ouidad@ouidad:~$ sudo gedit /etc/glance/glance-api-paste.ini
After fixing the error, you need to restart the glance-api service:
ouidad@ouidad:~$ sudo restart glance-api
Nova Exceptions
1. Exception 1: nova services not running in "sudo nova-manage service list"
When running "sudo nova-manage service list", if a service has the "xxx" state, then you need to run that service in a separate terminal.
Solution
For example, if nova-compute has “xxx” state, you need to run the following command:
sudo /usr/bin/nova-compute
The same solution can be applied for other services:
sudo /usr/bin/nova-network
sudo /usr/bin/nova-scheduler
sudo /usr/bin/nova-consoleauth
sudo /usr/bin/nova-cert
sudo /usr/bin/nova-volume
2. Exception 2: “sudo nova-manage service list” doesn’t display the expected output
ouidad@ouidad:~$ sudo nova-manage service list
Command failed, please check log for more info
2013-09-02 19:46:28.050 15999 CRITICAL nova [-] No module named
quantumclient.common
Solution
ouidad@ouidad:~$ sudo apt-get install python-quantumclient
3. Exception 3: Unable to start nova-compute: "libvirtError: operation failed: domain 'instance-…' already exists with uuid …"
Sample output:
Solution
You need to log in to the nova database and delete the instance id from the instances table. Moreover, you need to delete the instance id from related tables such as security_group_instance_association and instance_info_caches.
Example: we want to delete an instance with id = 3.
From the tables displayed above, delete the instance with id = 3 from security_group_instance_association and instance_info_caches, as well as from the virtual_interfaces table.
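A sketch of the corresponding SQL is shown below; the instance_id/id column names follow the Essex-era nova schema and should be verified with DESCRIBE before deleting anything:
mysql -unovadbadmin -pnovasecret nova
DELETE FROM security_group_instance_association WHERE instance_id = 3;
DELETE FROM instance_info_caches WHERE instance_id = 3;
DELETE FROM virtual_interfaces WHERE instance_id = 3;
DELETE FROM instances WHERE id = 3;  -- delete from the main table last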
Dashboard Exceptions
1. Exception 1: “Unable to retrieve images/instances…”
Sample output
Solution
If you get one of these exceptions, the only way we found to solve the problem is to drop the endpoints and re-create them, as sketched below. Then, you need to reboot your local machine.
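A sketch of that recovery (the endpoint id is a placeholder shown by endpoint-list; re-create each endpoint afterwards with the same keystone endpoint-create commands listed in the Keystone Configuration section):
keystone endpoint-list                  # note the id of each endpoint
keystone endpoint-delete <endpoint-id>  # repeat for every endpoint to drop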
References for Appendix A
http://docs.openstack.org/folsom/openstack-ops/content/flavors.html
http://www.hastexo.com/resources/docs/installing-openstack-essex-20121-ubuntu-1204-precise-pangolin
http://docs.openstack.org/essex/openstack-compute/starter/content/Introduction_to_OpenStack_and_its_components-d1e59.html
Appendix B. OpenStack with VMware ESXi
Configuration
1. Downloading VMware ESXi
Download VMware ESXi (vSphere 5.5) from:
https://my.vmware.com/web/vmware/evalcenter?p=vsphere-55
2. Installing VMware ESXi
After burning the VMware ESXi software onto a CD, install it on top of your hardware.
3. Download vSphere Client
To manage your VMware ESXi host:
Install vSphere Client on another machine running Windows OS.
After opening the software, log in to the VMware ESXi machine with your username and password.
After login, you will have access to the VMware ESXi machine's resources. In our case, the VMware ESXi machine has the IP address 10.50.1.166 (Figure 1)
Figure 1: vSphere Client interface: access to VMware ESXi 10.50.1.166
4. Create “Openstack VM”
Create a virtual machine on top of VMware ESXi using vSphere Client. The VM will be used to host OpenStack.
Create the VM with an Ubuntu Precise LTS 12.04 64-bit guest.
5. Download VMware vSphere Web Services SDK
Download the appropriate SDK from: http://www.vmware.com/support/developer/vc-sdk/
Copy the SDK to the /openstack/vmware folder.
Make sure that the WSDL is available by checking that the path /openstack/vmware/SDK/wsdl/vim25/vimService.wsdl exists; this path will be specified in nova.conf.
6. Configure OpenStack on “VMware ESXi”
You need to follow the same steps provided in OpenStack –KVM documentation.
The main difference here is the nova.conf configuration.
7. Nova.conf Configuration
In this case, you need to specify the compute_driver, host_ip (the VMware ESXi machine), host_username, host_password and wsdl_location (for the SDK) as follows:
[vmware]
host_password = 12357890
host_username = root
host_ip = 10.50.1.166
compute_driver = vmwareapi.VMwareESXDriver
wsdl_location = file:///openstack/vmware/SDK/wsdl/vim25/vimService.wsdl
8. Dashboard access
Access OpenStack resources from Horizon using the IP address of the "Openstack VM".
9. Make sure that your OpenStack is installed with VMware ESXi
This is done from the Horizon interface.
Example:
Figure 2: OpenStack with VMware ESXi hypervisor
10. Manage OpenStack with VMware ESXi
After configuring OpenStack, you can now download images and create instances.
Each time you create an instance, it will be displayed in the vSphere Client interface as depicted in Figure 1.
Concerning images, you need to add images with the vmdk extension. You can find them on the following website (you can download them from the free images section):
http://stacklet.com
Figure 3: access to VMs (OpenStack instances) through vSphere Client interface
References
http://docs.openstack.org/trunk/config-reference/content/vmware.html
https://my.vmware.com/web/vmware/evalcenter?p=vsphere-55
https://www.vmware.com/support/developer/vc-sdk/
Appendix C: Hadoop Configuration
Prerequisites for Installing Hadoop
1. Adding a dedicated Hadoop system user (all machines)
Create a Hadoop user account (hduser) for running Hadoop using the following commands:
ouidad@host1:~$ sudo addgroup hadoop
ouidad@host1:~$ sudo adduser --ingroup hadoop hduser
2. Configuring SSH
2.1. To manage the cluster's nodes, Hadoop requires SSH access. In this case, you need to generate an SSH key for the hduser user.
ouidad@host1:~$ su hduser
Password:
hduser@host1:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
44:f5:7b:85:32:f7:69:c7:d7:fc:75:38:63:32:be:d7 hduser@host1
The key's randomart image is:
+--[ RSA 2048]----+
| ... |
| . . . |
| . + o .|
| . = *o|
| S + *oX|
| . =.o*|
| . ..|
| .. E|
| .. |
+-----------------+
2.2. The RSA key pair created above has an empty password so that Hadoop can interact directly with its nodes without prompting. Enable SSH access to your local machine with this newly created key:
hduser@host1:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
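You can then verify that passwordless SSH to the local machine works (accept the host key fingerprint when prompted on the first connection):
hduser@host1:~$ ssh localhost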
3. Install JAVA
3.1. Download jdk-6u45-linux-i586.bin (for 32-bit architectures) from:
http://www.oracle.com/technetwork/java/javasebusiness/downloads/java-archive-downloads-javase6-419409.html
3.2.JDK Installation
chmod +x jdk-6u45-linux-i586.bin
sudo ./jdk-6u45-linux-i586.bin
3.3. Make sure that the JDK is installed
ouidad@host1:~$ java -version
java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
Java HotSpot(TM) Client VM (build 19.1-b02, mixed mode, sharing)
3.4. Move the JDK folder from its current location to /home/hduser:
ouidad@host1:~$ sudo cp -r ~/Downloads/jdk1.6.0_45 /home/hduser
3.5. Change the JDK ownership
ouidad@host1:~$ sudo chown -R hduser:hadoop /home/hduser/jdk1.6.0_45/
Installing Hadoop
1. Download Hadoop version 1.2.1 (hadoop-1.2.1.tar.gz) from
http://www.apache.org/dyn/closer.cgi/hadoop/core
2. Extract the downloaded archive
ouidad@host1:~/Downloads$ tar -zxvf hadoop-1.2.1.tar.gz
3. Move the extracted folder (hadoop-1.2.1) from the Downloads folder to /home/hduser
ouidad@host1:~/Downloads$ sudo cp hadoop-1.2.1 /home/hduser/ -r
4. Change the ownership
ouidad@host1:~/Downloads$ sudo chown -R hduser:hadoop /home/hduser/hadoop-1.2.1
5. Bashrc file configuration (All machines)
First log in to the hduser account, then run the following command:
hduser@host1:~$ sudo gedit ~/.bashrc
At the end of the file, add the following lines:
export JAVA_HOME=~/jdk1.6.0_45
export PATH=$JAVA_HOME/bin:$PATH
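Reload the file and check that the variables are set; a quick sanity check:
hduser@host1:~$ source ~/.bashrc
hduser@host1:~$ echo $JAVA_HOME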
6. Hdfs folder creation (All machines)
First log in to the hduser account, then create the following folder:
hduser@host1:~$ sudo mkdir -p /home/hduser/hdfs/temp
hduser@host1:~$ sudo chown hduser:hadoop /home/hduser/hdfs/temp
hduser@host1:~$ sudo chmod 775 /home/hduser/hdfs/temp/
7. Hadoop Files Configuration (Slave machines)
Move to the ~/hadoop-1.2.1/conf folder to change the following files.
7.1. hadoop-env.sh File
hduser@host1:~/hadoop-1.2.1/conf$ sudo gedit hadoop-env.sh
Replace the following two lines:
# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
with the following (the second line is now uncommented):
# The java implementation to use. Required.
export JAVA_HOME=~/jdk1.6.0_45
Then, add at the end of the file:
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
7.2. core-site.xml File
hduser@host1:~/hadoop-1.2.1/conf$ sudo gedit core-site.xml
Add the following lines between the <configuration> tags:
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hduser/hdfs/temp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://master:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
<property>
<name>dfs.name.dir</name>
<value>/home/hduser/hdfs/temp</value>
</property>
7.3. mapred-site.xml File
hduser@host1:~/hadoop-1.2.1/conf$ sudo gedit mapred-site.xml
Add the following lines between the <configuration> tags:
<property>
<name>mapred.job.tracker</name>
<value>master:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
7.4. hdfs-site.xml File
hduser@host1:~/hadoop-1.2.1/conf$ sudo gedit hdfs-site.xml
Add the following lines between the <configuration> tags:
<property>
<name>dfs.replication</name>
<value>3</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
Note: The number 3 is the default block replication factor. For a cluster of 3-10 nodes, a replication factor of 3 is a reasonable choice.
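As the property description notes, the replication factor can also be changed for a file after its creation; for example, for a hypothetical file already stored in HDFS:
bin/hadoop dfs -setrep -w 2 /home/hduser/somefile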
8. Hadoop Files Configuration (Master)
8.1. core-site.xml File
hduser@master:~/hadoop-1.2.1/conf$ sudo gedit core-site.xml
Add the following lines between the <configuration> tags:
<property>
<name>fs.default.name</name>
<value>hdfs://master:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hduser/hdfs/temp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>dfs.name.dir</name>
<value>/home/hduser/hdfs/temp</value>
</property>
8.2. mapred-site.xml File
hduser@master:~/hadoop-1.2.1/conf$ sudo gedit mapred-site.xml
Add the following lines between the <configuration> tags:
<property>
<name>mapred.job.tracker</name>
<value>master:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
8.3. hdfs-site.xml File
hduser@master:~/hadoop-1.2.1/conf$ sudo gedit hdfs-site.xml
Add the following lines between the <configuration> tags:
<property>
<name>dfs.replication</name>
<value>3</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
8.4. slaves File
hduser@master:~/hadoop-1.2.1/conf$ sudo gedit slaves
Comment out localhost, and add the names of your slaves (you can make your master node act as both master and slave at the same time by adding the master hostname to the slaves file):
master
host1
host2
…
8.5. masters File
hduser@master:~/hadoop-1.2.1/conf$ sudo gedit masters
Comment out localhost, and add the name of your master node:
master
Connecting Nodes
1. IP address configuration (All machines)
1.1. Find out the IP address of each machine
hduser@host1:~$ ifconfig
eth0 Link encap:Ethernet HWaddr 00:23:ae:b0:89:ae
inet addr:10.50.0.170 Bcast:10.50.255.255 Mask:255.255.0.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:198693 errors:0 dropped:0 overruns:0 frame:0
TX packets:9134 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:30871002 (30.8 MB) TX bytes:1334436 (1.3 MB)
Interrupt:21 Memory:fe6e0000-fe700000
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:58 errors:0 dropped:0 overruns:0 frame:0
TX packets:58 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:9306 (9.3 KB) TX bytes:9306 (9.3 KB)
1.2. Find out the host name of each machine
hduser@host1:~$ sudo gedit /etc/hostname
1.3. Open the hosts file (on each machine)
hduser@host1:~$ sudo gedit /etc/hosts
Replace the content of the file with the IP addresses and hostnames of all machines included in the cluster.
10.50.0.197 master
10.50.0.94 slave
…
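To confirm that the hostnames resolve correctly, you can ping each node by name from any machine:
hduser@host1:~$ ping -c 1 master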
2. Connect the master's hduser account with the hduser account on the slaves
Example: For machine with hostname host1
hduser@master:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@host1
Example: For machine with hostname host2
hduser@master:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@host2
3. Test the connection between the master and each slave machine
hduser@master:~$ ssh host1
Welcome to Ubuntu 12.04.2 LTS (GNU/Linux 3.5.0-23-generic i686)
* Documentation: https://help.ubuntu.com/
System information as of Sun Jun 30 19:44:28 WEST 2013
System load: 0.08 Processes: 159
Usage of /: 77.7% of 228.23GB Users logged in: 2
Memory usage: 35% IP address for eth0: 10.50.0.170
Swap usage: 0%
=> There is 1 zombie process.
Graph this data and manage this system at https://landscape.canonical.com/
97 packages can be updated.
66 updates are security updates.
Last login: Sun Jun 30 18:39:15 2013 from ip6-localhost
If the connection is established, close it to continue the installation:
hduser@host5:~$ exit
logout
Connection to host5 closed.
Formatting the HDFS & Starting Multi-node Cluster
1. Format the HDFS filesystem via the NameNode
hduser@master:~/hadoop-1.2.1$ bin/hadoop namenode -format
Here is the output:
13/06/30 20:00:42 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = master/10.50.0.197
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.2.1
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1 -r 1440782; compiled by
'hortonfo' on Thu Jan 31 02:03:24 UTC 2013
************************************************************/
13/06/30 20:00:42 INFO util.GSet: VM type = 32-bit
13/06/30 20:00:42 INFO util.GSet: 2% max memory = 19.33375 MB
13/06/30 20:00:42 INFO util.GSet: capacity = 2^22 = 4194304 entries
13/06/30 20:00:42 INFO util.GSet: recommended=4194304, actual=4194304
13/06/30 20:00:42 INFO namenode.FSNamesystem: fsOwner=hduser
13/06/30 20:00:42 INFO namenode.FSNamesystem: supergroup=supergroup
13/06/30 20:00:42 INFO namenode.FSNamesystem: isPermissionEnabled=true
13/06/30 20:00:42 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
13/06/30 20:00:42 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s),
accessTokenLifetime=0 min(s)
13/06/30 20:00:42 INFO namenode.NameNode: Caching file names occuring more than 10 times
13/06/30 20:00:42 INFO common.Storage: Image file of size 112 saved in 0 seconds.
13/06/30 20:00:42 INFO namenode.FSEditLog: closing edit log: position=4,
editlog=/home/hduser/hdfs/temp/dfs/name/current/edits
13/06/30 20:00:42 INFO namenode.FSEditLog: close success: truncate to 4,
editlog=/home/hduser/hdfs/temp/dfs/name/current/edits
13/06/30 20:00:43 INFO common.Storage: Storage directory /home/hduser/hdfs/temp/dfs/name has been successfully
formatted.
13/06/30 20:00:43 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at master/10.50.0.197
************************************************************/
2. Start the multi-node cluster
hduser@master:~/hadoop-1.2.1$ bin/start-all.sh
Alternatively, start the DFS and MapReduce daemons separately:
hduser@master:~/hadoop-1.2.1$ bin/start-dfs.sh
hduser@master:~/hadoop-1.2.1$ bin/start-mapred.sh
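Once the daemons are running, you can also check the standard Hadoop 1.x web interfaces on their default ports: the NameNode status page at http://master:50070 and the JobTracker at http://master:50030.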
3. On the master machine, check that the following Java processes are running:
hduser@master:~$ jps
5721 SecondaryNameNode
6738 DataNode
5243 NameNode
6047 TaskTracker
8423 Jps
5805 JobTracker
4. On the slave machines, check that the following Java processes are running:
hduser@master:~$ jps
1902 DataNode
4002 Jps
2108 TaskTracker
If you get the following output instead:
hduser@host1:~/hadoop-1.2.1/conf$ jps
The program 'jps' can be found in the following packages:
* openjdk-6-jdk
* openjdk-7-jdk
Ask your administrator to install one of them
Then install one of the suggested packages:
hduser@host1:~/hadoop-1.2.1/conf$ sudo apt-get install openjdk-7-jdk
Note: if you do not see the same processes, follow the suggestions provided in the Hadoop Troubleshooting section below.
Hadoop Troubleshooting
1. Formatting the Namenode Exception: “Cannot lock storage…”
hduser@master:~/hadoop-1.2.1$ bin/hadoop namenode -format
13/06/30 19:57:35 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = master/10.50.0.197
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.2.1
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1 -r 1440782;
compiled by 'hortonfo' on Thu Jan 31 02:03:24 UTC 2013
************************************************************/
…
….
13/06/30 19:57:38 ERROR namenode.NameNode: java.io.IOException: Cannot lock storage
/home/hduser/hdfs/temp/dfs/name. The directory is already locked.
at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:599)
at org.apache.hadoop.hdfs.server.namenode.FSImage.format(FSImage.java:1327)
at org.apache.hadoop.hdfs.server.namenode.FSImage.format(FSImage.java:1345)
at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:1207)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1398)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1419)
13/06/30 19:57:38 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at master/10.50.0.197
************************************************************/
Solution
Step 1: Stop all processes
hduser@master:~/hadoop-1.2.1$ bin/stop-all.sh
Step 2: move to the ~/hdfs/temp folder and run the following command
hduser@master:~/hdfs/temp$ sudo rm -rf *
Step 3: restart your work by formatting the NameNode
hduser@master:~/hadoop-1.2.1$ bin/hadoop namenode -format
2. Formatting the Namenode Exception: “Cannot create directory
/home/hduser/hdfs…”
Solution
In this case, make sure that you set the following permissions when creating the /hdfs/temp folder:
hduser@host1:~$ sudo chmod 750 /home/hduser/hdfs/temp/
3. Exception in log file: hadoop-hduser-datanode-host1.log or when Hadoop DataNode
doesn’t show up in slave nodes
hduser@host1:~/hadoop-1.2.1/logs$ sudo gedit hadoop-hduser-datanode-host1.log
2013-06-30 19:01:09,078 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException:
Incompatible namespaceIDs in /home/hduser/hdfs/temp/dfs/data: namenode namespaceID = 1345454277;
datanode namespaceID = 1875045188
at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:232)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:147)
at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:399)
at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:309)
at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1651)
at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1590)
at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1608)
at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:1734)
at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1751)
Solution 1
1. From master machine, open VERSION file under /hdfs/temp/dfs/name/current folder:
hduser@master:~/hdfs/temp/dfs/name/current$ sudo gedit VERSION
Here is the content of VERSION file:
#Sun Jun 30 20:00:43 WEST 2013
namespaceID=1289101159
cTime=0
storageType=NAME_NODE
layoutVersion=-32
Check the value of the namespaceID variable (in this case it is 1289101159); remember it, as you will need it in the next step.
2. From all slave machines where you found the above exception, open the VERSION file
under /hdfs/temp/dfs/data/current folder:
hduser@host1:~/hdfs/tmp/dfs/data/current$ sudo gedit VERSION
Here is the content of VERSION file:
#Fri Jun 14 09:22:08 WET 2013
namespaceID=176572587
storageID=DS-1900366223-127.0.1.1-50010-1371201728420
cTime=0
storageType=DATA_NODE
layoutVersion=-32
Replace the namespaceID variable with the value you found in the VERSION file of the master.
The content of the VERSION file under the /hdfs/temp/dfs/data/current folder then becomes:
#Fri Jun 14 09:22:08 WET 2013
namespaceID=1289101159
storageID=DS-1900366223-127.0.1.1-50010-1371201728420
cTime=0
storageType=DATA_NODE
layoutVersion=-32
Solution 2
1. Stop the whole cluster
2. Delete the data directory on the problematic DataNode: the directory is specified by dfs.data.dir in conf/hdfs-site.xml; if you followed this tutorial, the relevant directory is /hdfs/temp/dfs/data.
3. Reformat the NameNode.
4. Restart the cluster.
4. Safe mode exception when running MapReduce examples
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.SafeModeException:
Cannot delete /benchmarks/TestDFSIO. Name node is in safe mode.
The reported blocks is only 3601 but the threshold is 0.9990 and the total blocks 3748. Safe mode will be turned
off automatically.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInternal(FSNamesystem.java:2111)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:2088)
at org.apache.hadoop.hdfs.server.namenode.NameNode.delete(NameNode.java:832)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
Solution
hduser@master:~/hadoop-1.2.1$ bin/hadoop dfsadmin -safemode leave
Safe mode is OFF
hduser@master:~/hadoop-1.2.1$ bin/hadoop jar hadoop-*test*.jar TestDFSIO -clean
TestDFSIO.0.0.4
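To check the safe mode status without forcing it off, the following command reports it:
hduser@master:~/hadoop-1.2.1$ bin/hadoop dfsadmin -safemode get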
References for Appendix C
[1] http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
[2] http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/#solution-2-manually-update-the-namespaceid-of-problematic-datanodes
Appendix D: TeraSort and TestDFSIO Execution
1. TeraSort
1.1. Generate the TeraSort input data using TeraGen
TeraGen generates random data that can be conveniently used as input for a subsequent TeraSort run. The command to run TeraGen in order to generate 100 MB of input data is:
bin/hadoop jar hadoop-*examples*.jar teragen 1000000 /home/hduser/terasort-input
1000000 specifies the number of rows of input data to generate, each of which is 100 bytes, giving 100 MB in total.
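The dataset size thus scales linearly with the row count: 1,000,000 rows x 100 bytes = 100 MB. For instance, generating the 1 GB input used in the experiments takes 10,000,000 rows:
bin/hadoop jar hadoop-*examples*.jar teragen 10000000 /home/hduser/terasort-input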
1.2. Run the actual TeraSort benchmark
The syntax to run the TeraSort benchmark is as follows:
bin/hadoop jar hadoop-*examples*.jar terasort /home/hduser/terasort-input /home/hduser/terasort-output
1.3. Validate the sorted output data of TeraSort using TeraValidate
The syntax to run TeraValidate is as follows; the first argument is the TeraSort output directory, and the second is a directory (here called terasort-validate) where the validation report is written:
bin/hadoop jar hadoop-*examples*.jar teravalidate /home/hduser/terasort-output /home/hduser/terasort-validate
1.4. Check the TeraSort analysis
To check the average time taken to generate the 100 MB of input, run the following command:
bin/hadoop job -history /home/hduser/terasort-input
To check the average time taken to sort the 100 MB, run the following command:
bin/hadoop job -history /home/hduser/terasort-output
1.5. Clean up your temporary files
When re-running the TeraSort benchmark, you first need to clean up all files generated by the previous TeraSort test:
bin/hadoop dfs -rmr /home/hduser/terasort-input
bin/hadoop dfs -rmr /home/hduser/terasort-output
2. TestDFSIO
2.1. Write data using TestDFSIO-Write
To generate a 100 MB dataset, you specify an input of 10 files, each of 10 MB. To do so, the following command needs to be executed:
hadoop jar hadoop-*test*.jar TestDFSIO -write -nrFiles 10 -fileSize 10
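Larger datasets are obtained by scaling these two parameters; for example, a 1000 MB (1 GB) dataset can be written as 10 files of 100 MB each:
hadoop jar hadoop-*test*.jar TestDFSIO -write -nrFiles 10 -fileSize 100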
A sample output of the TestDFSIO write operation provides information about the throughput, average I/O rate, I/O rate standard deviation and test execution time.
13/11/07 15:37:27 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
13/11/07 15:37:27 INFO fs.TestDFSIO: Date & time: Thu Nov 07 15:37:27 UTC 2013
13/11/07 15:37:27 INFO fs.TestDFSIO: Number of files: 10
13/11/07 15:37:27 INFO fs.TestDFSIO: Total MBytes processed: 100
13/11/07 15:37:27 INFO fs.TestDFSIO: Throughput mb/sec: 5.680527152919791
13/11/07 15:37:27 INFO fs.TestDFSIO: Average IO rate mb/sec: 9.899490356445312
13/11/07 15:37:27 INFO fs.TestDFSIO: IO rate std deviation: 7.567628183406918
13/11/07 15:37:27 INFO fs.TestDFSIO: Test exec time sec: 17.568
13/11/07 15:37:27 INFO fs.TestDFSIO:
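Note that TestDFSIO also appends these figures to a local log file (TestDFSIO_results.log by default) in the directory from which the benchmark was launched, which is convenient for collecting the measurements reported in Appendix F.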
2.2. Read data using TestDFSIO-Read
After getting the results of the TestDFSIO-write command, the next step is to run the TestDFSIO-read operation. To read the previously generated data, the following command needs to be executed:
hadoop jar hadoop-*test*.jar TestDFSIO -read -nrFiles 10 -fileSize 10
A sample output of the read operation provides information about the throughput, average I/O rate, I/O rate standard deviation and test execution time.
13/11/07 15:38:11 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
13/11/07 15:38:11 INFO fs.TestDFSIO: Date & time: Thu Nov 07 15:38:11 UTC 2013
13/11/07 15:38:11 INFO fs.TestDFSIO: Number of files: 10
13/11/07 15:38:11 INFO fs.TestDFSIO: Total MBytes processed: 100
13/11/07 15:38:11 INFO fs.TestDFSIO: Throughput mb/sec: 70.57163020465772
13/11/07 15:38:11 INFO fs.TestDFSIO: Average IO rate mb/sec: 73.69004821777344
13/11/07 15:38:11 INFO fs.TestDFSIO: IO rate std deviation: 16.249892929638822
13/11/07 15:38:11 INFO fs.TestDFSIO: Test exec time sec: 15.51
13/11/07 15:38:11 INFO fs.TestDFSIO:
2.3. Clean your cluster
The last step is to clean up the generated data using the following command:
bin/hadoop jar hadoop-*test*.jar TestDFSIO -clean
Appendix E: Data Gathering for TeraSort
1. Hadoop Physical Cluster
Number of Machines Dataset Size Phase Test 1 Test 2 Test 3 Mean
3
100 MB Map 6 7 6 6.33
100 MB Shuffling 10 10 11 10.33
100 MB Reduce 5 5 4 4.67
1 GB Map 13 14 16 14.33
1 GB Shuffling 83 81 85 83.00
1 GB Reduce 99 77 93 89.67
10 GB Map 31 22 19 24.00
10 GB Shuffling 1065 921 930 972.00
10 GB Reduce 1511 1841 1679 1677.00
30 GB Map 26 25 28 26.33
30 GB Shuffling 2971 2312 3081 2788.00
30 GB Reduce 9522 7544 8434 8500.00
4
100 MB Map 5 7 7 6.33
100 MB Shuffling 10 10 10 10.00
100 MB Reduce 4 6 4 4.67
1 GB Map 14 16 15 15.00
1 GB Shuffling 81 79 80 80.00
1 GB Reduce 99 87 82 89.33
10 GB Map 19 21 19 19.67
10 GB Shuffling 951 921 881 917.67
10 GB Reduce 1680 1714 1421 1605.00
30 GB Map 21 22 20 21.00
30 GB Shuffling 2860 2912 3120 2964.00
30 GB Reduce 5908 6412 6109 6143.00
5
100 MB Map 5 6 6 5.67
100 MB Shuffling 10 10 10 10.00
100 MB Reduce 5 5 5 5.00
1 GB Map 14 14 13 13.67
1 GB Shuffling 79 87 80 82.00
1 GB Reduce 84 93 83 86.67
10 GB Map 21 19 20 20.00
10 GB Shuffling 937 900 857 898.00
10 GB Reduce 1729 1611 1360 1566.67
30 GB Map 19 23 22 21.33
30 GB Shuffling 2446 2710 2650 2602.00
30 GB Reduce 5437 7118 6821 6458.67
6
100 MB Map 6 6 5 5.67
100 MB Shuffling 10 10 11 10.33
100 MB Reduce 4 4 5 4.33
1 GB Map 16 18 14 16.00
1 GB Shuffling 86 89 83 86.00
1 GB Reduce 91 75 73 79.67
10 GB Map 18 18 16 17.33
10 GB Shuffling 885 929 906 906.67
10 GB Reduce 1147 1515 1097 1253.00
30 GB Map 20 19 20 19.67
30 GB Shuffling 2731 2694 2725 2716.67
30 GB Reduce 6419 6210 5877 6168.67
7
100 MB Map 6 5 6 5.67
100 MB Shuffling 10 10 10 10.00
100 MB Reduce 5 5 4 4.67
1 GB Map 16 18 12 15.33
1 GB Shuffling 83 81 87 83.67
1 GB Reduce 85 83 80 82.67
10 GB Map 23 27 25 25.00
10 GB Shuffling 985 910 979 958.00
10 GB Reduce 1681 1591 1514 1595.33
30 GB Map 37 23 40 33.33
30 GB Shuffling 2983 2796 2882 2887.00
30 GB Reduce 6514 5891 5338 5914.33
8
100 MB Map 5 5 5 5.00
100 MB Shuffling 10 10 10 10.00
100 MB Reduce 5 4 5 4.67
1 GB Map 15 11 10 12.00
1 GB Shuffling 92 91 88 90.33
1 GB Reduce 80 76 75 77.00
10 GB Map 20 25 29 24.67
10 GB Shuffling 925 1020 893 946.00
10 GB Reduce 1043 1679 2092 1604.67
30 GB Map 27 24 30 27.00
30 GB Shuffling 2812 2777 2834 2807.67
30 GB Reduce 5319 6317 5395 5677.00
2. Hadoop Virtualized Cluster - KVM
Number of KVM VMs Dataset Size Phase Test 1 Test 2 Test 3 Mean
3
100 MB Map 4 6 5 5
100 MB Shuffling 7 7 7 7
100 MB Reduce 3 3 3 3
1 GB Map 12 14 12 12.67
1 GB Shuffling 37 37 38 37.33
1 GB Reduce 41 41 40 40.67
10 GB Map 24 20 23 22.33
10 GB Shuffling 781 737 718 745.33
10 GB Reduce 336 345 392 357.67
30 GB Map 24 24 23 23.67
30 GB Shuffling 2150 2220 2172 2180.67
30 GB Reduce 1559 1542 1539 1546.67
4
100 MB Map 5 5 5 5.00
100 MB Shuffling 6 7 7 6.67
100 MB Reduce 3 3 3 3.00
1 GB Map 12 15 16 14.33
1 GB Shuffling 28 34 38 33.33
1 GB Reduce 38 40 40 39.33
10 GB Map 28 29 23 26.67
10 GB Shuffling 657 672 657 662.00
10 GB Reduce 438 442 419 433.00
100 GB Map 25 28 25 26.00
100 GB Shuffling 1952 2046 1887 1961.67
100 GB Reduce 1616 1517 1605 1579.33
5
100 MB Map 5 5 5 5.00
100 MB Shuffling 6 7 6 6.33
100 MB Reduce 3 3 3 3.00
1 GB Map 61 64 85 70.00
1 GB Shuffling 113 109 139 120.33
1 GB Reduce 51 41 42 44.67
10 GB Map 33 29 32 31.33
10 GB Shuffling 746 632 877 751.67
10 GB Reduce 445 477 358 426.67
100 GB Map 37 66 51 51.33
100 GB Shuffling 3446 3332 2816 3198.00
100 GB Reduce 1413 1597 1788 1599.33
6
100 MB Map 5 5 4 4.67
100 MB Shuffling 6 6 6 6.00
100 MB Reduce 3 4 4 3.67
1 GB Map 224 343 266 277.67
1 GB Shuffling 511 464 492 489.00
1 GB Reduce 56 48 63 55.67
10 GB Map 45 37 42 41.33
10 GB Shuffling 1652 1387 1745 1594.67
10 GB Reduce 404 412 532 449.33
100 GB Map 140 180 50 123.33
100 GB Shuffling 7402 10197 5710 7769.67
100 GB Reduce 1717 1565 1206 1496.00
7
100 MB Map 5 5 5 5.00
100 MB Shuffling 6 6 6 6.00
100 MB Reduce 4 3 3 3.33
1 GB Map 124 245 365 244.67
1 GB Shuffling 1083 958 1344 1128.33
1 GB Reduce 102 121 81 101.33
10 GB Map 61 63 58 60.67
10 GB Shuffling 1024 1984 2062 1690.00
10 GB Reduce 985 1101 1024 1036.67
100 GB Map 185 163 154 167.33
100 GB Shuffling 12112 10197 12024 11444.33
100 GB Reduce 1987 1851 2106 1981.33
8
100 MB Map 5 5 5 5.00
100 MB Shuffling 6 6 6 6.00
100 MB Reduce 4 3 3 3.33
1 GB Map 162 193 167 174.00
1 GB Shuffling 1201 1320 1259 1260.00
1 GB Reduce 545.4 244.42 163.62 317.81
10 GB Map 104 121 97 107.33
10 GB Shuffling 2489.52 2440.32 2536.26 2488.70
10 GB Reduce 1211.55 1354.23 2283.52 1616.43
100 GB Map 201 195 168 188
100 GB Shuffling 11087 14587 13214 12962.667
100 GB Reduce 3088 3145 2906 3046.3333
3. Hadoop Virtualized Cluster - VMware ESXi
Number of VMware VMs Dataset Size Phase Test 1 Test 2 Test 3 Mean
3
100 MB Map 5 5 5 5
100 MB Shuffling 8 7 7 7
100 MB Reduce 4 4 4 4
1 GB Map 18 16 16 17
1 GB Shuffling 42 49 41 44
1 GB Reduce 40 38 39 39
10 GB Map 24 22 23 23
10 GB Shuffling 660 636 645 647
10 GB Reduce 492 483 493 489
30 GB Map 44 44 43 44
30 GB Shuffling 4108 3952 3891 3984
30 GB Reduce 2278 2315 2101 2231
4
100 MB Map 5 5 5 5
100 MB Shuffling 7 7 8 7
100 MB Reduce 4 4 4 4
1 GB Map 19 15 15 16
1 GB Shuffling 38 39 42 40
1 GB Reduce 42 41 40 41
10 GB Map 25 24 25 24.66667
10 GB Shuffling 672 691 682 682
10 GB Reduce 486 425 411 440.6667
30 GB Map 35 51 43 43
30 GB Shuffling 2657 3257 3214 3042.667
30 GB Reduce 1985 1852 1865 1900.667
5
100 MB Map 7 5 5 6
100 MB Shuffling 8 7 7 7
100 MB Reduce 4 3 3 3
1 GB Map 19 21 18 19
1 GB Shuffling 35 30 32 32
1 GB Reduce 39 35 37 37
10 GB Map 31 26 28 28
10 GB Shuffling 553 514 503 523
10 GB Reduce 418 432 421 424
30 GB Map 39 36 45 40
30 GB Shuffling 2540 2412 2286 2413
30 GB Reduce 2310 2245 2101 2219
6
100 MB Map 5 6 5 5
100 MB Shuffling 7 7 6 7
100 MB Reduce 5 4 4 4
1 GB Map 18 18 17 18
1 GB Shuffling 28 29 27 28
1 GB Reduce 32 29 34 32
10 GB Map 59 42 41 47
10 GB Shuffling 536 552 529 539
10 GB Reduce 369 385 336 363
30 GB Map 30 32 28 30
30 GB Shuffling 2412 2254 2114 2260
30 GB Reduce 2098 1671 1658 1809
7
100 MB Map 10 10 8 9
100 MB Shuffling 12 11 8 10
100 MB Reduce 4 4 4 4
1 GB Map 24 29 26 26
1 GB Shuffling 35 32 39 35
1 GB Reduce 26 34 25 28
10 GB Map 52 56 52 53
10 GB Shuffling 536 520 511 522
10 GB Reduce 298 290 302 297
30 GB Map 84 76 87 82
30 GB Shuffling 3210 2687 2968 2955
30 GB Reduce 1743 1523 1621 1629
8
100 MB Map 17 16 11 15
100 MB Shuffling 15 16 14 15
100 MB Reduce 4 4 4 4
1 GB Map 81 79 81 80
1 GB Shuffling 92 93 82 89
1 GB Reduce 36 36 37 36
10 GB Map 128 102 127 119
10 GB Shuffling 1340 1102 1021 1154
10 GB Reduce 509 562 554 542
30 GB Map 144 137 142 141
30 GB Shuffling 4481 4251 4012 4248
30 GB Reduce 1753 1578 1697 1676
Appendix F: Data Gathering for TestDFSIO
1. Hadoop Physical Cluster
Number of Nodes = 3
Dataset Size Operation Criteria Test1 Test2 Test3 Mean
100 MB
Write
Throughput (mb/sec) 2.867 2.861 2.421 2.72
Average IO rate (mb/sec) 2.903 2.957 2.517 2.79
IO rate standard deviation 0.363 0.505 0.486 0.45
Execution time (sec) 17.786 16.717 18.8 17.77
Read
Throughput (mb/sec) 7.645 6.309 6.558 6.84
Average IO rate (mb/sec) 19.509 11.442 31.255 20.74
IO rate standard deviation 26.167 14.595 40.655 27.14
Execution time (sec) 14.72 16.721 14.705 15.38
1 GB
Write
Throughput (mb/sec) 2.507 2.713 2.204 2.47
Average IO rate (mb/sec) 2.889 2.866 2.498 2.75
IO rate standard deviation 1.2632 0.765 0.929 0.99
Execution time (sec) 77.129 74.47 83.658 78.42
Read
Throughput (mb/sec) 6.037 7.297 5.068 6.13
Average IO rate (mb/sec) 10.231 31.235 8.779 16.75
IO rate standard deviation 10.784 39.149 9.712 19.88
Execution time (sec) 43.468 35.947 42.779 40.73
10 GB
Write
Throughput (mb/sec) 2.503 2.589 3.288 2.79
Average IO rate (mb/sec) 2.671 2.761 3.318 2.92
IO rate standard deviation 0.796 0.817 0.317 0.64
Execution time (sec) 674.535 641.232 363.144 559.64
Read
Throughput (mb/sec) 7.956 7.799 5.458 7.07
Average IO rate (mb/sec) 11.289 12.452 5.786 9.84
IO rate standard deviation 6.421 12.916 1.485 6.94
Execution time (sec) 241.896 296.722 257.708 265.44
100 GB
Write
Throughput (mb/sec) 3.544 3.275 3.275 3.36
Average IO rate (mb/sec) 3.546 3.284 3.282 3.37
IO rate standard deviation 0.089 0.165 0.148 0.13
Execution time (sec) 3315.61 3343.122 3338.37 3332.37
Read
Throughput (mb/sec) 4.746 5.109 5.603 5.15
Average IO rate (mb/sec) 4.791 5.238 12.875 7.63
IO rate standard deviation 0.478 0.852 18.333 6.55
Execution time (sec) 2387.659 2467.634 2734.46 2529.92
Number of Nodes = 4
Dataset Size Operation Criteria Test1 Test2 Test3 Mean
100 MB
Write
Throughput (mb/sec) 2.65 3.303 3.639 3.20
Average IO rate (mb/sec) 2.661 3.543 3.932 3.38
IO rate standard deviation 0.173 0.796 1.212 0.73
Execution time (sec) 17.665 17.039 15.674 16.79
Read
Throughput (mb/sec) 6.405 9.038 5.827 7.09
Average IO rate (mb/sec) 19.631 31.433 19.547 23.54
IO rate standard deviation 28.056 37.351 29.466 31.62
Execution time (sec) 15.35 13.684 14.676 14.57
1 GB
Write
Throughput (mb/sec) 2.556 2.79 2.786 2.71
Average IO rate (mb/sec) 2.669 2.954 2.885 2.84
IO rate standard deviation 0.582 0.747 0.536 0.62
Execution time (sec) 59.677 58.02 61.536 59.74
Read
Throughput (mb/sec) 12.133 6.031 8.264 8.81
Average IO rate (mb/sec) 27.419 7.751 25.182 20.12
IO rate standard deviation 34.001 4.998 35.168 24.72
Execution time (sec) 33.087 40.004 32.861 35.32
10 GB
Write
Throughput (mb/sec) 3.713 3.325 3.201 3.41
Average IO rate (mb/sec) 3.735 3.341 3.22 3.43
IO rate standard deviation 0.283 0.236 0.248 0.26
Execution time (sec) 315.636 347.593 367.294 343.51
Read
Throughput (mb/sec) 5.045 5.738 5.006 5.26
Average IO rate (mb/sec) 5.205 11.437 5.138 7.26
IO rate standard deviation 1.779 15.779 0.884 6.15
Execution time (sec) 258.009 261.24 276.283 265.18
100 GB
Write
Throughput (mb/sec) 3.533 3.354 3.366 3.42
Average IO rate (mb/sec) 3.538 3.356 3.37 3.42
IO rate standard deviation 0.136 0.085 0.111 0.11
Execution time (sec) 3557.813 3507.078 3184.76 3416.5
Read
Throughput (mb/sec) 7.009 6.716 4.349 6.02
Average IO rate (mb/sec) 7.6966 12.229 10.179 10.03
IO rate standard deviation 5.129 9.796 12.546 9.16
Execution time (sec) 2098.422 2700.035 3046.01 2614.8
Number of Nodes = 5
Dataset Size Operation Criteria Test1 Test2 Test3 Mean
100 MB
Write
Throughput (mb/sec) 2.597 2.791 3.804 3.06
Average IO rate (mb/sec) 2.623 2.841 3.941 3.14
IO rate standard deviation 0.28 0.406 0.772 0.49
Execution time (sec) 16.672 15.708 16.679 16.35
Read
Throughput (mb/sec) 8.019 7.213 10.68 8.64
Average IO rate (mb/sec) 32.097 35.821 46.053 37.99
IO rate standard deviation 40.452 47.846 40.287 42.86
Execution time (sec) 14.584 14.501 14.579 14.55
1 GB
Write
Throughput (mb/sec) 2.477 2.896 2.676 2.68
Average IO rate (mb/sec) 2.572 3.032 2.757 2.79
IO rate standard deviation 0.533 0.64 0.533 0.57
Execution time (sec) 59.372 56.319 54.271 56.65
Read
Throughput (mb/sec) 7.659 5.617 8.738 7.34
Average IO rate (mb/sec) 11.029 8.651 25.868 15.18
IO rate standard deviation 7.049 8.984 41.954 19.33
Execution time (sec) 36.214 35.04 30.18 33.81
10 GB
Write
Throughput (mb/sec) 3.309 3.337 3.382 3.34
Average IO rate (mb/sec) 3.329 3.367 3.415 3.37
IO rate standard deviation 0.264 0.335 0.331 0.31
Execution time (sec) 346.239 340.622 361.257 349.37
Read
Throughput (mb/sec) 6.309 5.741 4.771 5.61
Average IO rate (mb/sec) 9.178 13.109 4.839 9.04
IO rate standard deviation 9.178 23.064 0.609 10.95
Execution time (sec) 263.224 256.89 254.85 258.32
100 GB
Write
Throughput (mb/sec) 3.103 3.081 3.343 3.18
Average IO rate (mb/sec) 3.115 3.092 3.349 3.19
IO rate standard deviation 0.191 0.183 0.1386 0.17
Execution time (sec) 3552.118 3478.991 3177.001 3402.70
Read
Throughput (mb/sec) 4.737 5.198 4.478 4.80
Average IO rate (mb/sec) 6.078 5.292 4.512 5.29
IO rate standard deviation
Execution time (sec) 2462.739 2243.086 2558.051 2421.29
Number of Nodes = 6
Dataset Size Operation Criteria Test1 Test2 Test3 Mean
100 MB
Write
Throughput (mb/sec) 3.603 4.19 3.536 3.78
Average IO rate (mb/sec) 3.726 4.329 3.949 4.00
IO rate standard deviation 0.714 0.865 1.337 0.97
Execution time (sec) 16.679 15.703 15.792 16.06
Read
Throughput (mb/sec) 7.017 6.681 8.877 7.53
Average IO rate (mb/sec) 37.162 24.101 33.456 31.57
IO rate standard deviation 49.103 36.636 41.63 42.46
Execution time (sec) 14.165 13.942 14.833 14.31
1 GB
Write
Throughput (mb/sec) 3.089 3.155 3.0178 3.09
Average IO rate (mb/sec) 3.369 3.239 3.088 3.23
IO rate standard deviation 1.13 0.595 0.522 0.75
Execution time (sec) 55.472 51.491 51.749 52.90
Read
Throughput (mb/sec) 7.809 7.593 5.651 7.02
Average IO rate (mb/sec) 8.239 20.751 6.391 11.79
IO rate standard deviation 1.988 34.499 2.169 12.89
Execution time (sec) 33.23 34.396 32.392 33.34
10 GB
Write
Throughput (mb/sec) 3.366 3.133 3.782 3.43
Average IO rate (mb/sec) 3.386 3.139 3.796 3.44
IO rate standard deviation 0.267 0.14 0.229 0.21
Execution time (sec) 347.497 353.804 297.105 332.80
Read
Throughput (mb/sec) 5.681 6.327 14.756 8.92
Average IO rate (mb/sec) 10.302 14.173 27.573 17.35
IO rate standard deviation 13.222 18.079 22.233 17.84
Execution time (sec) 269.214 270.797 176.225 238.75
100 GB
Write
Throughput (mb/sec) 3.343 3.252 3.268 3.29
Average IO rate (mb/sec) 3.352 3.26 3.275 3.30
IO rate standard deviation 0.178 0.173 6.127 2.16
Execution time (sec) 3254.674 3329.312 3313.773 3299.25
Read
Throughput (mb/sec) 5.435 5.169 6.126 5.58
Average IO rate (mb/sec) 7.827 5.465 11.738 8.34
IO rate standard deviation 8.045 3.505 13.987 8.51
Execution time (sec) 2369.118 2481.531 2168.304 2339.65
Number of Nodes = 7
Dataset Size Operation Criteria Test1 Test2 Test3 Mean
100 MB
Write
Throughput (mb/sec) 3.475 3.028 3.475 3.33
Average IO rate (mb/sec) 3.928 3.263 3.928 3.71
IO rate standard deviation 1.605 0.905 1.605 1.37
Execution time (sec) 15.793 15.679 15.793 15.76
Read
Throughput (mb/sec) 9.034 6.669 9.034 8.25
Average IO rate (mb/sec) 29.731 14.642 29.731 24.70
IO rate standard deviation 35.058 25.436 35.058 31.85
Execution time (sec) 14.071 14.688 14.071 14.28
1 GB
Write
Throughput (mb/sec) 3.771 3.837 3.203 3.60
Average IO rate (mb/sec) 3.814 3.887 3.509 3.74
IO rate standard deviation 0.402 0.441 1.118 0.65
Execution time (sec) 44.285 41.408 52.404 46.03
Read
Throughput (mb/sec) 6.069 6.664 6.644 6.46
Average IO rate (mb/sec) 13.227 19.04 7.797 13.35
IO rate standard deviation 19.929 38.007 3.689 20.54
Execution time (sec) 42.883 37.004 32.181 37.36
10 GB
Write
Throughput (mb/sec) 3.377 3.548 3.636 3.52
Average IO rate (mb/sec) 3.395 3.568 3.646 3.54
IO rate standard deviation 0.248 0.28 0.194 0.24
Execution time (sec) 342.034 313.38 311.647 322.35
Read
Throughput (mb/sec) 5.909 7.832 7.661 7.13
Average IO rate (mb/sec) 8.364 18.227 14.755 13.78
IO rate standard deviation 6.168 22.808 17.955 15.64
Execution time (sec) 273.925 238.699 242.805 251.81
100 GB
Write
Throughput (mb/sec) 2.698 3.49 3.609 3.27
Average IO rate (mb/sec) 2.77 3.493 3.611 3.29
IO rate standard deviation 0.508 0.083 0.075 0.22
Execution time (sec) 2987.432 2972.533 2849.33 2936.43
Read
Throughput (mb/sec) 3.9676 4.499 4.804 4.42
Average IO rate (mb/sec) 6.569 9.992 6.072 7.54
IO rate standard deviation 8.425 14.613 3.837 8.96
Execution time (sec) 1846.735 1952.279 2653.414 2150.81
Number of Nodes = 8
Dataset Size Operation Criteria Test1 Test2 Test3 Mean
100 MB
Write
Throughput (mb/sec) 3.229 2.197 1.828 3.51
Average IO rate (mb/sec) 3.485 2.235 2.198 3.78
IO rate standard deviation 1.161 0.324 1.507 1.2115
Execution time (sec) 15.854 16.137 16.931 15.7645
Read
Throughput (mb/sec) 5.521 5.721 5.361 7.767
Average IO rate (mb/sec) 18.754 18.857 16.446 30.20
IO rate standard deviation 40.071 40.038 32.486 41.78
Execution time (sec) 14.75 15.846 15.278 14.72
1 GB
Write
Throughput (mb/sec) 3.977 3.701 4.010 3.82
Average IO rate (mb/sec) 4.067 3.826 4.061 3.91
IO rate standard deviation 0.614 0.758 0.473 0.63
Execution time (sec) 38.578 46.594 43.285 43.11
Read
Throughput (mb/sec) 6.079 6.579 5.672 6.49
Average IO rate (mb/sec) 14.054 36.377 14.545 20.03
IO rate standard deviation 24.771 67.854 26.687 35.06
Execution time (sec) 40.959 40.594 42.45 39.94
10 GB
Write
Throughput (mb/sec) 3.718 3.441 3.432 3.52
Average IO rate (mb/sec) 3.733 3.466 3.460 3.54
IO rate standard deviation 0.244 0.306 0.320 0.23
Execution time (sec) 305.018 332.807 337.531 323.16
Read
Throughput (mb/sec) 7.692 8.208 6.057 6.43
Average IO rate (mb/sec) 14.645 16.074 13.983 11.94
IO rate standard deviation 12.945 16.939 19.665 12.82
Execution time (sec) 292.329 230.154 289.889 286.00
100 GB
Write
Throughput (mb/sec) 3.208 3.568 3.592 3.54
Average IO rate (mb/sec) 3.216 3.571 3.595 3.55
IO rate standard deviation 0.154 0.086 0.095 0.13
Execution time (sec) 3386.666 2908.484 2918.7 2988.4
Read
Throughput (mb/sec) 4.959 4.648 5.303 5.23
Average IO rate (mb/sec) 4.941 4.694 6.544 7.66
IO rate standard deviation 1.292 0.495 4.823 6.73
Execution time (sec) 2508.348 2343.769 2358.37 2471.1
300 GB
Write
Throughput (mb/sec) 2.641 2.629 2.638 2.63
Average IO rate (mb/sec) 2.657 2.644 2.654 2.65
IO rate standard deviation 0.190 0.178 0.196 0.19
Execution time (sec) 7882.599 7862.87 8000.19 7917.8
Read
Throughput (mb/sec) 4.202 4.111 4.307 4.28
Average IO rate (mb/sec) 5.188 5.044 8.211 6.04
IO rate standard deviation 4.432 3.660 12.171 6.15
Execution time (sec) 5386.197 5796.385 5551.46 5546.3
2. Hadoop Virtualized Cluster - KVM
Number of KVM VMs = 3
Dataset Size Operation Criteria Test1 Test2 Test3 Mean
100 MB
Write
Throughput (mb/sec) 6.804 9.417 6.152 7.46
Average IO rate (mb/sec) 11.989 20.359 15.808 16.05
IO rate standard deviation 9.399 15.769 16.049 13.74
Execution time (sec) 15.439 15.405 13.405 14.75
Read
Throughput (mb/sec) 101.833 96.618 104.275 100.91
Average IO rate (mb/sec) 102.428 102.200 105.770 103.47
IO rate standard deviation 7.777 22.154 12.986 14.31
Execution time (sec) 13.479 14.881 13.399 13.92
1 GB
Write
Throughput (mb/sec) 7.764 7.231 8.201 7.73
Average IO rate (mb/sec) 9.173 7.397 11.249 9.27
IO rate standard deviation 3.808 1.115 6.539 3.82
Execution time (sec) 40.681 40.126 38.515 39.77
Read
Throughput (mb/sec) 22.912 19.046 25.187 22.38
Average IO rate (mb/sec) 30.131 19.969 40.464 30.19
IO rate standard deviation 16.609 4.422 34.112 18.38
Execution time (sec) 20.441 20.518 19.44 20.13
10 GB
Write
Throughput (mb/sec) 7.409 7.429 7.323 7.39
Average IO rate (mb/sec) 7.837 7.68 7.616 7.71
IO rate standard deviation 1.917 1.43 1.55 1.63
Execution time (sec) 283.681 283.554 288.894 285.38
Read
Throughput (mb/sec) 15.179 16.526 16.663 16.12
Average IO rate (mb/sec) 15.23 17.456 17.96 16.88
IO rate standard deviation 0.899 5.7934 5.753 4.15
Execution time (sec) 148.455 133.574 128.987 137.01
100 GB
Write
Throughput (mb/sec) 6.704 7.621 7.621 7.32
Average IO rate (mb/sec) 6.883 7.557 7.247 7.23
IO rate standard deviation 1.147 1.554 1.512 1.40
Execution time (sec) 2929.379 2666.541 2812.221 2802.71
Read
Throughput (mb/sec) 15.959 15.79 15.79 15.85
Average IO rate (mb/sec) 16.413 15.831 15.831 16.03
IO rate standard deviation 0.717 0.818 0.724 0.75
Execution time (sec) 1316.845 1486.787 1438.554 1414.06
Number of KVM VMs = 4
Dataset Size Operation Criteria Test1 Test2 Test3 Mean
100 MB
Write
Throughput (mb/sec) 4.842 5.13 4.721 4.90
Average IO rate (mb/sec) 8.131 12.451 12.213 10.93
IO rate standard deviation 5.865 13.164 16.688 11.91
Execution time (sec) 14.37 15.692 15.473 15.18
Read
Throughput (mb/sec) 95.419 77.042 96.061 89.51
Average IO rate (mb/sec) 100.145 84.183 97.716 94.01
IO rate standard deviation 21.111 23.277 12.857 19.08
Execution time (sec) 14.387 14.364 15.534 14.76
1 GB
Write
Throughput (mb/sec) 5.825 5.557 5.677 5.69
Average IO rate (mb/sec) 7.556 7.236 7.601 7.46
IO rate standard deviation 5.199 4.868 5.323 5.13
Execution time (sec) 40.198 38.079 40.489 39.59
Read
Throughput (mb/sec) 26.314 33.061 23.697 27.69
Average IO rate (mb/sec) 45.562 52.421 31.314 43.10
IO rate standard deviation 42.962 41.111 18.684 34.25
Execution time (sec) 19.474 15.461 19.475 18.14
10 GB
Write
Throughput (mb/sec) 5.817 5.188 5.182 5.40
Average IO rate (mb/sec) 7.263 6.567 6.457 6.76
IO rate standard deviation 4.535 4.212 3.955 4.23
Execution time (sec) 270.133 296.114 301.458 289.24
Read
Throughput (mb/sec) 14.008 11.722 11.517 12.42
Average IO rate (mb/sec) 15.293 12.759 18.052 15.37
IO rate standard deviation 4.447 3.54 16.12 8.04
Execution time (sec) 118.331 144.184 130.603 131.04
100 GB
Write
Throughput (mb/sec) 5.149 5.339 5.259 5.25
Average IO rate (mb/sec) 6.361 6.886 6.886 6.71
IO rate standard deviation 3.833 4.625 4.78 4.41
Execution time (sec) 2778.663 2780.868 2824.785 2794.77
Read
Throughput (mb/sec) 11.655 11.181 11.269 11.37
Average IO rate (mb/sec) 12.193 12.002 11.724 11.97
IO rate standard deviation 2.891 3.319 2.549 2.92
Execution time (sec) 1369.266 1318.89 1520.755 1402.97
Number of KVM VMs = 5
Dataset Size Operation Criteria Test1 Test2 Test3 Mean
100 MB
Write
Throughput (mb/sec) 5.796 4.807 2.949 4.52
Average IO rate (mb/sec) 6.55 5.447 3.682 5.23
IO rate standard deviation 2.342 2.171 2.446 2.32
Execution time (sec) 14.444 14.696 14.398 14.51
Read
Throughput (mb/sec) 42.481 54.171 54.083 50.25
Average IO rate (mb/sec) 52.311 65.455 63.057 60.27
IO rate standard deviation 21.799 23.039 20.053 21.63
Execution time (sec) 14.39 14.466 14.534 14.46
1 GB
Write
Throughput (mb/sec) 3.962 2.168 2.552 2.89
Average IO rate (mb/sec) 4.422 2.215 2.626 3.09
IO rate standard deviation 1.699 0.375 0.527 0.87
Execution time (sec) 42.716 37.287 37.65 39.22
Read
Throughput (mb/sec) 4.883 7.708 5.135 5.91
Average IO rate (mb/sec) 6.698 9.251 5.884 7.28
IO rate standard deviation 4.412 4.452 2.42 3.76
Execution time (sec) 18.364 17.669 18.061 18.03
10 GB
Write
Throughput (mb/sec) 3.369 3.495 3.421 3.43
Average IO rate (mb/sec) 3.374 3.497 3.294 3.39
IO rate standard deviation 0.123 0.081 0.057 0.09
Execution time (sec) 262.581 287.531 291.531 280.55
Read
Throughput (mb/sec) 8.792 7.17 8.27 8.08
Average IO rate (mb/sec) 8.558 7.3 7.211 7.69
IO rate standard deviation 1.058 0.906 0.906 0.96
Execution time (sec) 128.409 125.356 133.347 129.04
100 GB
Write
Throughput (mb/sec) 5.149 6.847 5.2 5.73
Average IO rate (mb/sec) 6.361 6.121 5.677 6.05
IO rate standard deviation 4.811 5.255 5.75 5.27
Execution time (sec) 2679.211 2850.744 2824.512 2784.82
Read
Throughput (mb/sec) 11.655 11.181 11.269 11.37
Average IO rate (mb/sec) 12.193 12.002 11.724 11.97
IO rate standard deviation 2.891 3.319 2.549 2.92
Execution time (sec) 1475.121 1214.15 1420.575 1369.95
Number of KVM VMs = 6
Dataset Size Operation Criteria Test1 Test2 Test3 Mean
100 MB
Write
Throughput (mb/sec) 5.054 4.318 3.31 4.23
Average IO rate (mb/sec) 6.463 4.769 3.984 5.07
IO rate standard deviation 2.986 2.163 2.446 2.53
Execution time (sec) 14.474 14.44 14.42 14.44
Read
Throughput (mb/sec) 62.035 56.085 24.337 47.49
Average IO rate (mb/sec) 69.145 65.138 62.606 65.63
IO rate standard deviation 22.529 23.831 38.026 28.13
Execution time (sec) 14.468 14.441 15.151 14.69
1 GB
Write
Throughput (mb/sec) 3.089 3.155 2.982 3.08
Average IO rate (mb/sec) 3.369 3.239 3.262 3.29
IO rate standard deviation 1.13 0.595 1.115 0.95
Execution time (sec) 55.472 51.491 57.861 54.94
Read
Throughput (mb/sec) 7.809 7.593 9.488 8.30
Average IO rate (mb/sec) 8.239 20.751 29.577 19.52
IO rate standard deviation 1.988 34.499 36.679 24.39
Execution time (sec) 34.23 34.396 31.08 33.24
10 GB
Write
Throughput (mb/sec) 1.138 0.393 0.862 0.80
Average IO rate (mb/sec) 1.326 0.393 0.875 0.86
IO rate standard deviation 0.105 0.015 0.112 0.08
Execution time (sec) 310.523 372.186 359.615 347.44
Read
Throughput (mb/sec) 0.881 1.437 1.568 1.30
Average IO rate (mb/sec) 3.091 1.639 1.721 2.15
IO rate standard deviation 5.442 0.666 0.645 2.25
Execution time (sec) 144.278 115.98 155.58 138.61
100 GB
Write
Throughput (mb/sec) 2.597 2.898 2.581 2.69
Average IO rate (mb/sec) 2.516 2.606 2.625 2.58
IO rate standard deviation 0.155 0.157 0.21 0.17
Execution time (sec) 4130.984 4322.184 4179.124 4210.76
Read
Throughput (mb/sec) 4.365 4.125 4.335 4.28
Average IO rate (mb/sec) 4.744 4.994 3.951 4.56
IO rate standard deviation 1.235 1.352 1.228 1.27
Execution time (sec) 3115.787 3411.599 3954.8 3494.06
Number of KVM VMs = 7
Dataset Size Operation Criteria Test1 Test2 Test3 Mean
100 MB
Write
Throughput (mb/sec) 2.81 2.419 2.604 2.61
Average IO rate (mb/sec) 3.285 2.562 2.68 2.84
IO rate standard deviation 1.731 0.719 0.477 0.98
Execution time (sec) 16.788 17.535 19.524 17.95
Read
Throughput (mb/sec) 36.311 39.541 40.404 38.75
Average IO rate (mb/sec) 42.668 52.131 52.211 49.00
IO rate standard deviation 16.107 22.127 24.376 20.87
Execution time (sec) 15.563 15.498 15.573 15.54
1 GB
Write
Throughput (mb/sec) 2.027 2.263 2.234 2.17
Average IO rate (mb/sec) 2.072 2.342 2.273 2.23
IO rate standard deviation 0.357 0.484 2.273 1.04
Execution time (sec) 69.088 66.069 70.888 68.68
Read
Throughput (mb/sec) 9.77 24.537 11.26 15.19
Average IO rate (mb/sec) 20.969 28.656 25.211 24.95
IO rate standard deviation 26.933 13.725 23.079 21.25
Execution time (sec) 32.326 38.813 35.21 35.45
10 GB
Write
Throughput (mb/sec) 3.052 3.505 3.727 3.43
Average IO rate (mb/sec) 3.073 3.519 3.239 3.28
IO rate standard deviation 0.267 0.226 0.216 0.24
Execution time (sec) 390.563 318.818 301.551 336.98
Read
Throughput (mb/sec) 7.934 8.232 10.567 8.91
Average IO rate (mb/sec) 18.501 8.33 11.837 12.89
IO rate standard deviation 6.414 0.9 4.159 3.82
Execution time (sec) 167.311 153.892 163.909 161.70
100 GB
Write
Throughput (mb/sec) 2.214 2.632 2.325 2.39
Average IO rate (mb/sec) 2.421 2.412 2.514 2.45
IO rate standard deviation 0.195 0.157 0.21 0.19
Execution time (sec) 8303.277 8990.14 8776.16 8689.86
Read
Throughput (mb/sec) 3.218 3.625 5.024 3.96
Average IO rate (mb/sec) 4.421 5.114 2.125 3.89
IO rate standard deviation 1.521 1.235 1.095 1.28
Execution time (sec) 7820.6253 7573.74 9175.136 8189.84
Number of KVM VMs = 8
Dataset Size Operation Criteria Test1 Test2 Test3 Mean
100 MB
Write
Throughput (mb/sec) 7.45965 2.58245 3.80997 4.62
Average IO rate (mb/sec) 7.00618 3.8086 4.14562 4.99
IO rate standard deviation 2.206 2.365 2.323 2.30
Execution time (sec) 26.14782 43.34817 29.21525 32.90
Read
Throughput (mb/sec) 49.132 46.789 52.311 49.41
Average IO rate (mb/sec) 34.014 43.115 66.94 48.02
IO rate standard deviation 13.245 13.154 14.774 13.72
Execution time (sec) 24.96277 31.83195 31.4689 29.42
1 GB
Write
Throughput (mb/sec) 2.002 2.365 2.004 2.12
Average IO rate (mb/sec) 2.105 2.211 2.106 2.14
IO rate standard deviation 0.311 0.12 2.185 0.87
Execution time (sec) 93.2688 83.90763 150.2826 109.15
Read
Throughput (mb/sec) 9.02 11.417 11.352 10.60
Average IO rate (mb/sec) 20.123 19.296 20.011 19.81
IO rate standard deviation 20.923 18.665 23.001 20.86
Execution time (sec) 38.7912 22.5756 77.462 46.28
10 GB
Write
Throughput (mb/sec) 3.052 3.505 3.727 3.43
Average IO rate (mb/sec) 3.009 3.157 3.562 3.24
IO rate standard deviation 0.213 0.215 0.2 0.21
Execution time (sec) 515.5432 420.8398 729.7534 555.38
Read
Throughput (mb/sec) 7.934 8.232 10.567 8.91
Average IO rate (mb/sec) 18.501 8.33 11.837 12.89
IO rate standard deviation 5.621 6.211 1.529 4.45
Execution time (sec) 830.37 249.305 368.7953 482.82
100 GB
Write
Throughput (mb/sec)
Average IO rate (mb/sec)
IO rate standard deviation
Execution time (sec)
Read
Throughput (mb/sec)
Average IO rate (mb/sec)
IO rate standard deviation
Execution time (sec)
3. Hadoop Virtualized Cluster - VMware ESXi
Number of VMware ESXi VMs = 3
Dataset Size Operation Criteria Test1 Test2 Test3 Mean
100 MB
Write
Throughput (mb/sec) 1.534 4.382 5.396 3.77
Average IO rate (mb/sec) 5.854 6.476 8.613 6.98
IO rate standard deviation 5.995 4.094 4.688 4.93
Execution time (sec) 32.586 39.459 33.961 35.34
Throughput (mb/sec) 15.813 15.489 11.664 14.32
Read
Average IO rate (mb/sec) 36.691 40.070 35.419 37.39
IO rate standard deviation 17.121 18.054 17.575 17.58
Execution time (sec) 27.836 29.189 31.256 29.43
1 GB
Write
Throughput (mb/sec) 2.796 2.843 2.267 2.64
Average IO rate (mb/sec) 3.25 3.036 2.63 2.97
IO rate standard deviation 1.284 0.661 0.925 0.96
Execution time (sec) 98.707 99.748 105.382 101.28
Read
Throughput (mb/sec) 14.873 16.918 15.707 15.83
Average IO rate (mb/sec) 17.528 18.735 17.787 18.02
IO rate standard deviation 5.826 5.519 5.377 5.57
Execution time (sec) 45.231 45.825 44.245 45.10
10 GB
Write
Throughput (mb/sec) 16.154 17.254 16.259 16.56
Average IO rate (mb/sec) 29.400 29.484 28.354 29.08
IO rate standard deviation 0.002 0.003 0.03 0.01
Execution time (sec) 477.380 467.431 467.431 470.75
Read
Throughput (mb/sec) 17.214 16.213 16.254 16.56
Average IO rate (mb/sec) 90.557 87.254 90.264 89.36
IO rate standard deviation 0.0219 0.0211 0.003 0.02
Execution time (sec) 138.808 153.864 162.121 151.60
100 GB
Write
Throughput (mb/sec) 8.215 8.255 6.923 7.80
Average IO rate (mb/sec) 6.874 6.254 7.257 6.80
IO rate standard deviation 0.952 0.961 1.021 0.98
Execution time (sec) 4630.131 4766.423 4621.21 4672.59
Read
Throughput (mb/sec) 12.214 12.214 12.214 12.21
Average IO rate (mb/sec) 15.24 15.24 15.24 15.24
IO rate standard deviation 2.721 2.745 2.847 2.77
Execution time (sec) 1621.001 1569.541 1642.21 1610.92
Number of VMware ESXi VMs = 4
Dataset Size Operation Criteria Test1 Test2 Test3 Mean
100 MB
Write
Throughput (mb/sec) 3.621 3.256 4 3.63
Average IO rate (mb/sec) 6.094 6.509 5.994 6.20
IO rate standard deviation 3.863 4.442 4.69 4.33
Execution time (sec) 32.953 38.188 33.315 34.82
Read
Throughput (mb/sec) 22.207 13.139 11.465 15.60
Average IO rate (mb/sec) 35.199 33.706 30.311 33.07
IO rate standard deviation 11.124 17.761 22.259 17.05
Execution time (sec) 27.896 29.868 29.815 29.19
1 GB
Write
Throughput (mb/sec) 2.652 3.716 4.593 3.65
Average IO rate (mb/sec) 2.808 4.021 5.233 4.02
IO rate standard deviation 0.718 1.157 2.367 1.41
Execution time (sec) 91.877 90.642 82.769 88.43
Read
Throughput (mb/sec) 18.332 19.887 12.121 16.78
Average IO rate (mb/sec) 24.537 37.693 21.528 27.92
IO rate standard deviation 17.041 40.033 20.724 25.93
Execution time (sec) 43.546 43.877 41.49 42.97
10 GB
Write
Throughput (mb/sec) 16.211 16.001 15.251 15.82
Average IO rate (mb/sec) 24.756 29.481 25.328 26.52
IO rate standard deviation 0.004 0.002 0.002 0.00
Execution time (sec) 474.891 457.717 415.126 449.24
Read
Throughput (mb/sec) 13.254 12.354 16.321 13.98
Average IO rate (mb/sec) 22.644 21.14 23.214 22.33
IO rate standard deviation 0.001 0.014 0.003 0.01
Execution time (sec) 151.35 120.893 139.212 137.15
100 GB
Write
Throughput (mb/sec) 4.215 4.101 4.259 4.19
Average IO rate (mb/sec) 6.214 5.214 6.254 5.89
IO rate standard deviation 0.617 1.002 0.658 0.76
Execution time (sec) 4384.964 4514.001 3913.98 4270.98
Read
Throughput (mb/sec) 12.214 13.12 12.542 12.63
Average IO rate (mb/sec) 16.241 15.214 15.24 15.57
IO rate standard deviation 2.125 2.155 2.314 2.20
Execution time (sec) 1573.197 1203.144 1503.98 1426.78
Number of VMware ESXi VMs = 5
Dataset Size Operation Criteria Test1 Test2 Test3 Mean
100 MB
Write
Throughput (mb/sec) 6.787 5.783 5.797 6.12
Average IO rate (mb/sec) 7.487 6.311 6.085 6.63
IO rate standard deviation 2.357 1.683 1.291 1.78
Execution time (sec) 21.832 19.816 22.687 21.45
Read
Throughput (mb/sec) 32.553 30.599 31.699 31.62
Average IO rate (mb/sec) 33.458 33.203 19.672 28.78
IO rate standard deviation 9.214 8.962 9.374 9.18
Execution time (sec) 20.689 17.884 18.865 19.15
1 GB
Write
Throughput (mb/sec) 1.926 2.032 2.497 2.15
Average IO rate (mb/sec) 2.031 2.133 2.648 2.27
IO rate standard deviation 0.526 0.569 0.793 0.63
Execution time (sec) 76.754 79.927 68.664 75.12
Read
Throughput (mb/sec) 18.093 9.005 14.98 14.03
Average IO rate (mb/sec) 23.024 9.845 17.178 16.68
IO rate standard deviation 11.61 3.008 7.886 7.50
Execution time (sec) 26.606 34.991 32.247 31.28
10 GB
Write
Throughput (mb/sec) 3.065 3.131 3.03 3.08
Average IO rate (mb/sec) 3.079 3.143 3.048 3.09
IO rate standard deviation 0.217 0.203 0.238 0.22
Execution time (sec) 421.213 417.535 427.938 422.23
Read
Throughput (mb/sec) 10.624 10.144 10.566 10.44
Average IO rate (mb/sec) 10.701 10.24 10.709 10.50
IO rate standard deviation 0.912 1.023 1.236 1.06
Execution time (sec) 124.088 137.512 131.46 131.02
100 GB
Write
Throughput (mb/sec) 3.202 3.147 3.144 3.16
Average IO rate (mb/sec) 3.298 3.182 3.249 3.24
IO rate standard deviation 0.617 0.335 0.778 0.58
Execution time (sec) 3584.964 3607.653 3595.375 3596.00
Read
Throughput (mb/sec) 11.951 11.709 12.024 11.89
Average IO rate (mb/sec) 12.24 11.877 2.255 8.79
IO rate standard deviation 2.372 2.201 1.667 2.08
Execution time (sec) 1163.197 1085.013 1054.679 1100.96
Number of VMware ESXi VMs = 6
Dataset Size Operation Criteria Test1 Test2 Test3 Mean
100 MB
Write
Throughput (mb/sec) 2.528 4.331 1.161 2.67
Average IO rate (mb/sec) 2.767 4.794 1.924 3.16
IO rate standard deviation 0.803 1.448 1.256 1.17
Execution time (sec) 30.602 23.713 30.648 28.32
Read
Throughput (mb/sec) 18.91 24.085 20.934 21.31
Average IO rate (mb/sec) 23.593 29.105 30.76 27.82
IO rate standard deviation 10.874 12.232 13.055 12.05
Execution time (sec) 20.554 18.235 18.076 18.96
1 GB
Write
Throughput (mb/sec) 3.035 1.663 1.578 2.09
Average IO rate (mb/sec) 4.024 1.683 1.655 2.45
IO rate standard deviation 2.173 0.185 0.389 0.92
Execution time (sec) 58.777 86.469 75.434 73.56
Read
Throughput (mb/sec) 3.201 7.975 8.673 6.62
Average IO rate (mb/sec) 5.086 7.729 9.918 7.58
IO rate standard deviation 4.142 1.629 3.371 3.05
Execution time (sec) 32.292 44.313 44.413 40.34
10 GB
Write
Throughput (mb/sec) 3.132 3.012 2.693 2.95
Average IO rate (mb/sec) 3.163 3.053 2.718 2.98
IO rate standard deviation 0.329 0.366 0.261 0.32
Execution time (sec) 375.74 408.223 462.919 415.63
Read
Throughput (mb/sec) 8.422 9.66 9.21 9.10
Average IO rate (mb/sec) 8.489 9.26 9.32 9.02
IO rate standard deviation 0.774 2.301 1.254 1.44
Execution time (sec) 122.793 125.499 133.985 127.43
100 GB
Write
Throughput (mb/sec) 27.459 26.086 26.888 26.81
Average IO rate (mb/sec) 27.459 26.086 26.888 26.81
IO rate standard deviation 7.555 0.005 0.002 2.52
Execution time (sec) 3669.984 3881.374 3752.234 3767.86
Read
Throughput (mb/sec) 92.256 98.351 96.926 95.84
Average IO rate (mb/sec) 92.256 98.351 96.926 95.84
IO rate standard deviation 0.009 0.0156 0.017 0.01
Execution time (sec) 1106.825 1042.597 1049.995 1066.47
Number of VMware ESXi VMs = 7
Dataset Size Operation Criteria Test1 Test2 Test3 Mean
100 MB
Write
Throughput (mb/sec) 16.815 16.87 22.75 18.81
Average IO rate (mb/sec) 16.835 15.22 22.18 18.08
IO rate standard deviation 0.021 0.003 0.005 0.01
Execution time (sec) 23.757 21.82 20.727 22.10
Read
Throughput (mb/sec) 112.524 137.741 124.069 124.78
Average IO rate (mb/sec) 103.235 135.542 131.096 123.29
IO rate standard deviation 0.019 0.027 0.015 0.02
Execution time (sec) 18.002 16.993 16.189 17.06
1 GB
Write
Throughput (mb/sec) 21.989 22.135 18.241 20.79
Average IO rate (mb/sec) 21.989 22.187 18.215 20.80
IO rate standard deviation 0.003 0.004 1.526 0.51
Execution time (sec) 66.387 69.104 78.357 71.28
Read
Throughput (mb/sec) 66.366 72.849 76.694 71.97
Average IO rate (mb/sec) 66.366 62.325 79.241 69.31
IO rate standard deviation 0.011 0.007 0.012 0.01
Execution time (sec) 44.061 31.754 42.002 39.27
10 GB
Write
Throughput (mb/sec) 25.951 21.215 22.587 23.25
Average IO rate (mb/sec) 23.465 26.124 25.638 25.08
IO rate standard deviation 0.005 0.003 0.005 0.00
Execution time (sec) 412.5 400.975 417.404 410.29
Read
Throughput (mb/sec) 92.125 86.671 80.214 86.34
Average IO rate (mb/sec) 98.851 97.256 92.541 96.22
IO rate standard deviation 0.006 0.004 0.004 0.005
Execution time (sec) 121.16 132.544 132.544 128.75
100 GB
Write
Throughput (mb/sec) 19.274 25.261 23.574 22.70
Average IO rate (mb/sec) 26.332 28.315 27.036 27.23
IO rate standard deviation 0.002 0.005 0.004 0.004
Execution time (sec) 3826.909 3645.909 3727.394 3733.40
Read
Throughput (mb/sec) 45.215 55.547 65.963 55.58
Average IO rate (mb/sec) 95.214 94.686 84.254 91.38
IO rate standard deviation 0.019 0.023 0.014 0.02
Execution time (sec) 1074.606 994.225 980.919 1016.58
Number of VMware ESXi VMs = 8
Dataset Size Operation Criteria Test1 Test2 Test3 Mean
100 MB
Write
Throughput (mb/sec) 6.352 6.214 5.214 5.93
Average IO rate (mb/sec) 8.359 16.072 9.325 11.25
IO rate standard deviation 0.001 0.035 0.001 0.01
Execution time (sec) 42.097 22.322 34.282 32.90
Read
Throughput (mb/sec) 69.215 82.254 68.325 73.26
Average IO rate (mb/sec) 93.721 146.511 95.328 111.85
IO rate standard deviation 0.018 0.02 0.014 0.02
Execution time (sec) 27.748 27.962 26.957 27.56
1 GB
Write
Throughput (mb/sec) 7.652 6.521 6.241 6.80
Average IO rate (mb/sec) 13.067 16.873 17.89 15.94
IO rate standard deviation 0.003 4.231 0.002 1.41
Execution time (sec) 94.415 94.711 79.465 89.53
Read
Throughput (mb/sec) 36.678 60.214 62.124 53.01
Average IO rate (mb/sec) 28.352 99.265 78.019 68.55
IO rate standard deviation 0.003 0.021 0.006 0.01
Execution time (sec) 55.273 30.137 62.601 49.34
10 GB
Write
Throughput (mb/sec) 18.124 19.254 18.625 18.67
Average IO rate (mb/sec) 26.557 24.477 24.955 25.33
IO rate standard deviation 0.004 0.004 0.004 0.00
Execution time (sec) 400.564 438.626 432.917 424.04
Read
Throughput (mb/sec) 89.268 119.348 78.019 95.55
Average IO rate (mb/sec) 69.361 80.541 59.013 69.64
IO rate standard deviation 0.008 0.007 0.007 0.0073
Execution time (sec) 130.975 101.048 171.601 134.54
100 GB
Write
Throughput (mb/sec) 18.214 19.421 19.566 19.07
Average IO rate (mb/sec) 27.006 24.451 25.324 25.59
IO rate standard deviation 0.001 0.002 0.002 0.00
Execution time (sec) 3737.64 4138.379 3981.254 3952.43
Read
Throughput (mb/sec) 93.514 90.291 91.157 91.65
Average IO rate (mb/sec) 68.325 78.245 65.247 70.61
IO rate standard deviation 0.143 0.012 0.102 0.09
Execution time (sec) 1090.37 1130.456 1105.645 1108.82