The Impact of Virtualization on High Performance
Computing Clustering in the Cloud
Master Thesis Report
Submitted in
Fall 2013
In partial fulfillment of the requirements for the degree of
Master of Science in Software Engineering at the School of
Science and Engineering of Al Akhawayn University in Ifrane
By
Ouidad ACHAHBAR
Supervised by
Dr. Mohamed Riduan ABID
Ifrane, Morocco
January, 2014
Acknowledgment
I would like to express my deepest and sincere gratitude to ALLAH for giving me guidance
and strength to complete this work, and for having the chance to study and accomplish my
master degree with high support from my family, friends and professors. Thank you ALLAH.
I would also like to deeply thank my supervisor Dr. Abid for trusting me to conduct this
research, providing me with valuable feedback and overseeing my progress on a weekly basis.
Thank you Dr. Abid for your motivation and support.
My gratitude also goes to Dr. Haitouf who provided me with valuable comments and shared
with me his knowledge in cloud computing and distributed systems. Thank you Dr. Haitouf.
I am most thankful to my dear parents, brothers, sisters, nephews and fiancé for their
continuous support, encouragement and love. There are no words to express my gratitude to
all of you.
Many thanks go to my very close friends: Nora El Bakraoui Alaoui, Inssaf El Boukari, Sara
El Alaoui, Aida Tahditi, Jamila Barroug, Wafa Bouya and Chahrazad Touzani. Thank you for
being always by my side; thank you for sharing enjoyable moments with me, and thank you
for being my friends.
Last but not least, special acknowledgements go to all my professors for their support, respect
and encouragement. Thank you Ms. Hanaa Talei, Ms. Asmaa Mourhir, Dr. Naeem Nizar
Sheikh, Mr. Omar Iraqui, Dr. Violetta Cavalli Sforza, Dr. Kevin Smith and Dr. Harroud.
Ouidad Achahbar
Abstract
The ongoing pervasiveness of Internet access is greatly increasing big data production. This,
in turn, increases the demand for compute power to process the massive data, thus rendering
High Performance Computing (HPC) a highly solicited service.
Based on the paradigm of providing computing as a utility, the cloud offers user-friendly
infrastructures for processing this big data, e.g., High Performance Computing as a Service
(HPCaaS). Still, HPCaaS performance is tightly coupled with the underlying virtualization
technique, since the latter controls the creation of the virtual machine instances that carry out
data processing jobs.
In this thesis, we characterize and evaluate the impact of machine virtualization on HPCaaS.
We track HPC performance under different cloud virtualization platforms, namely KVM and
VMware ESXi, and compare it to the performance of a physical computing cluster
infrastructure. The virtualized environment is deployed using Hadoop on top of OpenStack.
The resulting HPCaaS runs MapReduce algorithms on benchmarked big data samples using a
granularity of 8 physical machines per cluster.
We obtained several interesting results when running the selected benchmarks on the
virtualized and physical clusters. Each tested cluster exhibited different performance trends.
Yet, the overall analysis of the research findings shows that the selection of virtualization
technology can lead to significant improvements when running and handling HPCaaS.
ملخص
(English translation of the Arabic abstract:)
The ongoing pervasiveness of Internet use is a major cause of the growing production of big
data. This, in turn, increases the demand for high computational capabilities to process this
data. These trends have made High Performance Computing as a Service a compelling offering.
Based on the paradigm of providing computing as a utility, cloud computing offers flexible,
easy-to-use infrastructures for processing big data, for example High Performance Computing
as a Service (HPCaaS). However, the performance of the latter is tightly coupled with the
virtualization technology, since it controls the creation of the virtual machines that carry out
the data processing jobs.
In this thesis, we characterize and evaluate the impact of virtualization on HPCaaS. We also
track the performance of HPC on different cloud virtualization platforms and on a physical
computing cluster of eight machines. We used OpenStack to build the HPCaaS and Hadoop to
run MapReduce algorithms on big data.
Through the results of this research, we observed significant variation in the performance of
HPC as the data size, the type of computing infrastructure (physical or virtualized), and the
cluster size change. Nevertheless, our conclusion establishes that the virtualization technology
plays an important and considerable role in improving HPC performance.
Table of Contents
Acknowledgment 2
Abstract 3
ملخص 4
Table of Contents 5
List of Figures 7
List of Tables 9
List of Appendices 10
List of Acronyms 11
PART I: THESIS OVERVIEW 12
Chapter 1: Introduction 13
1.1. Background 13
1.2. Motivation 14
1.3. Problem Statement 15
1.4. Research Question 15
1.5. Research Objective 15
1.6. Research Approach 15
1.7. Thesis Organization 16
PART II: THEORETICAL BASELINES 17
Chapter 2: Cloud Computing 18
2.1. Cloud Computing Definition 18
2.2. Cloud Computing Characteristics 19
2.3. Cloud Computing Service Models 20
2.4. Cloud Computing Deployment Models 21
2.5. Cloud Computing Benefits 22
2.6. Cloud Computing Providers 23
Chapter 3: Virtualization 24
3.1. Definition of Virtualization 24
3.2. History of Virtualization 25
3.3. Benefits of Virtualization 25
3.4. Virtualization Approaches 26
3.5. Virtual Machine Manager 28
Chapter 4: Big Data and High Performance Computing as a Service 32
4.1. Big Data 32
4.2. High Performance Computing as a Service (HPCaaS) 33
Chapter 5: Literature Review and Research Contribution 35
5.1. Related Work 35
5.2. Contribution 36
PART III: TECHNOLOGY ENABLERS 37
Chapter 6: Technology Enablers Selection 38
6.1. Cloud Platform Selection 38
6.2. Distributed and Parallel System Selection 40
Chapter 7: Openstack 42
7.1. OpenStack Overview 42
7.2. OpenStack History 42
7.3. OpenStack Components 43
7.4. OpenStack Supported Hypervisors 49
Chapter 8: Hadoop 50
8.1. Hadoop Overview 50
8.2. Hadoop History 50
8.3. Hadoop Architecture 51
8.4. Hadoop Implementation 52
8.5. Hadoop Cluster Connectivity 55
PART IV: RESEARCH CONTRIBUTION 57
Chapter 9: Research Methodology 58
9.1. Research Approach 58
9.2. Research Steps 58
Chapter 10: Experimental Setup 59
10.1. Experimental Hardware 59
10.2. Experimental Software and Network 60
10.3. Clusters Architecture 60
10.4. Experimental Performance Benchmarks 64
10.5. Experimental Datasets Size 65
10.6. Experiment Execution 66
Chapter 11: Experimental Results 67
11.1. Hadoop Physical Cluster Results 67
11.2. Hadoop Virtualized Cluster - KVM Results 72
11.3. Hadoop Virtualized Cluster - VMware ESXi Results 77
11.4. Results Comparison 82
Chapter 12: Discussion 88
12.1. TeraSort 88
12.2. TestDFSIO 90
12.3. Conclusion 91
PART V: CONCLUSION 92
Chapter 13 93
Conclusion and Future Work 93
Bibliography 94
Appendix A: OpenStack with KVM Configuration 100
Appendix B. OpenStack with VMware ESXi Configuration 127
Appendix C: Hadoop Configuration 131
Appendix D: TeraSort and TestDFSIO Execution 145
Appendix E: Data Gathering for TeraSort 147
Appendix F: Data Gathering for TestDFSIO 153
List of Figures
Figure 1: Thesis organization ................................................................................................................ 16
Figure 2: NIST visual model of cloud computing definition ................................................................ 19
Figure 3: services provided in cloud computing environment .............................................................. 21
Figure 4: Full virtualization architecture .............................................................................................. 26
Figure 5: Paravirtualization architecture .............................................................................................. 27
Figure 6: Hardware assisted virtualization architecture ....................................................................... 28
Figure 7: a) Type 1 hypervisor b) Type 2 hypervisor ........................................................................... 29
Figure 8: Xen hypervisor architecture ................................................................................................... 30
Figure 9: KVM hypervisor architecture ................................................................................................ 31
Figure 10: VMware ESXi architecture ................................................................................................. 31
Figure 11: Data growth over 2008 and 2020 ........................................................................................ 32
Figure 12: Active cloud community population .................................................................................... 38
Figure 13: Active distributed systems population ................................................................................. 40
Figure 14: OpenStack conceptual architecture ..................................................................................... 44
Figure 15: Nova subcomponents ........................................................................................................... 44
Figure 16: Glance subcomponents ........................................................................................................ 46
Figure 17: Keystone subcomponents ..................................................................................................... 46
Figure 18: Swift subcomponents ........................................................................................................... 47
Figure 19: Cinder subcomponents ......................................................................................................... 48
Figure 20: Quantum subcomponents ..................................................................................................... 48
Figure 21: Apache Hadoop subprojects ............................................................................................... 51
Figure 22: Hadoop Architecture ............................................................................................................ 52
Figure 23: HDFS and MapReduce representation ................................................................................. 53
Figure 24: Word count MapReduce example ....................................................................................... 55
Figure 25 : Research steps ..................................................................................................................... 58
Figure 26 : Hadoop Physical Cluster ..................................................................................................... 61
Figure 27: Hadoop Physical Cluster architecture .................................................................................. 61
Figure 28: Hadoop virtualized cluster - KVM ...................................................................................... 62
Figure 29: Hadoop virtualized cluster – VMware ESXi (a) .................................................................. 63
Figure 30 : Hadoop virtualized cluster – VMware ESXi (b) ................................................................. 64
Figure 31 : Experimental execution ...................................................................................................... 66
Figure 32: TeraSort performance on Hadoop Physical Cluster ............................................................ 67
Figure 33: TeraSort performance for 100 MB on Hadoop Physical Cluster ........................................ 68
Figure 34 : TeraSort performance for 1 GB on Hadoop Physical Cluster............................................. 68
Figure 35: TeraSort performance for 10 GB on Hadoop Physical Cluster........................................... 68
Figure 36 : TeraSort performance for 30 GB on Hadoop Physical Cluster........................................... 68
Figure 37: TestDFSIO-Write performance on Hadoop Physical Cluster .............................................. 69
Figure 38: TestDFSIO-Write performance for 1 GB on Hadoop Physical Cluster ............................ 70
Figure 39 : TestDFSIO-Write performance for 100 MB on Hadoop Physical Cluster ......................... 70
Figure 40: TestDFSIO-Write performance for 10 GB on Hadoop Physical Cluster ........................... 70
Figure 41 : TestDFSIO-Write performance for 100 GB on Hadoop Physical Cluster .......................... 70
Figure 42: TestDFSIO-Read performance on Hadoop Physical Cluster ............................................... 71
Figure 43: TestDFSIO-Read performance for 100 MB on Hadoop Physical Cluster .......................... 71
Figure 44 : TestDFSIO-Read performance for 1 GB on Hadoop Physical Cluster ............................. 71
Figure 45: TestDFSIO-Read performance for 10 GB on Hadoop Physical Cluster ............................. 72
Figure 46 : TestDFSIO-Read performance for 100 GB on Hadoop Physical Cluster .......................... 72
Figure 47: TeraSort performance on Hadoop KVM Cluster ................................................................. 72
Figure 48: TeraSort performance for 100 MB on Hadoop KVM Cluster ............................................ 73
Figure 49 : TeraSort performance for 1 GB on Hadoop KVM Cluster ................................................. 73
Figure 50: TeraSort performance for 10 GB on Hadoop KVM Cluster ................................................ 73
Figure 51 : TeraSort performance for 30 GB on Hadoop KVM Cluster ............................................... 73
Figure 52 : TestDFSIO-Write performance on Hadoop KVM Cluster ................................................. 74
Figure 53: TestDFSIO-Write performance for 100 MB on Hadoop KVM Cluster .............................. 75
Figure 54 : TestDFSIO-Write performance for 1GB on Hadoop KVM Cluster ................................... 75
Figure 55: TestDFSIO-Write performance for 10 GB on Hadoop KVM Cluster ................................. 75
Figure 56 : TestDFSIO-Write performance for 100 GB on Hadoop KVM Cluster .............................. 75
Figure 57: TestDFSIO-Read performance on Hadoop KVM Cluster .................................... 76
Figure 58: TestDFSIO-Read performance for 100 MB on Hadoop KVM Cluster ............................... 76
Figure 59 : TestDFSIO-Read performance for 1GB on Hadoop KVM Cluster .................................... 76
Figure 60: TestDFSIO-Read performance for 10 GB on Hadoop KVM Cluster ................. 77
Figure 61 : TestDFSIO-Read performance for 100 GB on Hadoop KVM Cluster ................ 77
Figure 62 : TeraSort performance on Hadoop VMware ESXi Cluster ................................................. 77
Figure 63: TeraSort performance for 100 MB on Hadoop VMware ESXi Cluster ............................. 78
Figure 64 : TeraSort performance for 1GB on Hadoop VMware ESXi Cluster ................................... 78
Figure 65: TeraSort performance for 10 GB on Hadoop VMware ESXi Cluster ............................... 78
Figure 66 : TeraSort performance for 30GB on Hadoop VMware ESXi Cluster ................................. 78
Figure 67 : TestDFSIO-Write performance on Hadoop VMware ESXi Cluster ................................... 79
Figure 68: TestDFSIO-Write performance for 100 MB on Hadoop VMware ESXi Cluster .............. 80
Figure 69 : TestDFSIO-Write performance for 1GB on Hadoop VMware ESXi Cluster .................... 80
Figure 70: TestDFSIO-Write performance for 10 GB on Hadoop VMware ESXi Cluster ................ 80
Figure 71 : TestDFSIO-Write performance for 100 GB on Hadoop VMware ESXi Cluster ............... 80
Figure 72 : TestDFSIO-Read performance on Hadoop VMware ESXi Cluster ................................... 81
Figure 73: TestDFSIO- Read performance for 100 MB on Hadoop VMware ESXi Cluster .............. 81
Figure 74 : TestDFSIO-Read performance for 1 GB on Hadoop VMware ESXi Cluster .................... 81
Figure 75: TestDFSIO- Read performance for 10 GB on Hadoop VMware ESXi Cluster ................ 82
Figure 76 : TestDFSIO-Read performance for 100 GB on Hadoop VMware ESXi Cluster ................ 82
Figure 77 : Average time for sorting 100 MB on HPhC, HVC with KVM and VMware ESXi ............ 83
Figure 78 : Average time for sorting 1 GB on HPhC, HVC with KVM and VMware ESXi................ 83
Figure 79 : Average time for sorting 10 GB on HPhC, HVC with KVM and VMware ESXi .............. 84
Figure 80 : Average time for sorting 30 GB on HPhC, HVC with KVM and VMware ESXi .............. 84
Figure 81: Average time for writing 1 GB on HPhC, HVC with KVM and VMware ESXi ................ 85
Figure 82 : Average time for writing 10 GB on HPhC, HVC with KVM and VMware ESXi ............. 85
Figure 83: Average time for writing 100 GB on HPhC, HVC with KVM and VMware ESXi ............ 86
Figure 84: Average time for reading 100 MB on HPhC, HVC with KVM and VMware ESXi ........... 86
Figure 85 : Average time for reading 1 GB on HPhC, HVC with KVM and HVC VMware ESXi ....... 86
Figure 86: Average time for reading 10 GB on HPhC, HVC with KVM and HVC VMware ESXi .... 87
Figure 87 : Average time for reading 100 GB on HPhC, HVC with KVM and HVC VMware ESXi 87
Figure 88: Memory overhead when running 30 GB (started at 4.55PM) on 8 VMware ESXi VMs .... 89
Figure 89 : System latency reaches its peak (at 12.28PM) when running 30 GB on 8 VMware ESXi VMs ... 89
Figure 90: OpenStack warning statistics about system resource usage .............................................. 90
List of Tables
Table 1 : A Comparison of cloud deployment models ......................................................................... 22
Table 2 : Cloud IaaS selection ............................................................................................................... 39
Table 3 : Parallel and distributed platform selection ............................................................................. 41
Table 4 : OpenStack releases ................................................................................................................ 43
Table 5 : OpenStack projects ................................................................................................................. 43
Table 6: Apache Hadoop subprojects .................................................................................................... 51
Table 7 : Dell OptiPlex 755 computer features (used for Hadoop physical cluster) ............................. 59
Table 8 : Dell PowerEdge server used for building OpenStack & Hadoop virtualized cluster ............. 60
Table 9 : OpenStack virtual machines’ features .................................................................................... 60
Table 10 : Experimental performance metrics ...................................................................................... 64
Table 11 : Datasets size used for Hadoop benchmarks ......................................................................... 65
Table 12: Average time (in seconds) of running TeraSort on different dataset sizes and different
number of nodes- Hadoop Physical Cluster .......................................................................................... 67
Table 13: Average time (in seconds) of running TestDFSIO-Write on different dataset sizes and
different number of nodes- Hadoop Physical Cluster ........................................................................... 69
Table 14: Average time (in seconds) of running TestDFSIO-Read on different dataset sizes and
different number of nodes- Hadoop Physical Cluster ........................................................................... 71
Table 15: Average time (in seconds) of running TeraSort on different dataset sizes and different
number of nodes- Hadoop KVM Cluster .............................................................................................. 72
Table 16: Average time (in seconds) of running TestDFSIO-Write on different dataset sizes and
different number of nodes- Hadoop KVM Cluster ................................................................................ 74
Table 17 : Average time (in seconds) of running TestDFSIO-Read on different dataset sizes and
different number of nodes- Hadoop KVM Cluster ................................................................ 76
Table 18 : Average time (in seconds) of running TeraSort on different dataset sizes and
different number of nodes- Hadoop VMware ESXi Cluster ................................................................. 77
Table 19 : Average time (in seconds) of running TestDFSIO-Write on different dataset sizes and
different number of nodes- Hadoop VMware ESXi Cluster ................................................................. 79
Table 20 : Average time (in seconds) of running TestDFSIO-Read on different dataset sizes and
different number of nodes- Hadoop VMware ESXi Cluster ................................................................. 81
List of Appendices
Appendix A : OpenStack with KVM Configuration……………………………………………...….100
Appendix B : OpenStack with VMware ESXi Configuration……………………………………….127
Appendix C: Hadoop Configuration………………………………………………….....……………131
Appendix D: TeraSort and TestDFSIO Execution…………………………………….… ………….145
Appendix E: Data Gathering for TeraSort……………………………………………..……………..147
Appendix F: Data Gathering for TestDFSIO…………………………………………………………153
List of Acronyms
HPC High Performance Computing
HPCaaS High Performance Computing as a Service
VM Virtual Machine
VMM Virtual Machine Manager
EMC EMC Corporation (American multinational data storage company)
DCI Digital Communications Inc.
GFS Google File System
HDFS Hadoop Distributed File System
NDFS Nutch Distributed File System
DOE Department of Energy National Laboratories
NIST National Institute of Standards and Technology
SaaS Software as a Service
PaaS Platform as a Service
IaaS Infrastructure as a Service
NoSQL Not Only Structured Query Language
SNIA Storage Networking Industry Association
ACID Atomicity, Consistency, Isolation and Durability
AWS Amazon Web Services
HPhC Hadoop Physical Cluster
HVC Hadoop Virtualized Cluster
SSH Secure Shell
JSON JavaScript Object Notation
XML Extensible Markup Language
API Application Programming Interface
Amazon EC2 Amazon Elastic Compute Cloud
Amazon S3 Amazon Simple Storage Service
VLAN Virtual Local Area Network
DHCP Dynamic Host Configuration Protocol
Part I: Thesis Overview
This part introduces the key points needed to understand the purpose of the present research. It
provides an introduction to the research, covering its background, motivation, problem
statement, research question, objective and research approach.
Chapter 1: Introduction
In this chapter, we first present the background of this research, and then describe the
motivation and the problem behind conducting this study. After that, the questions, objectives,
and methodology of the research are stated. Finally, an outline of the thesis is given at the
end of this chapter.
1.1. Background
During the last decades, the demand for computing power has steadily increased as data
generated from social networks, web pages, sensors, online transactions, etc. is continuously
growing. A study done in 2012 by American Multinational Corporation (EMC), has estimated
that from 2005 to 2020, data will grow by a factor of 300 (from 130 exabytes to 40,000
exabytes), and therefore, digital data will be doubled every two years [1]. The growth of data
constitutes the “Big Data” phenomenon.
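As a quick sanity check on these figures (our own arithmetic, not part of the EMC study), a
factor-of-300 growth over the fifteen years from 2005 to 2020 does imply a roughly two-year
doubling time:
\[
\frac{40{,}000\ \mathrm{EB}}{130\ \mathrm{EB}} \approx 308 \approx 2^{8.3},
\qquad
T_{\mathrm{double}} \approx \frac{15\ \text{years}}{8.3} \approx 1.8\ \text{years}.
\]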
As Big Data grows in terms of volume, velocity and value, current technologies for storing,
processing and analyzing data become inefficient and insufficient. A Gartner survey found
that data growth is considered the largest challenge for organizations [2]. To address this
issue, High Performance Computing (HPC) has become widely integrated in managing and
handling Big Data. In this context, HPC is used to process and analyze Big Data arising in
scientific, engineering and business problems that require high computation capabilities,
high bandwidth, and low latency networks [3].
However, HPC still lacks toolsets that fit the current growth of data. Consequently, new
paradigms and storage tools have been integrated with HPC to deal with the current challenges
of data management. These technologies include providing computing as a utility (cloud
computing) and new parallel and distributed paradigms.
Cloud computing plays an important role as it provides organizations with the ability to
analyze and store data economically and efficiently. Performing HPC in the cloud was
introduced as data started to be migrated to and managed in the cloud. Digital
Communications Inc. (DCI) stated that by 2020, a significant portion of digital data will be
managed in the cloud, and even if a byte in the digital universe is not stored in the cloud, it
will pass, at some point, through the cloud [4]. Performing HPC in the cloud is known as
High Performance Computing as a Service (HPCaaS). In short, HPCaaS offers a high-
performance, on-demand, and scalable HPC environment that can handle the complexity and
challenges related to Big Data [5].
One of the best-known and most widely adopted parallel and distributed systems is the
MapReduce model, which was developed by Google to meet the growing demands of its web
search indexing process [6]. MapReduce computations are performed with the support of a
data storage system known as the Google File System (GFS). The success of both the Google
File System and MapReduce inspired the development of Hadoop, a distributed and parallel
system that implements MapReduce and the Hadoop Distributed File System (HDFS) [7].
Nowadays, Hadoop is widely adopted by big players in the market because of its scalability,
reliability and low cost of implementation. Accordingly, Hadoop has also been proposed for
integration with HPC as an underlying technology that distributes work across an HPC
cluster [8, 9]; a minimal sketch of the programming model is given below.
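To make the programming model concrete, the following minimal word-count sketch
illustrates MapReduce in the style of Hadoop Streaming. It is our own illustration, not code
from the thesis: the mapper emits (word, 1) pairs, the reducer sums the counts per word, and a
local sort stands in for Hadoop's shuffle phase.

    # Minimal MapReduce word-count sketch (illustrative, Hadoop Streaming style).
    import sys
    from itertools import groupby

    def mapper(lines):
        # Map phase: emit a (word, 1) pair for every word in the input.
        for line in lines:
            for word in line.split():
                yield word, 1

    def reducer(pairs):
        # Reduce phase: pairs arrive grouped by key; sum the counts per word.
        for word, group in groupby(pairs, key=lambda kv: kv[0]):
            yield word, sum(count for _, count in group)

    if __name__ == "__main__":
        # In Hadoop, the framework sorts and shuffles mapper output across the
        # cluster; here a local sort plays that role.
        pairs = sorted(mapper(sys.stdin))
        for word, total in reducer(pairs):
            print(f"{word}\t{total}")

In a real Hadoop job, many mapper and reducer instances of this kind run in parallel across
the cluster, with HDFS supplying the input splits.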
1.2. Motivation
Many solutions have been proposed and developed to improve the computation performance
of Big Data. Some of them aim to improve algorithm efficiency, provide new distributed
paradigms, or develop powerful clustering environments. However, few of these solutions
address the whole picture of integrating HPC with the current emerging technologies in terms
of storage and processing.
As stated before, some of the most popular technologies currently used in hosting and
processing Big Data are cloud computing, HDFS and Hadoop MapReduce [10]. At present,
the use of HPC in cloud computing is still limited. A first step towards this research was
taken by the Department of Energy National Laboratories (DOE), which started exploring the
use of cloud services for scientific computing [11]. Besides, in 2009, Yahoo Inc. launched a
partnership with top universities in the United States to conduct more research on cloud
computing, distributed systems and high performance computing applications.
HPCaaS still needs more investigation to determine appropriate environments that can fit high
computing requirements. One aspect of HPCaaS that has not yet been investigated is the impact
of different virtualization technologies on HPC in the cloud. Therefore, the motivation of this
research is the need to evaluate HPCaaS performance using MapReduce under different
virtualization techniques. This motivation is reinforced by a strong rationale: the free
availability of open-source MapReduce and cloud computing implementations.
1.3. Problem Statement
Cloud computing offers a set of services for processing Big Data; one of these services is
HPCaaS. Still, HPCaaS performance is highly affected by the underlying virtualization
techniques, which are considered the heart of cloud computing. Accordingly, the problem
addressed in this research is formulated as follows: “HPCaaS still faces poor performance
and still does not fit Big Data requirements”.
1.4. Research Question
Addressing the problem statement, this thesis aims at bringing answers to the following
research questions:
1. What is the performance of HPC on Hadoop Physical Cluster (HPhC)?
2. Is it worth moving HPC to the cloud?
3. How do virtualization techniques affect HPCaaS performance?
4. Is there an optimal virtualization technique that can ensure good performance?
1.5. Research Objective
The purpose of the present research is to find solutions to the issues and questions raised in
the previous sections. Hence, this research introduces a new architecture that can handle
HPC complexity and increase its performance. The proposed architecture consists of building
a Hadoop Virtualized Cluster (HVC) in a private cloud using OpenStack. The first goal
of this research is therefore to investigate the added value of adopting a virtualized cluster, and
the second goal is to evaluate the impact of virtualization techniques on HPCaaS.
1.6. Research Approach
To evaluate HPCaaS over different virtualization technologies, we followed both qualitative
and quantitative research methodologies. The qualitative approach was adopted to select
appropriate technology enablers for building an architecture that addresses the issues raised
in this study. The quantitative approach, on the other hand, was adopted to conduct
experiments on three different clusters: Hadoop Physical Cluster (HPhC), Hadoop
Virtualized Cluster using KVM (HVC-KVM) [12] and Hadoop Virtualized Cluster using
VMware ESXi (HVC-VMware ESXi) [13]. Each experiment measures the performance
of HPC.
1.7. Thesis Organization
The rest of this thesis is structured as follows (Figure 1):
Part I covers chapter 1 (current chapter) which introduces the present research.
Part II covers chapters 2, 3, 4 and 5. Chapter 2 provides a basic understanding of cloud
computing; chapter 3 introduces virtualization; chapter 4 presents the concepts of Big
Data and HPCaaS, and chapter 5 lists related work and clearly states our
contribution.
Part III covers chapters 6, 7 and 8. Chapter 6 explains the steps we followed in selecting
the technology enablers of this research, and chapters 7 and 8 present OpenStack and
Hadoop in detail, respectively.
Part IV covers chapters 9, 10, 11 and 12. Chapter 9 presents the methodology adopted in
conducting this research; chapter 10 describes the environment prepared to run the
needed experiments; chapter 11 introduces the results, and chapter 12 discusses the
research findings.
Part V covers chapter 13, which concludes the research findings and proposes some future
work; this part also includes the bibliography and appendices of this study.
Figure 1: Thesis organization
Part II: Theoretical Baselines
The objective of this part is to elaborate and shed light on the scientific concepts, theories
and topics that serve as a foundation for understanding the whole picture of the present
research. Hence, this part is structured as follows: chapter 2 gives basic background on cloud
computing; chapter 3 introduces a key cloud-related technology, namely virtualization;
chapter 4 presents Big Data and HPCaaS, and chapter 5 situates this research by reviewing
previous work done on evaluating HPC.
Chapter 2: Cloud Computing
Cloud computing has become the innovative and emerging trend in delivering IT services,
attracting interest from both academia and industry. Using advanced technologies, cloud
computing provides end users with a variety of services, from hardware-level services to the
application level. Cloud computing is understood as utility computing over the Internet. That
is, computing services have moved from local data centers to hosted services offered over the
Internet and paid for on a pay-per-use basis [14]. This chapter provides an overview of the
cloud computing concept. It gives a distinct definition of what cloud computing is, defines
cloud computing characteristics, describes cloud service and deployment models, discusses
some cloud computing benefits, and finally lists some cloud computing providers.
2.1. Cloud Computing Definition
In the late 1960s, John McCarthy brought a new concept into the computer science field,
predicting that technology would not be provided only as tangible products [14]; rather,
computing resources would be provided as a service, like water and electricity. The concept
was known as utility computing, and nowadays it is known as cloud computing.
Cloud computing is defined by NIST (National Institute of Standards and Technology) [15] in
2009 as:
“Cloud computing is a model for enabling ubiquitous,
convenient, on-demand network access to a shared
pool of configurable computing resources (e.g.,
networks, servers, storage, applications, and services)
that can be rapidly provisioned and released with
minimal management effort or service provider
interaction. This cloud model is composed of five
essential characteristics, three service models, and
four deployment models. ”
The NIST definition of the cloud sheds light on the effective use of cloud computing in terms
of requiring minimal management effort for shared resources. It sets out five characteristics
that define cloud computing: on-demand self-service, broad network access, resource pooling,
rapid elasticity and measured service. Concerning the deployment models, NIST has
classified them into: private, public, community and hybrid cloud. More details about cloud
characteristics, delivery and deployment models are provided in the upcoming subsections.
The NIST definition of cloud is summarized in Figure 2 which encapsulates cloud computing
characteristics, service models, and deployment models.
Figure 2: NIST visual model of cloud computing definition [14]
2.2. Cloud Computing Characteristics
NIST has listed five main characteristics that precisely describe cloud computing, which are
[15]:
On-demand self-service: end users can use and change computing capabilities as desired
without the need for human interaction with each service provider.
Broad network access: resources are accessed over the network using standard mechanisms.
Resource pooling: the provider’s computing resources are pooled to serve multiple
consumers; these resources are dynamically assigned and reassigned according to
consumer demand. Examples of resources include storage, processing, memory, and
network bandwidth.
Rapid elasticity: cloud providers can elastically scale in and scale out resources
depending on current end users’ demand. Therefore, resources can be available for
provisioning in any quantity at any time.
Measured service: resources usage can be monitored, controlled and measured;
therefore, these features enable end users to pay using the pay as you go model.
Other characteristics, investigated in [16], are listed as follows:
Reliability: ensured by implementing and providing multiple redundant sites. With this
feature, cloud computing is considered an ideal solution for disaster recovery and
business-critical tasks.
Customization: cloud computing allows customization of infrastructure and applications
based on end users' demands.
Efficient resource utilization: resources are delivered only as long as they are needed.
2.3. Cloud Computing Service Models
Based on the NIST definition of cloud computing, cloud service models are classified as
follows:
Software as a Service (SaaS)
Software as a Service (SaaS) encompasses application software together with the operating
system and underlying computing resources. End users view the SaaS model as a web-based
application interface where services and complete software applications are delivered over the
Internet. Some examples of SaaS applications are: Google Docs, Microsoft Office Live,
Salesforce Customer Relationship Management, etc.
Platform as a Service (PaaS)
This service allows end users to create and deploy applications on the provider's cloud
infrastructure. In this case, end users do not manage or control the underlying cloud
infrastructure such as the network, servers, operating systems, or storage. However, they do
have control over the deployed applications, being allowed to design, model, develop and test
them. Examples of PaaS are: Google App Engine, Microsoft Azure, Salesforce, etc.
Infrastructure as a Service (IaaS)
This service consists of a set of virtualized computing resources such as network bandwidth,
storage capacity, memory, and processing power. These resources can be used to deploy and
run arbitrary software, which can include operating systems and applications. Examples of
IaaS providers are Dropbox, Amazon Web Services, etc.
Cloud services are summarized in Figure 3.
Figure 3: services provided in cloud computing environment [16]
2.4. Cloud Computing Deployment Models
Private Cloud
A private cloud is provisioned for exclusive use by a single organization. The cloud in this
case is owned, managed and operated by the organization, a third party, or both. The
advantage of a private cloud is high security, since the cloud is accessed only by trusted
entities within the organization [15].
Public Cloud
The cloud infrastructure is provisioned for use by the general public. It may be owned,
managed, and operated by a cloud service provider who offers services on a pay-per-use
model. In contrast to a private cloud, a public cloud is considered an untrusted environment [15].
Community Cloud
The cloud infrastructure is provisioned for exclusive use by a specific community of
consumers from different organizations that share some goals (e.g., mission, security
requirements, policy, and compliance considerations). In this case, the cloud may be owned,
managed, and operated by one or more organizations in the community, a third party, or
a combination of them [15].
Hybrid Cloud
This cloud is a combination of both private and public cloud environments. A hybrid cloud
provides high flexibility and choice for organizations; for instance, the critical core activities
of an organization can be run under the control of the private part of the hybrid cloud, while
other tasks are outsourced to the public part [17].
Table 1 summarizes cloud deployment models discussed above [17].
Table 1 : A Comparison of cloud deployment models [17]
2.5. Cloud Computing Benefits
Nowadays, the cloud is widely used because of the benefits it provides to end users. Some of
the key benefits offered by the cloud include [17, 18]:
Initial Cost Savings
Organizations and individuals can avoid the large initial investment required to launch new
hardware, products and services; the cloud platform offers the needed resources in terms of
infrastructure, platforms and applications.
Scalability
Cloud computing ensures high scalability by scaling resources up as needed. Therefore, when
usage increases, resources increase accordingly to meet end users'
demand.
Availability
Cloud providers have the infrastructure and bandwidth to accommodate business
requirements for high speed access, storage and systems.
Reliability
Cloud computing implements redundant paths to support business continuity and disaster
recovery.
Maintenance
End users are not concerned with resource maintenance, since it is handled by the cloud
service provider.
2.6. Cloud Computing Providers
There are many providers offering cloud services with different features and pricing. Some of
them are listed as follows [16, 19]:
Amazon Web Services
Amazon Web Services (AWS) [20] offers a number of cloud services for businesses of all
sizes. AWS implements advanced data privacy techniques to protect users' data. For that
reason, AWS has obtained various security certifications and audits such as ISO 27001,
FISMA Moderate and SAS 70 Type II. Some AWS services are: Elastic Compute Cloud,
Simple Storage Service, SimpleDB (a non-relational data storage service that stores, processes
and queries data sets in the cloud), etc.
Google
Google [21] offers high accessibility and usability in its cloud services. Some Google services
include: Google App Engine, Gmail, Google Docs, Google Analytics, Picasa (a tool used to
exhibit products and upload their images in the cloud), etc.
Microsoft
Microsoft [22] offers a famous cloud platform called Windows Azure which runs Windows
applications. Some other services include: SQL Azure, Windows Azure Marketplace (an
online market to buy and sell applications and data), etc.
OpenStack
OpenStack [23] is an open source platform for public and private cloud computing that aims
at ensuring scalability and flexibility. It was founded by Rackspace hosting and NASA.
Some other organizations that invest in the cloud are: Dell, IBM, Oracle, HP, Salesforce, etc.
[16].
Chapter 3: Virtualization
Cloud providers rely on many different technologies and practices; among them are Internet
protocols for communication, virtual private cloud provisioning, load balancing and
scalability, distributed processing, high performance computing technologies and
virtualization [24]. This chapter focuses on virtualization technology, as it is considered the
core of cloud computing. It describes in detail the history, benefits, types and abstraction
layer of virtualization.
3.1. Definition of Virtualization
Virtualization is a widely used term; it has been established for many years as a powerful
technology in computer science. The definition of virtualization can change depending on the
computer system component to which it is applied. Broadly, however, it is defined as an
abstraction layer between physical resources and their logical representation [25]. NIST has
defined virtualization as [26]:
“The simulation of the software and/or hardware upon which other
software runs. This simulated environment is called a virtual machine
(VM). There are many forms of virtualization, distinguished primarily by
computing architecture layer. For example, application virtualization
provides a virtual implementation of the application programming
interface (API) that a running application expects to use, allowing
applications developed for one platform to run on another without
modifying the application itself. The Java Virtual Machine (JVM) is an
example of application virtualization; it acts as an intermediary between
the Java application code and the operating system (OS). Another form of
virtualization, known as operating system virtualization, provides a
virtual implementation of the OS interface that can be used to run
applications written for the same OS as the host, with each application in
a separate VM container”.
Furthermore, virtualization is defined by SNIA (Storage Networking Industry Association) as
follows [27]:
“The act of abstracting, hiding, or isolating the internal functions of a
storage (sub) system or service from applications, host computers, or general
network resources, for the purpose of enabling application and network-
independent management of storage or data”.
From both definitions, we can say that virtualization is a methodology for dividing a physical
machine into multiple execution environments that allow multiple tasks to run
simultaneously. This is done by providing a software abstraction layer called the Virtual
Machine Manager (VMM) or hypervisor. The VMM is designed to hide the physical
resources from the operating system. In this way, the VMM allows the creation of multiple
guest Operating Systems (OS), each guest running in a software unit called a Virtual Machine
(VM) [28].
3.2. History of Virtualization
The roots of virtualization go back to the first virtualized IBM mainframes, designed in the
1960s, which allowed the company to run multiple applications and processes
simultaneously. In fact, the main drivers behind introducing virtualization were the high cost
of hardware and the need to run and isolate applications on the same hardware. During the
1970s, the adoption of virtualization technology increased sharply in many organizations
because of its cost effectiveness. However, in the 1980s and 1990s, hardware prices dropped
and multitasking operating systems emerged. With these developments, there was no longer a
need to ensure high CPU utilization, and therefore no need for virtualization technology. Yet,
in the late 1990s, virtualization was brought back to the market with the founding of VMware
Inc. by researchers from Stanford University. Nowadays, virtualization is widely used to
reduce management costs by replacing a bunch of under-utilized servers with a single server [29].
3.3. Benefits of Virtualization
There are several reasons that push organizations toward virtualization technology; some of
them are listed in [24, 29, 30] as follows:
Server Consolidation
It condenses multiple servers into one physical server hosting many virtual machines. This
allows the physical server to run at a high utilization rate, while reducing hardware
maintenance, power and cooling costs.
Application Consolidation
Legacy applications may conflict with newer hardware and/or operating systems. In this case,
virtualization can be used to provide the environment these applications require.
Sandboxing
Virtualization can provide a secure and isolated environment by running foreign or
less-trusted applications inside dedicated virtual machines.
Multiple Simultaneous OS
Virtualization makes it possible to run multiple operating systems simultaneously, each able
to run different types of applications.
Reducing Cost
Virtualization reduces cost deployment and configuration by ensuring less hardware, less
space and less staffing. Furthermore, virtualization reduces the cost of networking by
requiring less wirings, switches and hubs.
3.4. Virtualization Approaches
Virtualization can take different forms depending on the computer system component to
which it is applied [31]. In this section, we shed light on three well-known virtualization
techniques: full virtualization, paravirtualization, and hardware assisted virtualization.
3.4.1. Full Virtualization
In full virtualization, the guest OS is fully abstracted from the hardware by an added
virtualization layer: the VMM or hypervisor. In this case, the guest OS is not aware that it is
being virtualized, and it requires no modifications. This approach provides each VM with all
services of the physical system, including a virtual BIOS, virtual devices and virtualized
memory management. To manage the communication between the different layers, full
virtualization combines binary translation with direct execution (Figure 4). Binary translation
is used to convert privileged guest OS instructions into safe host instructions, while
application (user-level) instructions are executed directly on the processor to ensure high
performance [32]. Microsoft Virtual Server is an example of full virtualization; a toy sketch
of the binary-translation idea is given after Figure 4.
Figure 4: Full virtualization architecture [32]
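The following toy model, our own illustration with an invented instruction set, sketches the
binary-translation idea described above: a dispatcher scans the guest's instruction stream,
executes unprivileged instructions directly, and rewrites privileged ones into safe emulation
routines that touch only virtual CPU state.

    # Toy sketch of binary translation (illustrative; the instruction set is invented).

    def emulate_privileged(instr, cpu):
        # The VMM intercepts privileged instructions and emulates their effect
        # on virtual state instead of letting them touch real hardware.
        if instr == "CLI":            # guest tries to disable interrupts
            cpu["virtual_if"] = 0     # only the *virtual* interrupt flag changes
        elif instr == "HLT":          # guest tries to halt the CPU
            cpu["halted"] = True

    def run_guest(instructions):
        cpu = {"acc": 0, "virtual_if": 1, "halted": False}
        for instr in instructions:
            if cpu["halted"]:
                break
            if instr in ("CLI", "HLT"):        # privileged: translate and emulate
                emulate_privileged(instr, cpu)
            elif instr.startswith("ADD "):     # unprivileged: execute directly
                cpu["acc"] += int(instr.split()[1])
        return cpu

    print(run_guest(["ADD 5", "CLI", "ADD 2", "HLT"]))
    # -> {'acc': 7, 'virtual_if': 0, 'halted': True}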
3.4.2. Paravirtualization
The fundamental issue with full virtualization is the emulation of devices within the
hypervisor. This issue was addressed by the paravirtualization technique, which makes the
guest OS aware that it is being virtualized and gives it a more direct path to the underlying
hardware. In paravirtualization, the actual guest code is modified to use a different interface
that accesses the hardware directly or through virtual resources controlled by the hypervisor [32].
In more detail, paravirtualization changes the OS kernel to replace non-virtualizable
instructions with hypercalls that communicate directly with the hypervisor. Thus, when a
privileged operation is to be executed on the guest OS, it is delivered to the hypervisor
(instead of the OS) using a hypercall; the hypervisor receives this hypercall and accesses the
hardware to return the needed result (Figure 5). Xen is one of the systems that adopt
paravirtualization technology.
Figure 5: Paravirtualization architecture [32]
The downside of paravirtualization is that the guest must be modified to integrate hypervisor
awareness. This is a limitation, as some operating systems do not allow such modifications
(e.g., Windows 2000/XP), and even those that can be modified may need additional resources
for maintenance and troubleshooting [32].
3.4.3. Hardware Assisted Virtualization
Hardware assisted virtualization allows the VMM to run directly on the hardware. In this
case, the VMM controls the access of the guest OS to hardware resources. As depicted in
Figure 6, privileged and sensitive calls are trapped directly to the hypervisor, removing the
need for binary translation and paravirtualization. VMware ESX Server is one of the main
competing VMMs that use this approach [29].
Figure 6: Hardware assisted virtualization architecture [32]
3.5. Virtual Machine Manager
As defined before, the hypervisor or VMM is the layer between a host operating system and
the guest operating systems, or the layer between the hardware and the guest operating
systems. In [25], the author sets out three main properties that a VMM must maintain. The
first is that the VMM has to provide an environment essentially identical to the original
machine being virtualized. The second is that programs running on a VM should show the
same performance as on the original machine, or at most a minor decrease. Finally, the last
property states that the VMM must control all system resources provided to the VMs.
3.5.1. Hypervisor Types
Hypervisors are classified into Type 1 and Type 2. Type 1 hypervisors run directly on the
system hardware; they monitor the guest operating systems and allocate all needed resources,
including disk, memory, CPU and I/O peripherals. Having no intermediary between a Type 1
hypervisor and the physical layer leads to efficient performance in terms of hardware access
and security level (Figure 7-a). On the other hand, a Type 2 hypervisor runs on a host
operating system that provides virtualization services such as I/O and memory management
(Figure 7-b). Having an intermediary layer between the hypervisor and the hardware makes
the installation process easier than for a Type 1 hypervisor, since the operating system is in
charge of hardware configuration such as networking and storage [33].
Figure 7: a) Type 1 hypervisor b) Type 2 hypervisor [33]
The differences between Type 1 and Type 2 hypervisors can lead to different performance
results. The layer between the hardware and the hypervisor makes Type 2 performance less
efficient than Type 1. A sample scenario that illustrates this difference is when a virtual
machine requires hardware interaction (e.g., reading from disk); in this case, a Type 2
hypervisor needs first to pass the request to the operating system and then to the hardware
layer. Besides performance efficiency, the reliability of a Type 1 hypervisor is higher than
that of Type 2. For instance, a failure in the host operating system directly affects the hosted
guests under a Type 2 hypervisor; therefore, the availability of a Type 2 hypervisor is tightly
coupled to the operating system's availability. However, Type 2 hypervisors have some
advantages, such as fewer hardware/driver issues, since the host operating system is
responsible for interfacing with the hardware [34].
3.5.2. Examples of Hypervisors
a) Xen Hypervisor
The Xen hypervisor is a Type 1 (bare metal) hypervisor that is widely used for
paravirtualization [35]. It is managed by a specific privileged guest (privileged VM) called
Domain-0 (Dom0). Dom0 runs on the hypervisor and is responsible for managing all aspects
of the other, unprivileged virtual machines, known as DomainU (DomU). Furthermore, Dom0
has direct access to the physical resources, which is not the case for DomU guests [36]. The
overall architecture of the Xen hypervisor is shown in Figure 8.
Figure 8: Xen hypervisor architecture
Xen supports paravirtualization as well as full virtualization. In paravirtualization, DomU
guests are referred to as DomU PV Guests, and they can be modified Linux operating
systems, Solaris, FreeBSD, and other UNIX operating systems [37]. DomU PV Guests are
aware that they are running in a virtualized environment, and they do not have direct access
to the hardware resources. In this case, the guest operating system is modified to make special
calls (hypercalls) to the hypervisor for privileged operations, instead of the regular system
calls of a traditional unmodified operating system. In full virtualization, on the other hand,
DomU guests are referred to as DomU HVM Guests and run any standard, unchanged
operating system [37]. A DomU HVM guest is not aware that it is sharing processing time on
the hardware, nor is it aware of the presence of other virtual machines. In this case, DomU
HVM requires processors that specifically support hardware virtualization extensions (Intel
VT or AMD-V). These virtualization extensions allow many of the privileged kernel
instructions (which in PV were converted to hypercalls) to be handled by the hardware using
the trap-and-emulate technique.
b) KVM Hypervisor
The KVM hypervisor provides a full virtualization solution based on the Linux operating
system. It works by reusing the hardware assisted virtualization extensions that were already
developed; thus, KVM requires the presence of Intel VT or AMD-V extensions on the host
system. When the KVM module is loaded, it converts the Linux kernel into a bare metal
hypervisor. As a result, it takes full advantage of many components already present within the
kernel, such as memory management and scheduling [38]. KVM is implemented using two
main components: the first is the KVM loadable module which, when installed in the Linux
kernel, manages the virtualization hardware (Figure 9); the second provides PC platform
emulation and is offered by a modified version of QEMU. QEMU is executed as a user-space
process, coordinating with the kernel for guest operating system requests [39]. A short sketch
showing how such a KVM/QEMU stack is driven through the libvirt management API is
given after Figure 9.
Figure 9: KVM hypervisor architecture
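As a concrete illustration of how such a KVM/QEMU stack is driven in practice, the sketch
below uses the libvirt Python bindings, a management API commonly layered on top of
KVM. It assumes a host with the libvirt daemon and its Python bindings installed; it is our
example, not part of the thesis setup.

    # Enumerate KVM guests through libvirt (illustrative sketch).
    import libvirt

    # Connect to the local KVM/QEMU hypervisor exposed by the libvirt daemon.
    conn = libvirt.open("qemu:///system")
    try:
        print("Host:", conn.getHostname())
        for dom in conn.listAllDomains():
            state = "running" if dom.isActive() else "shut off"
            print(f"guest {dom.name()}: {state}")
    finally:
        conn.close()

Notably, OpenStack's compute service (Nova) drives KVM through this same libvirt
interface, which is why the hypervisor choice is largely transparent to the cloud layer.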
c) VMware ESXi Hypervisor
VMware was an early leader in virtualization technology. One of its virtualization products is
VMware ESXi, which is installed directly on top of the physical machine [40]. VMware ESXi
was introduced in 2007 to provide high levels of reliability and performance to companies of
all sizes. The overall architecture of VMware ESXi is illustrated in Figure 10. The main
component is the vmkernel, which contains all the processes necessary to manage VMs. It
provides functionality similar to that found in other operating systems, such as process
creation and control, signals, a file system, and process threads. The vmkernel thus supports
running multiple virtual machines and provides core functionality such as resource
scheduling, I/O stacks and device drivers [24].
Figure 10: VMware ESXi architecture [40]
Chapter 4: Big Data and High Performance
Computing as a Service
As big companies such as Google, Amazon, Facebook, LinkedIn and Twitter grow in terms of
users and data generated, the capacity and computing power of current data tools become
insufficient for processing, analyzing, managing and storing that data efficiently. IBM
estimates that every day 2.5 quintillion bytes of data are created, and that 90% of the data in
the world today has been created in the last two years [41]. Besides, Oracle estimated that 2.5
zettabytes of data were generated in 2012, and that this amount will grow significantly every
year (Figure 11) [42]. The increase in data size to many terabytes and petabytes is known as
Big Data. To handle the complexity of Big Data, HPC is adopted to provide high computation
capabilities, high bandwidth, and low latency networks. This chapter provides an overview of
the Big Data phenomenon and the HPCaaS concept.
Figure 11: Data growth between 2008 and 2020 [54]
4.1. Big Data
4.1.1. Big Data Definition
Big Data is defined as large and complex datasets that are generated from different sources
including social media, online transactions, sensors, smart meters and administrative services
[43]. Given all these sources, the size of Big Data goes beyond the ability of typical tools to
store, analyze and process data. The literature on Big Data divides the concept into four
dimensions: Volume, Velocity, Variety and Value [43].
Volume: the size of data generated is very large, ranging from terabytes to petabytes.
Velocity: data grows continuously at an exponential rate.
Variety: data are generated in different forms: structured, semi-structured and unstructured data. These forms require new techniques that can handle data heterogeneity.
Value: the challenge in Big Data is to identify what is valuable, so as to be able to capture, transform and extract data for analysis.
4.1.2. Big Data Technologies
With the Big Data phenomenon, there is an increasing demand for new technologies that can support the volume, velocity, variety and value of data. Among these new technologies are NoSQL databases, parallel and distributed paradigms, and new cloud computing trends that can support the four dimensions of Big Data.
NoSQL (Not Only Structured Query Language) marks the transition from relational to non-relational databases [44]. It is characterized by the ability to scale horizontally, the ability to replicate and partition data over many servers, and the ability to provide high-performance operations. However, moving from relational to NoSQL systems has eliminated some of the ACID transactional properties (Atomicity, Consistency, Isolation and Durability) [45]. In this context, NoSQL properties are framed by the CAP theorem [46], which states that developers must make trade-offs among consistency, availability and partition tolerance. Some examples of NoSQL tools are Cassandra [47], HBase [48], MongoDB [49] and CouchDB [50].
Other supporting technologies for Big Data are parallel and distributed paradigms (e.g. Hadoop) and cloud computing services (e.g. OpenStack). These technologies are discussed in the upcoming chapters (Part III, Chapters 7 and 8).
4.2. High Performance Computing as a Service (HPCaaS)
4.2.1. HPCaaS Overview
High Performance Computing (HPC) is used to process and analyze large and complex problems, including scientific, engineering and business problems that require high computation capabilities, high bandwidth, and a low-latency network [3]. HPC meets these requirements by implementing large physical clusters. However, traditional HPC faces a set of challenges that consist in peak demand, high capital outlay, and the high expertise needed to acquire and operate the HPC infrastructure [51]. To deal with these issues, HPC experts have leveraged the benefits of new technology trends, including cloud technologies, parallel processing paradigms and large storage infrastructures. Merging HPC with these new technologies has given rise to a new HPC model, called HPC as a Service (HPCaaS).
HPCaaS is an emerging computing model in which end users have on-demand access to pre-existing needed technologies that provide a high-performance and scalable HPC computing environment [52]. HPCaaS provides substantial benefits because of the better quality of service provided by cloud technologies, and the better parallel processing and storage provided by, for example, the Hadoop Distributed File System and the MapReduce paradigm. Some HPCaaS benefits are stated in [51] as follows:
High Scalability: resources scale up to provide the essential resources that fit users' demand for processing large and complex datasets.
Low Cost: end users can eliminate the initial capital outlay, time and complexity needed to procure HPC.
Low Latency: by implementing the placement group concept, which ensures the execution and processing of data in the same rack or on the same server.
4.2.2. HPCaaS Providers
There are many HPCaaS providers in the market. One example is Penguin Computing [53], which has been a leader in designing and implementing high performance environments for over a decade. Nowadays, it provides HPCaaS with different options: on-demand services, private HPCaaS services and hybrid HPCaaS services. Amazon Web Services (AWS) [3] is also an active HPCaaS provider in the market; it provides simplified tools to perform HPC over the cloud. AWS allows end users to benefit from HPCaaS features with different pricing models: On-Demand, Reserved [54] or Spot Instances [55]. HPCaaS on AWS is currently used for computer-aided engineering, molecular modeling, genome analysis, and numerical modeling across many industries, including oil and gas, financial services and manufacturing [3]. Other HPCaaS leaders in the market are Microsoft (Windows Azure HPC) [56] and Google (Google Compute Engine) [57].
Chapter 5: Literature Review and Research Contribution
In order to bridge the gap between the present research and previous studies, a review was conducted on the current state of HPC and virtualization. This chapter thus situates the research in relation to previous research publications and clearly states the research contribution.
5.1. Related Work
There have been several studies that evaluated the performance of high performance computing in the cloud. Most of these studies used Amazon EC2 [20] as the cloud environment [58-63]. Besides, only a few studies have evaluated the performance of high performance computing using the combination of both new emerging distributed paradigms and a cloud environment [64].
In [58], the authors evaluated HPC on three different cloud providers: Amazon EC2, GoGrid Cloud and IBM Cloud. For each cloud platform, they ran HPC on Linux virtual machines (VMs), and they came to the conclusion that the tested public clouds do not seem to be optimized for running HPC applications. This was explained by the fact that public cloud platforms have slow network connections between virtual machines. Furthermore, the authors in [13] evaluated the performance of HPC applications in today's cloud environments (Amazon EC2) to understand the tradeoffs in migrating to the cloud. Overall results indicated that running HPC on the EC2 cloud platform limits performance and causes significant variability. Besides Amazon EC2, research done in [63] evaluated the performance-cost tradeoffs of running HPC applications on three different platforms. The first and second platforms are physical clusters (the Taub and Open Cirrus clusters), and the third platform is a Eucalyptus cloud. Running HPC on these platforms led the authors to conclude that the cloud is more cost-effective for applications with low communication intensity.
In order to understand the performance implications of running HPC on virtualized resources and distributed paradigms, the authors in [64] performed an extensive analysis using Eucalyptus (16 nodes) and other technologies such as Hadoop [7], Dryad and DryadLINQ [65], and MapReduce [6]. The conclusion of this research suggested that most parallel applications can be handled in a fairly easy manner when using cloud technologies (Hadoop, MapReduce, and Dryad); however, scientific applications, which require complex communication patterns, still require more efficient runtime support.
HPC has also been evaluated independently of new cloud technologies, using different virtualization technologies [66, 67, 68, 69]. In [66], the authors performed an analysis of virtualization techniques including VMware, Xen, and OpenVZ. Their findings showed that none of the techniques matches the performance of the base system perfectly; yet, OpenVZ demonstrates high performance in both file system performance and industry-standard benchmarks. In [67], the authors compared the performance of KVM and VMware. Overall findings showed that VMware performs better than KVM; still, in a few cases KVM gave better results than VMware. In [68], the authors conducted a quantitative analysis of two leading open-source hypervisors, Xen and KVM. Their study evaluated the performance isolation, overall performance and scalability of virtual machines for each virtualization technology. In short, their findings showed that KVM has substantial problems with guests crashing (as the number of guests increases); however, KVM still has better performance isolation than Xen. Finally, in [69] the authors extensively compared four hypervisors: Hyper-V, KVM, VMware, and Xen. Their results demonstrated that there is no perfect hypervisor.
5.2.Contribution
So far, there are only a few studies that have compared different virtualization techniques and their impact on HPC in the cloud. The only study we found was done in [70], where the authors compared the performance of adopting Xen, KVM and VirtualBox. Each virtualization technology was compared with bare metal using a set of high performance benchmarking tools. The results of this research demonstrated that KVM is the best choice for HPC in the cloud because of its rich features and near-native performance.
The present research fills this literature gap by examining the impact of virtualization techniques on HPCaaS using OpenStack as the cloud platform and Hadoop as the distributed and parallel system.
Part III: Technology Enablers
This part explains the use of OpenStack and Hadoop as the underlying technologies for this research. It starts by providing a qualitative study for selecting an appropriate cloud platform and distributed system; the second chapter of this part introduces the OpenStack components in detail, and the third chapter presents Hadoop and its main aspects.
Chapter 6: Technology Enablers Selection
The architecture we adopted to evaluate the impact of virtualization on HPCaaS was built after conducting a qualitative study of the tools available in the market. We mainly targeted open-source tools to select an appropriate cloud computing platform and distributed system. Hence, this chapter presents the analysis we followed in selecting the cloud platform and the distributed system.
6.1.Cloud Platform Selection
To compare the available open-source cloud platforms, we focused on the most popular ones. The selection of competing platforms was based on a study that compared the popularity of OpenStack, OpenNebula, Eucalyptus and CloudStack in 2013 [71]. As depicted in Figure 12, the study showed that OpenStack has the largest total population index, followed by Eucalyptus, CloudStack, and OpenNebula.
Figure 12: Active cloud community population [71]
Based on Figure 12, we selected OpenStack, OpenNebula and Eucalyptus for comparison and study. To adopt one of these open-source clouds, we relied on other studies that compare their performance and quality [72-75].
In [72], the authors compared some open-source and commercial cloud platforms. Concerning open platforms, they compared OpenNebula and Eucalyptus. To perform the comparison, they adopted a set of criteria, including storage, virtualization, network, management, security and vendor support. The results of the research showed that open-source and commercial solutions can have comparable features, and that OpenNebula is the most feature-complete cloud platform when compared with Eucalyptus.
[73] and [74] provide comparison studies of OpenStack and OpenNebula. In [73], the authors compared the performance of both cloud platforms by measuring the time when the cloud starts instantiating VMs and the time when the VMs are ready to accept SSH connections. The findings of the research demonstrate that OpenStack is slightly better than OpenNebula due to its smaller instantiation time. Moreover, the results showed that OpenStack is more suitable for high performance computing due to its faster instantiation of a large number of VMs. In [74], the authors used qualitative and quantitative analyses to compare OpenStack and OpenNebula. For the qualitative analysis, they adopted criteria such as security, supported virtualization, access, image support, resource selection, storage support, high-availability support and API support. Based on the results of the qualitative study, the authors concluded that OpenStack is preferable when auto-scaling is needed, while OpenNebula is preferable when persistent storage support is needed. For the quantitative analysis, the authors measured the deployment time, network overhead and clean-up time of VMs. The results of the quantitative analysis showed that either platform can be used, depending on user requirements and specifications.
In [75], the authors provided a comparative study of four solutions: Eucalyptus, OpenStack, OpenNebula and CloudStack. To perform the comparison, the authors adopted the following criteria: storage, network, security, hypervisor, scalability and code openness of the installation. In short, the results of this study [75] showed that OpenStack is the preferred open-source cloud. Table 2 summarizes the preferred cloud IaaS in [72-75]. Based on this table, we decided to go for OpenStack, as it is known for its flexibility and total openness.
Table 2 : Cloud IaaS selection
6.2.Distributed and Parallel System Selection
To compare the distributed and parallel systems available in the market, we again opted for the popularity index of those systems. The selection of competing systems was based on a study done in [76]. The study is summarized in Figure 13, which compares the popularity index of Hadoop, MongoDB, Cassandra, CouchDB, Redis, VoltDB, Neo4j, Riak and Infinispan. The study was done in 2012, and it reports the total downloads between January 2011 and March 2012. Figure 13 shows that Hadoop is the most popular distributed system, followed by MongoDB and Cassandra.
Figure 13: Active distributed systems population [76]
Based on Figure 13, we performed a qualitative analysis of both Hadoop and MongoDB in
order to end up with one selected system for the present research.
MongoDB is a document-oriented database that stores data as collections of JSON-like documents, using a binary form of JSON called BSON (Binary JSON), rather than in tables with columns and rows. To provide high redundancy and make data highly available, MongoDB offers replication across multiple servers. While data is synchronized between servers using replication, MongoDB also facilitates scaling out by supporting sharding, which partitions a collection and stores the different portions on different machines. MongoDB can be combined with MapReduce so as to process data in parallel at each shard [62]. On the other hand, Hadoop is an open-source distributed system that supports processing, analyzing and storing large data sets across large clusters using the MapReduce paradigm and HDFS [7]. More details about Hadoop are included in Chapter 8.
A study done in [77] compares the MongoDB and Hadoop systems. The study came to three main conclusions: first, it is not appropriate to use MongoDB as an analytics platform; second, using Hadoop for MapReduce jobs is several times faster than using the built-in MongoDB MapReduce capability; and third, MongoDB is much slower than HDFS. Besides, a study done in [78] compared the MapReduce performance of Hadoop and MongoDB. In short, the study showed that MongoDB is roughly four times slower than Hadoop in fully-distributed mode.
Table 3 summarizes the distributed systems selected in [77] and [78]. Based on this table, we decided to go for Hadoop as the analytics and storage tool for the present research.
Table 3 : Parallel and distributed platform selection
Chapter 7: OpenStack
OpenStack is an open-source platform for public and private cloud computing that aims at ensuring scalability and flexibility. It was developed by a wide range of developers and contributors using mainly Python (68%), XML (16%) and JavaScript (5%) [79]. This chapter provides a detailed description of OpenStack, including a brief history, its components, the corresponding architecture, and finally some supported hypervisors.
7.1.OpenStack Overview
The formal definition of OpenStack was stated in [80], which defines OpenStack as "a cloud operating system that controls large pools of compute, storage, and networking resources throughout a datacenter, all managed through a dashboard that gives administrators control while empowering their users to provision resources through a web interface". From this definition, OpenStack is considered an Infrastructure as a Service (IaaS).
An important feature of OpenStack is that it provides a web interface, called the dashboard, and APIs that make its services available through Amazon EC2- and S3-compatible APIs. This feature ensures that all existing tools that work with Amazon's cloud platform can also work with the OpenStack platform [81].
7.2.OpenStack History
OpenStack began as a collaboration project between Rackspace Hosting and NASA. Both organizations had planned to release internal cloud projects for object storage and compute. Rackspace contributed its Cloud Files platform to support the storage part of OpenStack, while NASA contributed its Nebula platform to support the compute part [82]. In July 2010, both organizations released the first version of OpenStack under the Apache 2.0 License. In September 2012, the OpenStack Foundation was established as an independent entity with the mission of protecting, empowering, and promoting the OpenStack software. The OpenStack project is currently supported by more than 150 companies, including AMD, Intel, Canonical, Red Hat, Cisco, Dell, HP, IBM and Yahoo! [83].
7.3.OpenStack Releases
OpenStack issues successive releases with new improvements and contributions. All OpenStack releases since 2010 are listed in Table 4 [79].
Table 4 : OpenStack releases [79]
7.4.OpenStack Components
The core components of the OpenStack software are the OpenStack Compute Infrastructure (Nova), the OpenStack Object Storage Infrastructure (Swift) and the OpenStack Image Service Infrastructure (Glance). Besides these components, OpenStack includes the Identity Service (Keystone), Network Service (Quantum), Dashboard Service (Horizon) and Block Storage (Cinder). Table 5 summarizes the main components of OpenStack and the corresponding code names.
Table 5 : OpenStack projects
Taking into consideration the previously mentioned OpenStack components, a conceptual architecture of OpenStack is provided in Figure 14, which shows how the OpenStack components are interconnected [79].
Figure 14: OpenStack conceptual architecture [79]
7.4.1. OpenStack Compute (Nova)
Nova provides flexible management for virtual machines by allowing users to create, update,
and terminate virtual machines. The overall architecture of Nova (Figure 15) is composed of
the following sub-components: nova-api, nova-scheduler, nova-compute, nova-volume, queue
and database [82].
Figure 15: Nova subcomponents
Nova-api is responsible for accepting and fulfilling API requests. A request consists of actions to be performed by the nova subcomponents. Nova-api provides an endpoint for all API queries and enforces some policies. If the request concerns managing virtual machines, nova-compute is put in charge of creating or terminating virtual machine instances. Normally, nova-compute receives requests from the queue subcomponent. In order to manage virtual machine instances, nova-compute uses different drivers, such as the libvirt software package, the Xen API and the vSphere API, to support the various virtualization technologies. To specify where to send a request, nova-scheduler retrieves the request from the queue and determines which compute server host it should run on. When persistent storage is needed, nova-volume handles the creation, attachment, and detachment of persistent volumes to virtual machine instances [82].
Nova also provides network management through its subcomponent nova-network. The latter accepts networking tasks from the queue and then performs system commands to manipulate the network. Nova-network defines two types of IP addresses: fixed IPs and floating IPs. A fixed IP is a private IP that is assigned to an instance for its life cycle. A floating IP, on the other hand, is a public IP that is used for external connectivity. The network itself, as defined in nova-network, can be classified into three categories: Flat, FlatDHCP and VLAN [82].
Flat assigns a fixed IP address to an instance and attaches that IP to a common bridge (created by the administrator).
FlatDHCP builds upon the Flat manager by providing DHCP services to handle instance addressing and the creation of bridges.
VLAN provides a subnet and a separate bridge for each project. The range of IPs of a given project is only accessible within the VLAN.
The last subcomponents of Nova are the queue and the database. The queue is responsible for passing messages between the nova subcomponents to facilitate communication between them; it is implemented using RabbitMQ. The nova database stores most of the configuration and run-time state of the cloud infrastructure; it contains a set of tables covering instance types, instances in use, available networks, fixed IPs, projects and virtual interfaces [82].
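To illustrate this queue-based message passing, the sketch below publishes a request to a RabbitMQ queue in the same spirit as Nova's internals. It assumes a local RabbitMQ broker and the pika client library; the queue name and message format are illustrative and do not reproduce Nova's actual RPC protocol:

    import json
    import pika  # RabbitMQ client library

    # Connect to the broker (Nova uses RabbitMQ in this same role).
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="compute")  # illustrative queue name

    # A scheduler-like component enqueues a "run instance" request, which a
    # worker such as nova-compute would later consume from the same queue.
    request = {"method": "run_instance", "args": {"instance_id": "vm-001"}}
    channel.basic_publish(exchange="", routing_key="compute", body=json.dumps(request))
    connection.close()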
7.4.2. OpenStack Image Service (Glance)
Glance manages virtual disk images. It consists of three main subcomponents: glance-api, glance-registry and the glance database (Figure 16). Glance-api accepts incoming API requests and then communicates them to the other components (glance-registry and the image store); the glance database stores all information about images, and glance-registry is responsible for retrieving and storing image metadata [82].
Figure 16: Glance subcomponents
7.4.3. OpenStack Identity Service (Keystone)
Keystone authorizes users' access to the OpenStack components. It supports multiple forms of authentication, including standard username/password credentials and token-based systems. The Keystone architecture comprises the following subcomponents (Figure 17): the token backend, catalog backend, policy backend and identity backend [82].
Figure 17: Keystone subcomponents
7.4.4. OpenStack Object Store (Swift)
Swift is the oldest project within OpenStack, and it is the underlying technology that powers Rackspace's Cloud Files service [82]. Swift provides a massively scalable and redundant object store by writing multiple copies of each object to multiple, separate storage servers so as to handle failures efficiently. The Swift component consists of the Proxy Server, Account Server, Container Server, and Object Server (Figure 18).
Figure 18: Swift subcomponents
Swift-proxy accepts incoming requests, such as uploading files, modifying metadata and creating containers. Requests are served by the account server, the container server or the object server: object servers manage pre-existing objects or files in the storage; the account server manages the accounts defined in the object storage service, and the container server manages the mapping of containers (folders) within the object store service [82].
7.4.5. OpenStack Block Storage Service (Cinder)
Cinder allows block devices to be connected to virtual machine instances for better performance. It consists of the following subcomponents: cinder-api, cinder-volume, the cinder database and cinder-scheduler (Figure 19).
Cinder-api accepts incoming requests and directs them to cinder-volume, which performs reads and writes to the cinder database to maintain state and interacts with other processes. Cinder-scheduler is responsible for selecting the optimal block storage node on which to create a volume. In order to maintain communication between the Cinder components, a message queue is used.
Figure 19: Cinder subcomponents
7.4.6. OpenStack Network Service (Quantum)
Quantum allows users to create their own networks and then attach interfaces to them. It consists of quantum-server, quantum-agent, quantum-plugin and the quantum database (Figure 20). Quantum-server accepts incoming API requests and then directs them to the correct quantum-plugin. Plugins and agents perform specific actions such as plugging and unplugging ports, creating networks and subnets, and IP addressing. Finally, the quantum database stores the networking state for particular plugins.
Figure 20: Quantum subcomponents
7.5.OpenStack Supported Hypervisors
The abstraction provided by OpenStack Compute allows it to support various existing hypervisors, including KVM, LXC, QEMU, UML, VMware ESX/ESXi, Xen, PowerVM and Hyper-V [79]. However, KVM is still the most widely used hypervisor in OpenStack deployments. Besides KVM, other existing deployments run Xen, LXC, VMware and Hyper-V, but each of these hypervisors either lacks support for some features or is not well documented for use with OpenStack.
Chapter 8: Hadoop
Hadoop has been adopted by big players in the market such as Google, Yahoo!, LinkedIn, Facebook, The New York Times and IBM [84]. This chapter provides a detailed overview of Hadoop, starting with a brief history of this open-source project, followed by the corresponding architecture, implementation and some related features.
8.1.Hadoop Overview
Hadoop is an open-source Apache project, written in Java, that supports processing, analyzing and storing large data sets across large clusters using the MapReduce paradigm and HDFS [85]. Hadoop has been designed to be reliable, fault-tolerant and scalable, able to scale from a single machine to thousands of machines.
8.2.Hadoop History
Hadoop traces back to 2002, when Doug Cutting created an open-source project for web crawling and indexing, first named Nutch. Nutch was developed to handle search problems, but it faced a scalability problem, as it would not scale up to billions of web pages. To deal with this issue, the Nutch team took inspiration from Google's distributed filesystem (GFS). By adopting the GFS architecture, the Nutch team delivered in 2004 an open-source filesystem called the Nutch Distributed Filesystem (NDFS) [86].
When Google published its paper on the MapReduce algorithm, the Nutch team took advantage of that work by introducing MapReduce into its NDFS system. Implementing both NDFS and MapReduce made Nutch a powerful system for web crawling and indexing. This success pushed the Nutch team to spin off an independent project in 2006, named the Hadoop project. By this time, Doug Cutting had joined Yahoo!, which provided enough resources to improve Hadoop's performance. Even though Yahoo! developed and contributed about 80% of the Hadoop project, Hadoop was made its own top-level project at Apache in January 2008 [87]. Besides implementing MapReduce and HDFS, the Hadoop project includes other subprojects, which are listed in Table 6 [85].
Table 6: Apache Hadoop subprojects
The Hadoop subprojects are grouped together under the name Hadoop Ecosystem. The overall picture of the Hadoop Ecosystem is illustrated in Figure 21.
Figure 21: Apache Hadoop subprojects [85]
8.3.Hadoop Architecture
Hadoop implements a master/slave architecture, where the master is named the NameNode and the slaves are named DataNodes. The NameNode manages the file system namespace, which consists of a hierarchy of files and directories used for data storage. When a file is created by a client application, it is divided into blocks; each block is replicated and stored on DataNodes. The information about the replication factor (the number of block copies) and the mapping of blocks to replicas is stored in the NameNode. On the other hand, each DataNode is in charge of managing the storage attached to the node on which it runs. Furthermore, each DataNode handles the read and write operations, as well as the block creation, deletion, and replication instructions that come from the NameNode [86].
Besides the NameNode and DataNodes, a Hadoop cluster consists of a Secondary NameNode (a backup node for the NameNode), a JobTracker and TaskTrackers. The JobTracker is located on the master node, and it is responsible for distributing MapReduce tasks to the other nodes in the cluster. Each TaskTracker locally runs the tasks distributed by the JobTracker; each slave in the cluster contains one TaskTracker, and a TaskTracker can also run on the master node [86].
The overall architecture of Hadoop is illustrated in Figure 22.
Figure 22: Hadoop Architecture
8.4.Hadoop Implementation
Hadoop is mainly implemented using HDFS and the MapReduce paradigm. HDFS is used to store large data sets, while MapReduce is used to analyze and process data across the Hadoop cluster. With respect to the architecture provided in Figure 22, the HDFS concept is represented by the NameNode, Secondary NameNode and DataNodes, while MapReduce is represented by the JobTracker and TaskTracker (Figure 23).
Figure 23: HDFS and MapReduce representation
8.4.1. HDFS Overview
HDFS is designed as a hierarchy of files and directories. Each file is divided into blocks that are stored on different DataNodes. The NameNode stores only the metadata, which includes information about the blocks' locations and the number of copies of each block. Furthermore, HDFS allows the NameNode to perform namespace operations such as opening, closing and renaming files and directories. As stated before, HDFS performs data replication to ensure fault tolerance. The replication factor is set when a file is created, and it can be modified later [85].
The read, write and creation operations illustrate the HDFS process. During the read operation, the HDFS client requests from the NameNode the list of DataNodes that host replicas of the blocks of a given file. The list is sorted by network topology distance from the client. After deciding on the DataNode from which to fetch data, the HDFS client contacts that DataNode directly and requests the desired block. During the write operation, on the other hand, the HDFS client asks the NameNode to choose the DataNodes that will store replicas of the first block of the file, then of the second block, and so on. For each block, the client organizes a node-to-node pipeline and sends the data. When the first block is filled, the client requests new DataNodes to be chosen to host replicas of the next block. Concerning the creation operation, when there is a request to create a file, the HDFS client first caches the file data in a temporary local file. When the latter accumulates data up to the HDFS block size, the client contacts the NameNode to insert the file name into the file system namespace and to allocate a data block for it. The NameNode then selects the DataNodes that will host the data blocks. At this stage, the client moves the block of data from the local temporary file to the specified DataNode [85].
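The read path can be condensed into a few lines. This is only an illustrative sketch: the namenode and DataNode objects below are hypothetical stubs standing in for Hadoop's actual RPC interfaces, not a real client API:

    def hdfs_read(namenode, path, client_location):
        # Illustrative HDFS read path (hypothetical client/NameNode stubs).
        data = []
        # 1. Ask the NameNode which DataNodes host replicas of each block.
        for block in namenode.get_block_locations(path):
            # 2. The list comes back sorted by network-topology distance
            #    from the client; pick the closest replica.
            replicas = sorted(block.replicas,
                              key=lambda dn: dn.distance_to(client_location))
            closest = replicas[0]
            # 3. Contact that DataNode directly and fetch the block data.
            data.append(closest.read_block(block.block_id))
        return b"".join(data)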
8.4.2. MapReduce Overview
Hadoop MapReduce is a programming paradigm that processes very large data sets in a parallel manner on large clusters. It was first introduced by Google in 2004 [6]. The core idea of MapReduce is splitting the input data set into chunks that are processed by map tasks in parallel. The output of each map task is sorted and then directed as input to the reduce tasks. From this definition, MapReduce can be divided into two steps: the map step and the reduce step [88].
The map task is itself divided into five phases: read, map, collect, spill and merge. The read phase consists of reading a data chunk from HDFS and creating the input key-value pairs. The map phase executes the user-defined map function to generate the map-output data. The collect phase gathers the intermediate (map-output) data into a buffer before spilling. The spill phase sorts the data, performs compression if specified, and writes it to local disk to create file spills. The last phase of the map task is the merge phase, which merges all file spills into one single map output file [88].
The reduce task is likewise divided into four phases: shuffle, merge, reduce and write. The shuffle phase transfers the intermediate data (the map output) from the mapper slaves to a reducer's node, decompressing it if needed. The merge phase merges the sorted outputs that come from the different mappers, to be directed as input to the reduce phase. The reduce phase executes the user-defined reduce function to produce the final output data. Finally, the write phase compresses the final output, if needed, and writes it to HDFS [88].
A popular example that illustrates MapReduce execution is the word count example, which counts the number of occurrences of each individual word in a given file (Figure 24) [89].
Figure 24: Word count MapReduce example [89]
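To make the example concrete, below is a minimal word count written for Hadoop Streaming in Python; the canonical version in [89] is a Java MapReduce job, so this streaming variant is only an equivalent sketch. The mapper emits a (word, 1) pair per word, and the reducer sums the counts for each word, relying on the framework to sort the map output by key:

    #!/usr/bin/env python
    # mapper.py -- emits a (word, 1) pair for every word read from stdin
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # reducer.py -- sums the counts per word; input arrives sorted by key
    import sys

    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            count += int(value)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, count))
            current_word, count = word, int(value)
    if current_word is not None:
        print("%s\t%d" % (current_word, count))

The two scripts can then be submitted with the streaming jar shipped with Hadoop 1.2.1 (the jar path is its usual location, and the HDFS paths are illustrative): hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.2.1.jar -file mapper.py -mapper mapper.py -file reducer.py -reducer reducer.py -input /wordcount/input -output /wordcount/output.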
8.5.Hadoop Cluster Connectivity
When a Hadoop cluster starts up, each DataNode performs a handshake with the NameNode. The purpose of this operation is to verify the namespace ID and the software version of the DataNode. The namespace ID is assigned to the filesystem instance when it is formatted, and it is stored on all nodes of the cluster. Nodes with a different namespace ID will not be able to join the cluster. If the namespace ID matches, the handshake between the DataNode and the NameNode completes successfully. At this point, each DataNode registers with its unique storage ID, which is an internal identifier of the DataNode. The main purpose of this ID is to keep the DataNode recognizable even if it is restarted with a different IP address or port [87].
During normal operation, DataNodes send heartbeats to the NameNode to confirm that the DataNode is operating and that the block replicas it hosts are available. The default heartbeat interval is three seconds. If the NameNode does not receive a heartbeat from a DataNode within ten minutes, the NameNode considers the DataNode dead and creates new replicas, on other DataNodes, of the blocks that were hosted on the dead node. In fact, heartbeats are not only used for ensuring NameNode-DataNode connectivity; they are also used to send statistical information such as the total storage capacity and the fraction of storage in use. Another role of heartbeats is to carry instructions from the NameNode back to the DataNodes. Those instructions include commands to replicate blocks to other DataNodes, remove local block replicas, re-register and send an immediate block report, or shut down the node. These commands are important for maintaining the overall system integrity, and therefore it is critical to keep heartbeats frequent even on big clusters. The NameNode can process thousands of heartbeats per second without affecting its other operations [87].
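The NameNode's bookkeeping around heartbeats can be sketched as follows; this is a simplified illustration of the logic described above, and the class and method names are ours rather than Hadoop's:

    import time

    HEARTBEAT_INTERVAL = 3        # seconds (HDFS default)
    DEAD_NODE_TIMEOUT = 10 * 60   # ten minutes without a heartbeat

    class NameNodeMonitor:
        # Illustrative sketch of the NameNode's heartbeat bookkeeping.
        def __init__(self):
            self.last_heartbeat = {}  # storage ID -> time of last heartbeat

        def on_heartbeat(self, storage_id, stats):
            # Heartbeats confirm liveness and carry statistics such as the
            # total capacity and the fraction of storage in use.
            self.last_heartbeat[storage_id] = time.time()
            # The reply piggybacks instructions (replicate, remove replicas,
            # re-register, send a block report, shut down).
            return self.pending_commands(storage_id)

        def dead_nodes(self):
            # A DataNode silent for more than ten minutes is considered dead;
            # its blocks must then be re-replicated on other DataNodes.
            now = time.time()
            return [sid for sid, ts in self.last_heartbeat.items()
                    if now - ts > DEAD_NODE_TIMEOUT]

        def pending_commands(self, storage_id):
            return []  # placeholder in this sketch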
Part IV: Research Contribution
To clarify the steps we followed in this study, we divided this part into four chapters: 9, 10, 11 and 12. Chapter 9 defines the research methodology; chapter 10 describes the experimental setup used to measure the performance of HPCaaS; chapter 11 presents the results we got from each experiment, and finally, chapter 12 discusses and analyzes the research findings.
Chapter 9: Research Methodology
The choice of research methodology depends mainly on the nature of the research question. This chapter discusses the methodology followed in conducting the present study: it first explains the choice of the selected methodology, and then gives an overall picture of the research steps.
9.1.Research Approach
The present research was based on a combination of qualitative and quantitative approaches. The qualitative approach was followed to compare and select the appropriate technology enablers for this research (Part III, Chapter 6), whereas the quantitative approach was adopted to provide numeric measurements of HPC on the physical cluster and the virtualized clusters (Part IV, Chapters 10, 11 and 12).
9.2.Research Steps
Figure 25 summarizes the steps followed in conducting the present research.
Figure 25 : Research steps
Chapter 10: Experimental Setup
In order to investigate the research question, we have conducted three main experiments. The
first experiment evaluates the performance of HPC on Hadoop Physical Cluster (HPhC); the
second experiment evaluates the performance of HPC using Hadoop Virtualized Cluster
(HVC) with KVM, and the last experiment evaluates HPC using Hadoop virtualized cluster
with VMware ESXi virtualization technology.
This chapter describes the experimental setup used in this research: it provides an overall picture of the three adopted clusters; specifies the hardware, software and network configurations; introduces the benchmarks used to evaluate the performance of HPC on each cluster; lists the dataset sizes used in each experiment; and finally explains the experimental execution of the present research.
10.1.Experimental Hardware
In our performance study, we built three different clusters: a Hadoop Physical Cluster, a Hadoop Virtualized Cluster using KVM and a Hadoop Virtualized Cluster using VMware ESXi. Each cluster is composed of eight machines.
For the physical cluster, we used 8 Dell OptiPlex 755 desktop computers with the specifications listed in Table 7. For both Hadoop virtualized clusters (KVM and VMware ESXi), we used a Dell PowerEdge server with the features listed in Table 8. On top of the server, we installed OpenStack to create eight virtual machines, first using the KVM hypervisor and then the VMware ESXi hypervisor. Because of OpenStack's limited flexibility, we could only create VMs with the features described in Table 9.
Table 7 : Dell OptiPlex 755 computer features (used for Hadoop physical cluster)
Table 8 : Dell PowerEdge server used for building OpenStack & Hadoop virtualized cluster
Table 9 : OpenStack virtual machines’ features
10.2.Experimental Software and Network
As stated in chapter 6, we opted for Hadoop to process and store small and large datasets; we chose to install Hadoop version 1.2.1. Concerning OpenStack, the adopted version is the Folsom release, which supports KVM, Xen, VMware and other hypervisors. The network configuration provided a bandwidth of 100 Mbps per port.
10.3.Clusters Architecture
In this section, we will conceptualize each individual cluster in terms of its layers and
components.
10.3.1. Hadoop Physical Cluster
Figures 26 and 27 give an overall picture of the Hadoop Physical Cluster. The configuration was done in the Linux Lab at AUI. The lab is connected to a 1 Gbps switch (providing 100 Mbps per port) that also serves other offices in the building where the lab is located. As both figures depict, the cluster contains eight machines, where one machine was selected to act as both the master and a slave node at the same time. The reason for having the master node also serve as a slave is to increase the cluster performance when processing and storing datasets.
Figure 26 : Hadoop Physical Cluster
Figure 27: Hadoop Physical Cluster architecture
10.3.2. Hadoop Virtualized Cluster – KVM
The second cluster we built in this research is the Hadoop Virtualized Cluster with KVM technology. As Figure 28 shows, the first step in configuring the cluster is to install an operating system on the Dell PowerEdge server; the selected OS is Ubuntu 12.04 LTS (Precise), 64-bit. The next step is to install and configure the KVM packages, which are loaded into the Linux OS as the KVM driver. After preparing the system with the OS and the KVM hypervisor, the next step is to install OpenStack on top of the OS (the OpenStack-with-KVM documentation is provided in Appendix A). Finally, the last step is to configure Hadoop on top of each OpenStack VM instance (the Hadoop documentation is provided in Appendix C).
Figure 28: Hadoop virtualized cluster - KVM
The first OpenStack component to install is Keystone, which manages authentication to OpenStack resources. After downloading and installing the Keystone package, the next step is to create tenants (OpenStack projects) and OpenStack users that are associated with one or more tenants. Each user can be a member or an admin of a given project; roles therefore need to be created in order to assign rights and privileges to each user. After creating users, tenants and roles, the next step is to create the OpenStack services (nova, keystone and glance) that provide one or more endpoints (URLs) through which users can access OpenStack resources. The second component to install is Glance, which allows creating and managing different image formats (Ubuntu, Fedora, Windows, etc.). The Glance packages include glance-api, which accepts incoming API requests; glance-database, which stores all information about images, and glance-registry, which is responsible for retrieving and storing image metadata. The third component to deploy is the Nova package, which includes nova-compute, nova-scheduler, nova-network, nova-objectstore, nova-api, rabbitmq-server, novnc and nova-consoleauth. All these components collaborate and communicate with each other to create and manage instances, networks and, if needed, volumes. Finally, to get access to instances, a user-friendly interface can be installed by configuring the OpenStack dashboard (Horizon). After logging in to the dashboard, the user can launch instances, with the possibility of specifying the number of CPUs, disk space, total RAM per VM, etc.
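For instance, the tenant/user/role step above can be scripted. The sketch below assumes the Folsom-era keystone command-line client with admin credentials already exported in the environment; all names are illustrative, and the exact flags vary across keystone client versions:

    import subprocess

    def run(cmd):
        # Echo and execute one CLI command, failing fast on errors.
        print("+ " + " ".join(cmd))
        subprocess.check_call(cmd)

    run(["keystone", "tenant-create", "--name", "hpcaas"])    # an OpenStack project
    run(["keystone", "user-create", "--name", "hadoop", "--pass", "secret"])
    run(["keystone", "role-create", "--name", "member"])
    # A keystone user-role-add call then binds the user to the tenant with the
    # created role, after which the services and endpoints are registered.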
After creating the VM instances (with the requirements listed in Table 9), we installed Hadoop 1.2.1 on each VM. The Hadoop configuration starts by identifying the master node and the slave nodes. For the master node, there are six files that need to be configured: core-site, hadoop-env, hdfs-site, mapred-site, masters and slaves. Concerning the slave nodes, the only files that need to be configured are hadoop-env, core-site, hdfs-site and mapred-site. Once the nodes are connected, the cluster's file namespace needs to be formatted. After formatting HDFS, the cluster can be started to run jobs.
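As an illustration, the key properties of these files can be generated with a few lines of Python. The property names (fs.default.name, mapred.job.tracker, dfs.replication) are the standard Hadoop 1.x keys, but the master hostname, ports and replication factor below are illustrative values that must match the actual cluster:

    # Minimal property sets for a Hadoop 1.x cluster (illustrative values).
    CONFIGS = {
        "core-site.xml":   {"fs.default.name": "hdfs://master:54310"},
        "mapred-site.xml": {"mapred.job.tracker": "master:54311"},
        "hdfs-site.xml":   {"dfs.replication": "3"},
    }

    TEMPLATE = "<?xml version=\"1.0\"?>\n<configuration>\n%s\n</configuration>\n"

    def render(props):
        # Each property becomes a <property><name>..</name><value>..</value> entry.
        return TEMPLATE % "\n".join(
            "  <property><name>%s</name><value>%s</value></property>" % (k, v)
            for k, v in props.items())

    for filename, props in CONFIGS.items():
        with open(filename, "w") as f:
            f.write(render(props))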
10.3.3. Hadoop Virtualized Cluster – VMware ESXi
The third cluster built in this research is the Hadoop Virtualized Cluster using VMware ESXi technology (Figure 29). The first step in configuring this cluster is to install VMware ESXi directly on the Dell PowerEdge server. Then, OpenStack is configured on top of the hypervisor (the OpenStack-with-VMware-ESXi documentation is provided in Appendix B). After configuring OpenStack, instances can be created to build the Hadoop cluster.
Figure 29: Hadoop virtualized cluster – VMware ESXi (a)
In fact, when installing OpenStack with VMware ESXi, OpenStack itself is installed as a VM on top of the VMware ESXi hypervisor. Then, through the OpenStack dashboard, instances can be created as VMs on top of the VMware ESXi hypervisor (Figure 30).
Figure 30 : Hadoop virtualized cluster – VMware ESXi (b)
10.4.Experimental Performance Benchmarks
To evaluate the impact of machine virtualization on HPCaaS, we adopted two well-known benchmarks: TeraSort and TestDFSIO [90]. The TeraSort performance metric is the average time to sort given datasets, while the TestDFSIO performance metrics are the execution times to write and read datasets. Table 10 summarizes the performance metrics used in evaluating HPCaaS.
Table 10 : Experimental performance metrics
10.4.1. TeraSort Description
TeraSort was developed by Owen O'Malley and Arun Murthy at Yahoo! Inc. [90]. It won the annual general-purpose terabyte sort benchmark in 2008 and 2009. It performs considerable computation, networking, and storage I/O, and is often considered representative of real Hadoop workloads [90]. TeraSort is divided into three main steps: TeraGen, TeraSort and TeraValidate.
TeraGen generates the random data that will be sorted by TeraSort. It writes the generated data as a file of n rows, where each row is 100 bytes. Each row is formatted as follows: a 10-byte key, a 10-byte rowid and a 78-byte filler, where the keys are random characters from the set ' ' .. '~', the rowid is an integer that specifies the row id, and the filler consists of runs of 10 characters from 'A' to 'Z'. Once the data is generated, TeraSort sorts it using the quicksort algorithm. The latter is integrated with the map/reduce tasks so as to use a sorted list of n-1 sampled keys that define the key range for each reduce [9]. Finally, TeraValidate ensures that the output data of TeraSort is sorted. It creates one map task per file in TeraSort's output directory; each map task ensures that each key is less than or equal to the previous one. Furthermore, the map task generates records with the first and last keys of the file; the reduce tasks then ensure that the first key of file i is greater than the last key of file i-1. If there are any unordered keys, TeraValidate reports them as output of the reduce task [90]. (The TeraSort benchmark is documented in Appendix D.)
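To make the record layout concrete, the following standalone sketch builds rows in the format just described. The real TeraGen is a MapReduce job, so this is only an illustration of the 100-byte record, with the trailing carriage return and newline completing the row:

    import random

    def teragen_row(rowid):
        # One 100-byte row: 10-byte key, 10-byte rowid, 78-byte filler, "\r\n".
        key = "".join(chr(random.randint(0x20, 0x7e)) for _ in range(10))  # ' '..'~'
        rid = "%010d" % rowid                      # right-justified row id
        runs = [chr(ord("A") + (rowid + i) % 26) * 10 for i in range(8)]
        filler = "".join(runs)[:78]                # runs of 10 chars, cut to 78 bytes
        return key + rid + filler + "\r\n"         # 10 + 10 + 78 + 2 = 100 bytes

    rows = [teragen_row(i) for i in range(3)]
    assert all(len(r) == 100 for r in rows)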
10.4.2. TestDFSIO Description
The TestDFSIO benchmark is used to check the I/O rate of a Hadoop cluster with write and read operations. Such a benchmark is helpful for testing HDFS, checking network performance, and validating the hardware, OS and Hadoop setup [90]. TestDFSIO is written in Java, and its source code can be found in [91]. TestDFSIO is composed of TestDFSIO-Write and TestDFSIO-Read. Both operations are performed by specifying the number of files and the size of each file in megabytes [90]. (The TestDFSIO benchmark is documented in Appendix D.)
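A typical invocation looks like the sketch below, which assumes the hadoop binary is on the PATH and uses the hadoop-test-1.2.1.jar bundled with Hadoop 1.2.1; the file counts and sizes are illustrative:

    import subprocess

    HADOOP_TEST_JAR = "hadoop-test-1.2.1.jar"  # shipped with Hadoop 1.2.1

    def test_dfsio(mode, nr_files, file_size_mb):
        # TestDFSIO takes the number of files and the size of each file in MB.
        subprocess.check_call(
            ["hadoop", "jar", HADOOP_TEST_JAR, "TestDFSIO",
             "-" + mode, "-nrFiles", str(nr_files), "-fileSize", str(file_size_mb)])

    test_dfsio("write", 10, 1000)  # write 10 files x 1000 MB (about 10 GB in total)
    test_dfsio("read", 10, 1000)   # read the same dataset back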
10.5 Experimental Dataset Sizes
In each experiment, we measured the performance of the Hadoop cluster using different dataset sizes. For TeraSort, we used 100 MB, 1 GB, 10 GB and 30 GB datasets, and for TestDFSIO, we used 100 MB, 1 GB, 10 GB and 100 GB datasets. Table 11 summarizes the dataset sizes used in this research.
Table 11 : Datasets size used for Hadoop benchmarks
10.6 Experiment Execution
We conducted each experiment by scaling the cluster from three machines up to eight machines. In other words, we tested each benchmark on three machines, then four machines, and so on until we reached eight machines. Furthermore, for each individual benchmark, we performed three runs on 100 MB, 1 GB, 10 GB and 30 GB (TeraSort) and on 100 MB, 1 GB, 10 GB and 100 GB (TestDFSIO), and then calculated the mean in order to smooth out outliers and provide more accurate results. Figure 31 summarizes the steps of running experiment 1 on the HPhC using the TeraSort benchmark.
Figure 31 : Experimental execution
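All reported numbers are therefore means of three runs; the aggregation itself is trivial (the timings below are illustrative):

    def mean_of_runs(times):
        # Average the execution times of the repeated benchmark runs.
        return sum(times) / float(len(times))

    runs = [21.1, 21.6, 21.3]                   # three TeraSort timings in seconds
    print("mean: %.2f s" % mean_of_runs(runs))  # -> mean: 21.33 s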
Chapter 11: Experimental Results
This chapter presents the findings obtained from each experiment. It presents the results of running HPC on the HPhC, then on the HVC with KVM, and then on the HVC with VMware ESXi. The last section compares the results of the three experiments. (The raw results are listed in Appendices E and F.)
11.1.Hadoop Physical Cluster Results
11.1.1. TeraSort Performance on HPhC
Running the TeraSort benchmark showed that sorting large datasets such as 10 GB and 30 GB takes considerable time. Yet, scaling the cluster to more nodes led to a significant reduction in sorting time. The results obtained from running this benchmark on the Hadoop Physical Cluster are listed in Table 12 and plotted in Figure 32.
Table 12: Average time (in seconds) of running TeraSort on different dataset sizes and different
number of nodes- Hadoop Physical Cluster
Figure 32: TeraSort performance on Hadoop Physical Cluster
Figures 33 and 34 clearly illustrate the benefit of scaling the cluster. For instance, sorting 100 MB with 3 nodes takes around 21.33 seconds, while with 8 nodes it takes 19.97 seconds (a reduction of about 6%). In the case of 1 GB, the average time was reduced by 4% when scaling from 3 to 8 nodes.
Figure 33: TeraSort performance for 100 MB on Hadoop Physical Cluster
Figure 34: TeraSort performance for 1 GB on Hadoop Physical Cluster
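The percentages quoted in these comparisons are simple relative reductions of the mean times; for example, for the 100 MB case above:

    def pct_reduction(before, after):
        # Percent reduction in average time between two cluster sizes.
        return 100.0 * (before - after) / before

    print("%.1f%%" % pct_reduction(21.33, 19.97))  # 3 -> 8 nodes: 6.4%, i.e. ~6%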
Concerning 10 GB, the results were somewhat different (Figure 35). The time to sort 10 GB was reduced by 18.55% when scaling from 3 to 6 machines, yet increasing the number of machines to 8 nodes led to a significant drop in sorting performance. This can be explained by the impact of network bottlenecks, to which Hadoop is highly sensitive. Furthermore, scaling to 8 nodes had a notable impact when running large datasets like 30 GB (Figure 36): in this case, the average time to sort the dataset was reduced by 24.77% (a difference of 42 minutes) when increasing the number of nodes from 3 to 8.
Figure 35: TeraSort performance for 10 GB on Hadoop Physical Cluster
Figure 36: TeraSort performance for 30 GB on Hadoop Physical Cluster
11.1.2. TestDFSIO- Write Performance on HPhC
Running TestDFSIO-Write on the Hadoop physical cluster generally follows one pattern: as the number of nodes increases, the average time to write the different dataset sizes decreases. Table 13 and Figure 37 list and illustrate the results of running TestDFSIO-Write on the HPhC.
Table 13: Average time (in seconds) of running TestDFSIO-Write on different dataset sizes and
different number of nodes- Hadoop Physical Cluster
Figure 37: TestDFSIO-Write performance on Hadoop Physical Cluster
Zooming in on TestDFSIO-Write for the 100 MB dataset (Figure 38), the average time decreased as the number of slaves increased. In this case, scaling the cluster from 3 machines (including the master) to 8 machines led to a reduction of 11.25% in the overall average writing time. The same observation applies when running TestDFSIO-Write on the 1 GB dataset (Figure 39), where the average time was reduced by 46.5% when scaling from 3 to 8 slaves.
Figure 38: TestDFSIO-Write performance for 100 MB on Hadoop Physical Cluster
Figure 39: TestDFSIO-Write performance for 1 GB on Hadoop Physical Cluster
Figure 40: TestDFSIO-Write performance for 10 GB on Hadoop Physical Cluster
Figure 41: TestDFSIO-Write performance for 100 GB on Hadoop Physical Cluster
When writing 100 GB (Figure 41), we observe a sharp reduction in TestDFSIO-Write time when scaling from 3 to 8 slaves, quantified at 12.53%. However, an unexpected increase in average time occurred when scaling from 4 to 5 machines; again, this unexpected result can be explained by the overall network performance.
11.1.3. TestDFSIO- Read Performance on HPhC
Running TestDFSIO-Read also led to significant performance improvements when the physical cluster was scaled up to 8 machines (Table 14 and Figure 42). In general, this observation applies to all dataset sizes.
Table 14: Average time (in seconds) of running TestDFSIO-Read on different dataset sizes and
different number of nodes- Hadoop Physical Cluster
Figure 42: TestDFSIO-Read performance on Hadoop Physical Cluster
When the cluster was scaled from 3 to 7 nodes, the average time was reduced by 4.36% for reading 100 MB (Figure 43) and by 2.46% for reading 1 GB (Figure 44). However, when scaling the cluster from 7 to 8 machines, the average time suddenly increased for both 100 MB and 1 GB. The same observation was made when reading 10 GB and 100 GB (Figures 45 and 46).
Figure 43: TestDFSIO-Read performance for 100 MB on Hadoop Physical Cluster
Figure 44: TestDFSIO-Read performance for 1 GB on Hadoop Physical Cluster
Figure 45: TestDFSIO-Read performance for 10 GB on Hadoop Physical Cluster
Figure 46: TestDFSIO-Read performance for 100 GB on Hadoop Physical Cluster
11.2.Hadoop Virtualized Cluster- KVM Results
11.2.1. TeraSort Performance on HVC-KVM
Running TeraSort on the Hadoop KVM cluster showed an important improvement in sorting the various dataset sizes. Yet, this observation applies only when scaling the KVM cluster from 3 to 5 VMs. The results obtained from running this benchmark on the Hadoop KVM cluster are listed in Table 15 and plotted in Figure 47.
Table 15: Average time (in seconds) of running TeraSort on different dataset sizes and different
number of nodes- Hadoop KVM Cluster
Figure 47: TeraSort performance on Hadoop KVM Cluster
From Figure 48, sorting 100 MB on 3 VMs takes around 15 seconds, and the time decreases by 2.2% and 5.5% when sorting the dataset on 4 and 5 VMs, respectively.
Figure 48: TeraSort performance for 100 MB on Hadoop KVM Cluster
Figure 49: TeraSort performance for 1 GB on Hadoop KVM Cluster
When sorting 1 GB, 10 GB and 30 GB (Figures 49, 50 and 51), the performance improved slightly as the number of VMs increased. For example, the sorting time for 10 GB decreased by 0.3%, and the sorting time for 30 GB decreased by 5%, when scaling from 3 to 4 nodes. However, when the cluster was scaled to 5, 6, 7 and 8 nodes, the overall performance of sorting 1 GB, 10 GB and 30 GB dropped sharply.
Figure 50: TeraSort performance for 10 GB on Hadoop KVM Cluster
Figure 51: TeraSort performance for 30 GB on Hadoop KVM Cluster
11.2.2. TestDFSIO-Write Performance on HVC-KVM
The performance of TestDFSIO-Write on the Hadoop KVM cluster improved slightly as the number of VMs increased. The results of running TestDFSIO-Write are listed in Table 16 and illustrated in Figure 52.
Table 16: Average time (in seconds) of running TestDFSIO-Write on different dataset sizes and
different number of nodes- Hadoop KVM Cluster
Figure 52 : TestDFSIO-Write performance on Hadoop KVM Cluster
For all dataset sizes (Figures 53, 54, 55 and 56), the overall performance improved slightly as the number of VMs increased from 3 to 4 and 5. For instance, writing 10 GB improved by 1.6% when scaling from 3 to 5 VMs. Furthermore, when trying to write 100 GB, the system crashed because of the overall system overhead (Figure 56).
Figure 53: TestDFSIO-Write performance for 100 MB on Hadoop KVM Cluster
Figure 54: TestDFSIO-Write performance for 1 GB on Hadoop KVM Cluster
Figure 55: TestDFSIO-Write performance for 10 GB on Hadoop KVM Cluster
Figure 56: TestDFSIO-Write performance for 100 GB on Hadoop KVM Cluster
11.2.3. TestDFSIO- Read Performance on HVC-KVM
TestDFSIO-Read shows the same behavior as TestDFSIO-Write: the performance of reading the different dataset sizes increases as the number of VMs increases from 3 to 5. The results obtained from running TestDFSIO-Read are given in Table 17 and Figure 57.
Table 17 : Average time (in seconds) of running TestDFSIO-Read on different dataset sizes and different number of nodes- Hadoop KVM Cluster
Figure 57: TestDFSIO-Read performance on Hadoop KVM Cluster
As Figures 58, 59, 60 and 61 depict, the overall performance of reading the different dataset sizes increases as the number of VMs increases from 3 to 5. For example, the average time for reading 100 GB decreased slightly, by 3%, when scaling from 3 to 5 VMs.
Figure 58: TestDFSIO-Read performance for 100 MB on Hadoop KVM Cluster
Figure 59: TestDFSIO-Read performance for 1 GB on Hadoop KVM Cluster
Figure 60: TestDFSIO-Read performance for 10 GB on Hadoop KVM Cluster
Figure 61: TestDFSIO-Read performance for 100 GB on Hadoop KVM Cluster
11.3.Hadoop Virtualized Cluster- VMware ESXi Results
11.3.1. TeraSort Performance on HVC-VMware ESXi
Table 18 and Figure 62 present the performance of running TeraSort on the Hadoop VMware ESXi cluster; the overall observation shows a significant improvement in sorting the various dataset sizes. In contrast to the KVM cluster, VMware ESXi keeps decreasing the average sorting time as the number of VMs increases from 3 to 6 (for large datasets).
Table 18 : Average time (in seconds) of running TeraSort on different dataset sizes and different number of nodes- Hadoop VMware ESXi Cluster
Figure 62 : TeraSort performance on Hadoop VMware ESXi Cluster
As Figure 64 depicts, sorting 1 GB became 23% faster when scaling the cluster from 3 to 6 VMs. Yet, performance starts degrading as the number of VMs increases from 6 to 7 and 8.
Figure 63: TeraSort performance for 100 MB on Hadoop VMware ESXi Cluster
Figure 64: TeraSort performance for 1 GB on Hadoop VMware ESXi Cluster
Significantly higher performance was observed when sorting 30 GB (Figure 66): performance improved by 34% when scaling from 3 to 6 VMs, by 25% from 3 to 7 VMs, and by 3% from 3 to 8 VMs.
Figure 65: TeraSort performance for 10 GB on Hadoop VMware ESXi Cluster
Figure 66: TeraSort performance for 30 GB on Hadoop VMware ESXi Cluster
11.3.2. TestDFSIO-Write Performance on HVC-VMware ESXi
The performance of TestDFSIO-Write on Hadoop VMware ESXi improved as the number of VMs increased up to 7. The results of running TestDFSIO-Write are listed in Table 19 and illustrated in Figure 67.
Table 19 : Average time (in seconds) of running TestDFSIO-Write on different dataset sizes and different number of nodes- Hadoop VMware ESXi Cluster
Figure 67 : TestDFSIO-Write performance on Hadoop VMware ESXi Cluster
For all dataset sizes (Figures 68, 69, 70 and 71), the overall performance improved as the number of VMs increased from 3 to 7. For instance, writing 100 MB improved by 37% when scaling from 3 to 7 VMs. Furthermore, when writing a large dataset like 10 GB, the overall performance increased by 12% when scaling from 3 to 7 VMs. However, in the case of 100 GB, performance started degrading when scaling from 6 to 7 and 8 VMs.
Figure 68: TestDFSIO-Write performance for 100 MB on Hadoop VMware ESXi Cluster
Figure 69: TestDFSIO-Write performance for 1 GB on Hadoop VMware ESXi Cluster
Figure 70: TestDFSIO-Write performance for 10 GB on Hadoop VMware ESXi Cluster
Figure 71: TestDFSIO-Write performance for 100 GB on Hadoop VMware ESXi Cluster
11.3.3. TestDFSIO- Read Performance on HVC-VMware ESXi
TestDFSIO-Read behaves like TestDFSIO-Write in that the performance of reading the different dataset sizes increases as the number of VMs increases from 3 to 7. However, the average time for reading the different datasets was less than half that of the write operation. The results obtained from running TestDFSIO-Read on VMware ESXi are listed in Table 20 and plotted in Figure 72.
Table 20: Average time (in seconds) of running TestDFSIO-Read on different dataset sizes and different numbers of nodes - Hadoop VMware ESXi Cluster
Figure 72 : TestDFSIO-Read performance on Hadoop VMware ESXi Cluster
Figures 73, 74, 75 and 76 show the performance of running TestDFSIO-Read on each individual dataset. For most dataset sizes, the performance improved as the number of VMs increased up to 7. For instance, the performance of reading 100GB improved by 36% when scaling from 3 to 7 VMs. However, reading 1GB behaved differently, as the corresponding performance started to decline at 6 VMs.
Figure 73: TestDFSIO-Read performance for 100 MB on Hadoop VMware ESXi Cluster
Figure 74: TestDFSIO-Read performance for 1 GB on Hadoop VMware ESXi Cluster
Figure 75: TestDFSIO-Read performance for 10 GB on Hadoop VMware ESXi Cluster
Figure 76: TestDFSIO-Read performance for 100 GB on Hadoop VMware ESXi Cluster
11.4. Results Comparison
11.4.1. TeraSort Performance
The overall performance of the 3 clusters varies depending on the dataset size and the number of nodes involved in each cluster. Yet, the Hadoop VMware ESXi cluster performed much better than the other clusters when running the TeraSort benchmark on large datasets.
Starting with 100MB (Figure 77), TeraSort showed high performance when virtualized with VMware ESXi and KVM. The two clusters were 25% (VMware ESXi) and 30% (KVM) faster than the physical cluster (in the case of 3 nodes). Further, significant performance was maintained when scaling the cluster to 4, 5 and 6 nodes; in these cases too, both KVM and VMware ESXi were faster than the physical cluster.
After increasing the number of nodes to 7 and 8, VMware ESXi performance decreased by 33% and became slower than the physical cluster by 18% (when scaling from 3 to 8 nodes). On the other hand, the average time for sorting the 100MB dataset on the KVM cluster declined as the number of nodes increased to 7 and 8; the sorting time improved from 15 to 14 seconds. Further, the virtualized cluster (KVM) performed better than the physical cluster by 29.5% and 27% for 7 and 8 nodes respectively.
Figure 77 : Average time for sorting 100 MB on HPhC, HVC with KVM and
VMware ESXi
When increasing the dataset size, the performance changes in each scenario (dataset size and number of nodes). In the case of 1GB (Figure 78), the virtualized clusters kept the best performance compared with the physical cluster. When the cluster was composed of 3-5 nodes, the virtualized clusters sorted the 1GB dataset in 87-90 seconds, while the physical cluster sorted the same dataset in 182-187 seconds. When increasing the number of nodes from 5 to 8, VMware ESXi was faster than the other clusters; however, KVM's performance declined compared both with the KVM cluster of 3-4 nodes and with the physical cluster. For instance, in the case of 8 machines, the physical cluster was faster than the KVM cluster by 89%.
Figure 78 : Average time for sorting 1 GB on HPhC, HVC with KVM and VMware ESXi
The same observation made for 1GB applies when sorting the 10GB dataset (Figure 79). Yet, in this case, the performance of the virtualized clusters was much higher than that of the physical cluster. For instance, in the case of 5 VMs, the VMware ESXi cluster was faster than the physical cluster by 60%, and KVM was faster than the physical cluster by 51%.
Figure 79 : Average time for sorting 10 GB on HPhC, HVC with KVM and
VMware ESXi
When moving to larger datasets, the VMware ESXi cluster demonstrated significant performance in sorting the 30 GB dataset (Figure 80). For instance, in the case of 4 nodes, VMware ESXi was faster than the KVM cluster by 28% and faster than the physical cluster by 61%. Moreover, KVM performed better than the physical cluster when the cluster was composed of 3, 4, 5 and 6 nodes. Afterward, when increasing the cluster size to 7 and 8 nodes, the KVM cluster's performance decreased and became slower than the physical cluster.
Figure 80: Average time for sorting 30 GB on HPhC, HVC with KVM and VMware ESXi
The last observation concerns VMware ESXi performance on the 8-node cluster. For all datasets, we observed that VMware performance degraded; for example, for 10 GB, the performance decreased by 51%. Even so, VMware ESXi kept performing better than the other clusters.
11.4.2. TestDFSIO-Write Performance
The results obtained from TestDFSIO differ from those of the TeraSort benchmark. The overall observation of Figures 81 and 82 shows that virtualization still performs better than the physical cluster.
In the case of the 3-5 node clusters, we can observe that the KVM cluster's performance is much better than that of VMware ESXi and the physical cluster. For instance, when writing 100 MB using 5 nodes, the KVM cluster was 11% faster than the physical cluster and 24% faster than the VMware ESXi cluster (Figure 81). However, we observed that the physical cluster performed better than VMware ESXi, with a difference of about 48% (100 MB using 5 nodes).
When scaling the cluster from 5 to 8 nodes, the KVM cluster exhibited a sharp performance degradation. Again, this is due to system overhead. In this case, the physical cluster showed better results than the virtualized clusters.
Figure 81: Average time for writing 100 MB on HPhC, HVC with KVM and VMware ESXi
Figure 82: Average time for writing 1 GB on HPhC, HVC with KVM and VMware ESXi
The same observation applies when writing 100 GB (Figure 84). The only difference is that the KVM cluster with 8 nodes was unable to write the 100 GB dataset.
Figure 83: Average time for writing 10 GB on HPhC, HVC with KVM and VMware ESXi
Figure 84: Average time for writing 100 GB on HPhC, HVC with KVM and VMware ESXi
11.4.3. TestDFSIO-Read Performance
As illustrated in Figures 85 and 86, reading small datasets (100MB and 1GB) showed that the virtualized cluster is faster than the physical cluster. Yet, this applied to the KVM cluster only when it was composed of 3-5 nodes. Afterwards, when the KVM cluster scaled to 6, 7 and 8 nodes, the performance of reading all datasets degraded. On the other hand, the physical cluster performed better than VMware ESXi in all cases (100MB and 1GB on different numbers of nodes).
Figure 85: Average time for reading 100 MB on HPhC, HVC with KVM and HVC VMware ESXi
Figure 86: Average time for reading 1 GB on HPhC, HVC with KVM and HVC VMware ESXi
When increasing the dataset size to 10 GB and 100GB (Figures 87 and 88), we can see different performance trends. When the cluster is composed of 3-5 nodes, the KVM cluster kept better performance than the other clusters. For instance, for 100 GB and 3 nodes, the KVM cluster was faster than VMware ESXi by 12% and faster than the physical cluster by 44%. However, as with the other benchmarks (TeraSort and TestDFSIO-Write), the KVM cluster showed a sharp degradation in reading 100GB when the cluster scaled to 6, 7 and 8 nodes. When reading 10GB and 100 GB, in contrast to the TestDFSIO-Write results, the VMware ESXi cluster was faster than the physical cluster in all scenarios (numbers of nodes); for instance, it was faster than the physical cluster by 36% and 55.5% in the case of 7 and 8 nodes respectively.
An important observation is that the KVM cluster with 8 VMs was able neither to write nor to read the 100GB dataset (Figure 88).
Figure 87: Average time for reading 10 GB on HPhC, HVC with KVM and HVC VMware ESXi
Figure 88: Average time for reading 100 GB on HPhC, HVC with KVM and HVC VMware ESXi
Chapter 12: Discussion
The results obtained in this research demonstrated significant improvements when virtualizing HPC, especially when the latter was tested with the TeraSort benchmark; in this case, we found that both virtualized clusters (KVM and VMware ESXi) achieve better performance than the physical cluster.
12.1. TeraSort Performance
When running the TeraSort benchmark, the VMware ESXi cluster proved fastest at sorting large datasets (1GB, 10 GB and 30 GB). For instance, sorting 30GB using a cluster of 4 nodes showed that VMware ESXi is faster than KVM by 64% and faster than the physical cluster by 84% (Figure 80). The KVM cluster also proved to be faster than the physical cluster. However, when the number of nodes increased in the virtualized clusters, TeraSort performance degraded significantly.
In the case of the KVM cluster, when the number of nodes increased to 6, 7 and 8, the overall performance of running TeraSort became slower. The reason behind this degradation is system overhead, especially disk overhead. A study in [92] analyzed KVM scalability on the OpenStack platform and stated that KVM is not recommended when many virtual hard disks are accessed at the same time. Therefore, since TeraSort comprises both computational and I/O jobs, the KVM VMs affected the overall performance when they were scaled to 6, 7 and 8. Moreover, another study [93] states that KVM has substantial problems with guests crashing once a certain number of VMs is reached (4 in that study); hence, scalability is considered a source of system overhead when using KVM virtualization.
In the case of the VMware ESXi cluster, the performance of running TeraSort declined when the cluster was scaled to 8 nodes. As with KVM, the reason is system overhead. However, the overhead is not related to a scalability issue, because VMware ESXi is known to be scalable [94]. To verify the cause of the system overhead, we tracked the performance of sorting the 30GB dataset on 8 VMware ESXi VMs (using VMware vSphere Client), and we found that, at some point, the memory required to sort the dataset exceeds the available memory offered by the cluster. This can be observed in Figure 89, which illustrates that active memory (in red, the memory currently consumed by the VMs) is higher than the granted memory (in grey, the memory provided by the hosting hardware) in the 5:05 to 5:10 PM range. Another indication of the system overhead is the latency rate; we tracked the latency of running the 30 GB job on 8 VMs and observed that system latency reaches its peak (Figure 90) when sorting this dataset. Thus, latency impacts the overall performance when the number of VMs increases to 8. A final indication was reported by the OpenStack Dashboard (Figure 91), which showed a warning state of resource usage after creating 8 VMware ESXi instances. In short, VMware ESXi cluster performance declines at 8 VMs because of resource shortage.
Figure 89: Memory overhead when running 30 GB (started at 4.55PM) on 8 VMware ESXi VMs
Figure 90 : System latency reaches its peak (at 12.28PM) when running 30 GB on 8 VMware ESXi
VMs
Figure 91: OpenStack warning statistics about system resource usage
In short, even though TeraSort performance decreases when the number of VMware ESXi VMs increases to 8, the results we got still confirm that the Hadoop VMware ESXi cluster performs better than the Hadoop KVM Cluster and the Hadoop Physical Cluster.
12.2. TestDFSIO Performance
The performance behavior of each cluster changed when running the TestDFSIO benchmark. For all dataset sizes, the KVM cluster proved to have higher performance than the other clusters in both TestDFSIO-Write and TestDFSIO-Read (Figures 81-88). On the other hand, VMware ESXi showed the lowest performance compared to KVM and the physical cluster.
In fact, the good results obtained from running TestDFSIO on KVM are explained by the virtio API. The latter is integrated into the KVM hypervisor to provide an efficient abstraction for I/O operations [95]. Virtio was studied in [96], which showed that it enhances KVM performance for I/O operations; the authors tested the performance of KVM (with the virtio API) on I/O operations and compared it with that of VMware vSphere 5.1. They concluded that KVM with the virtio API achieves I/O rates that are 49% higher than VMware vSphere 5.1.
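To illustrate what virtio means at the VM level, the excerpt below shows a libvirt disk definition attached over the paravirtualized virtio bus instead of an emulated IDE or SCSI controller; this is only a sketch (the image path is hypothetical), since Nova generates such XML itself for KVM guests:
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <!-- hypothetical image path -->
  <source file='/var/lib/nova/instances/instance-00000001/disk'/>
  <!-- bus='virtio' makes the guest use the paravirtualized block driver -->
  <target dev='vda' bus='virtio'/>
</disk>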
When running TestDFSIO, we observed again that the performance of both virtualized
clusters decreases as the number of VMs goes beyond 6 (KVM) and 7 (VMware ESXi).
12.3. Conclusion
In brief, the overall performance of TeraSort and TestDFSIO showed, first, that virtualization can deliver better performance than a physical cluster and, second, that the selection of the underlying virtualization technology can lead to significant improvements when performing HPCaaS. In this research, VMware ESXi proved to have the best performance, especially when running computational jobs (TeraSort).
To deal with the issue of system overhead in virtualized clusters, HPCaaS needs to be run in a cloud environment with a balanced number of VMs. For this research, the reasonable number that provided high performance was 7 VMs for the VMware ESXi cluster and 5 VMs for the KVM cluster.
Part IV: Conclusion
This part summarizes the research objectives and findings and suggests related future work. The bibliography of this report is listed after the conclusion and, finally, a set of appendices (OpenStack Documentation, Hadoop Documentation, Benchmarks Execution and Data Gathering) is provided at the end of the report.
Chapter 13
Conclusion and Future Work
This project aimed at demonstrating the impact of running HPCaaS on different virtualization technologies, namely KVM and VMware ESXi.
For that, we built three main Hadoop clusters: a Hadoop Physical Cluster, a Hadoop Virtualized Cluster with KVM and a Hadoop Virtualized Cluster with VMware ESXi. For the virtualized clusters, we proposed to build the Hadoop cluster on top of the OpenStack platform. On each cluster, we ran two well-known benchmarks: TeraSort and TestDFSIO. Each benchmark was tested on different dataset sizes and on different numbers of machines (from 3 to 8 machines). To ensure the credibility and reliability of the research, we performed three tests for each scenario; for instance, we tested TeraSort for 30GB on each cluster three times and then took the mean to avoid outliers.
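As a sketch of this procedure (the jar and HDFS paths are illustrative, and the timing here uses wall-clock seconds rather than the job time reported by the JobTracker):
for i in 1 2 3; do
  start=$(date +%s)
  hadoop jar ~/hadoop-1.2.1/hadoop-examples-1.2.1.jar terasort /user/hduser/tera-in-30gb /user/hduser/tera-out-$i
  end=$(date +%s)
  echo "run $i: $((end - start)) s" >> terasort-30gb.times  # three runs, then take the mean
done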
The findings of this research clearly demonstrate that virtualized clusters can perform much better than a physical cluster when processing and handling HPC, especially when there is little overhead on the virtualized cluster. We found that the Hadoop VMware ESXi cluster performs better at sorting big datasets (more computation), while the Hadoop KVM cluster performs better at I/O operations.
Finally, this report includes detailed installation guides for OpenStack and Hadoop that will save time and facilitate the work of future students who want to work on related research.
As future work, the possibilities for extending this research go in different directions. The first proposed direction is to conduct the experiments using real HPC applications that can show precisely the impact of virtualization on HPCaaS. The second is to conduct this research using other emerging virtualization technologies such as Xen and Hyper-V. The third is to study the impact of the cloud platform itself on HPCaaS; that is, another study could examine whether replacing OpenStack with another cloud infrastructure leads to better results. Finally, since we obtained positive results about the impact of virtualization on HPCaaS, this research can be extended by integrating its findings into other domains such as the Smart Grid.
Bibliography
[1] J. Gantz and D. Reinsel, “The Digital Universe in 2020: Big Data, Bigger Digital
Shadows, and Biggest Growth in the Far East”, IDC IVIEW, pp. 1-16, 2012
[2] Gartner, Inc., “Hunting and Harvesting in a Digital World”, in Gartner CIO Agenda
Report, pp. 1-8, 2013
[3] Amazon Web Services, “High Performance Computing (HPC) on AWS”,
http://aws.amazon.com/hpc-applications/
[4] J. Gantz and D. Reinsel, “The Digital Universe Decade – Are You Ready?”, IDC IVIEW,
pp. 1-15, 2010
[5] C. Vecchiola, S. Pandey, R. Buyya, "High-Performance Cloud Computing: A View of
Scientific Applications”, in the 10th International Symposium on Pervasive Systems,
Algorithms and Networks I-SPAN, IEEE Computer Society, pp. 4-16, 2009
[6] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", in OSDI, pp. 1-12, 2004
[7] Hadoop: http://hadoop.apache.org/
[8] S. Krishnan, M. Tatineni, and C. Baru, "myHadoop – Hadoop-on-Demand on Traditional HPC Resources", in the National Science Foundation's Cluster Exploratory Program, pp. 1-7, 2011
[9] E. Molina-Estolano, M. Gokhale, C. Maltzahn, J. May, J. Bent, S. Brandt, "Mixing
Hadoop and HPC Workloads on Parallel Filesystems”, in the 4th Annual Workshop on
Petascale Data Storage, pp. 1-5, 2009
[10] C. Cranor, M. Polte, and G. Gibson, “HPC Computation on Hadoop Storage with
PLFS”, Parallel Data Laboratory at Carnegie Mellon University, pp. 1-9, 2012
[11] Y. Xiaotao, L. Aili, and Z. Lin, “Research of High Performance Computing with
Clouds”, in the Third International Symposium on Computer Science and Computational
Technology (ISCSCT), pp. 289-293, 2010
[12] KVM:http://www.linux-kvm.org/page/Main_Page
[13] VMware ESXi: http://www.vmware.com/
[14] D. Boulter, “Simplify Your Journey to the Cloud”, Capgemini and SOGETI, pp. 1-
8, 2010.
[15] P. Mell and T. Grance, “The NIST Definition of Cloud Computing”, National Institute of
Standards and Technology, pp. 1-3, 2011
[16] A. E. Youssef, “Exploring Cloud Computing Services and Applications”, Journal of
Emerging Trends in Computing and Information Sciences, vol. 3, no. 6, pp. 838-
847, 2012
[17] T. Korri, "Cloud Computing: Utility Computing over the Internet", Seminar on Internetworking, pp. 1-5, 2009
[18] ISACA, “Cloud Computing: Business Benefits with Security, Governance and Assurance
Perspectives”, pp. 1-10, 2009
[19] A. T. Velte, T. J. Velte, R. C. Elsenpeter, Cloud Computing: A Practical Approach, 1st ed., USA: McGraw-Hill, 2009
[20] Amazon Web Services: http://aws.amazon.com/
[21] Google Cloud Platform: https://cloud.google.com/
[22] Microsoft Cloud Services: http://www.microsoft.com/enterprise/it- trends/cloud-
computing/default.aspx?Search=true#fbid=33S2kMNT99z
[23] Open Source Software for Building Private and Public Clouds:
http://www.openstack.org
[24] I. Menken, and G. Blokdijk, “Cloud Computing Virtualization Specialist Complete
Certification Kit - Study Guide Book and Online Course”, Emereo Pty Ltd, 2009
[25] M. Portnoy, Virtualization Essentials, John Wiley & Sons, 2012
[26] K. Scarfone, M. Souppaya, and P. Hoffman, “Guide to Security for Full Virtualization
Technologies”, National Institute of Standards and Technology, 2011
[27] D. Dale, “Server and Storage Virtualization with IP Storage”, Storage Networking
Industry Association (SNIA), 2008
[28] D. Marinescu and R. Kroger; “State of the Art in Autonomic Computing and
Virtualization”, Wiesbaden University of Applied Sciences, pp. 1-21,2007
[29] K. Koganti, E. Patnala, S. Narasingu, J. Chaitanya, "Virtualization Technology in Cloud Computing Environment", in International Journal of Emerging Technology and Advanced Engineering, vol. 3, no. 3, 2013
[30] N. Susanta and T. Chiueh, “A Survey on Virtualization Technologies”, Department of
Computer Science at Stony Brook, 2006
[31] Virtualization: A Key to Virtualization World: http://isa.unomaha.edu/wp-
content/uploads/2012/08/Virtualization.pdf
[32] “Virtualization Overview”, white paper, VMware, 2006
[33] N. Alam, “Survey on Hypervisors”, School of Informatics and Computing at Indiana
University, 2011
[34] C. D. Graziano, “A Performance Analysis of Xen and KVM Hypervisors for Hosting the
Xen Worlds Project”, Digital Repository at Iowa State University, pp. 12-39, 2011
[35] N. Yaqub, “Comparison of Virtualization Performance: VMWare and KVM”, Master
Thesis, pp. 30-44, 2012
[36] “How Does Xen Work?”, white paper, Xen, 2009
[37] O. Kulkarmi, N. Xinli, and P. K. Swamy, "Cutting-Edge Perspective of Security Analysis for Xen Virtual Machines", International Journal of Engineering Research and Development, vol. 2, no. 3, pp. 40-45, 2012
[38] T. Hirt, “KVM – The Kernel-based Virtual Machine”, Red Hat Inc., 2010
[39] M. T. Jones, “Anatomy of a Linux Hypervisor”, IBM Corporation, 2009
[40] “VMware ESXi 5.0 Operations Guide”, white paper, VMware, 2011
[41] M. K. Kakhani, S. Kakhani, and S. R. Biradar, “Research Issues in Big Data Analytics,”
Vol. 2, No. 8, pp. 228–232, 2013
[42] C. Hagen, “Big Data and the Creative Destruction of Today’s”, ATKearney, 2012
[43] “Oracle : Big Data for the Enterprise”, white paper, Oracle Corp., 2013
[44] “Oracle NoSQL Database”, white paper, Oracle Corp., 2011
[45] S. Yu, “ACID Properties in Distributed Databases”, Advanced eBusiness Transactions
for B2B-Collaborations, 2009
[46] S. Gilbert and N. Lynch, “Brewer’s conjecture and the feasibility of consistent, available,
partition-tolerant web services,” ACM SIGACT News, vol. 33, no. 2, p. 51, 2002
[47] A. Lakshman, P. Malik, “Cassandra - A Decentralized Structured Storage System”, ACM
SIGOPS Operating Systems Review, vol. 44, no.2, pp. 35-40, 2010
[48] L. George, "Introduction," in HBase: The Definitive Guide, USA: O'Reilly Media, 2011
[49] MongoDB: http://www.mongodb.org/
[50] Apache CouchDB™: http://couchdb.apache.org/
[51] J.Bernstein, K. McMahon, “Computing on Demand—HPC as a Service: High
Performance Computing for High Performance Business”, white paper, Penguin Computing
& McMahon Consulting.
[52] Y. Xiaotao, L. Aili, Z. Lin, “Research of High Performance Computing With Clouds,”
International Symposium Computer Science and Computational Technology, pp. 289–
293, 2010
[53] Self-service POD Portal: http://www.penguincomputing.com/services/hpc-
cloud/pod
[54] Amazon Cloud Storage: http://aws.amazon.com/ec2/reserved-instances/
[55] Amazon Cloud Drive: http://aws.amazon.com/ec2/spot-instances/
[56] Microsoft High Performance Computing for Developers:
http://msdn.microsoft.com/en-us/library/ff976568.aspx
[57] Google Cloud Storage: https://cloud.google.com/products/compute-engine
[58] S. Zhou, B. Kobler, D. Duffy, and T. McGlynn, “Case Study for Running HPC
Applications in Public Clouds”, in Science Cloud '10, 2012
[59] K. R. Jackson, "Performance Analysis of High Performance Computing Applications on the Amazon Web Services Cloud", in Cloud Computing Technology and Science (CloudCom), 2010 IEEE Second International Conference on, pp. 159-168, 2010
[60] E. Walker, “Benchmarking Amazon EC2 for High-Performance Scientific Computing”,
Texas Advanced Computing Center at the University of Texas, pp. 18-23, 2008
[61] J. Ekanayake and G. Fox, “High Performance Parallel Computing with Clouds and Cloud
Technologies”, School of Informatics and Computing at Indiana University, pp. 1-
20, 2009.
[62] Y. Gu and R. L. Grossman, “Sector and Sphere: The Design and Implementation of a
High Performance Data Cloud”, National Science Foundation, pp. 1-11, 2008
[63] A. Gupta and D. Milojicic, "Evaluation of HPC Applications on Cloud", Hewlett-Packard Development Company, pp. 1-6, 2011
[64] C. Evangelinos and C. N. Hill. “Cloud Computing for parallel Scientific HPC
Applications: Feasibility of running Coupled Atmosphere-Ocean Climate Models on
Amazon’s EC2”, Department of Earth, Atmospheric and Planetary Sciences at
Massachusetts Institute of Technology, pp. 1-6, 2009
[65] “Dryad and DryadLINQ for Data Intensive Research”:
http://research.microsoft.com/en-us/collaboration/tools/dryad.aspx
[66] C. Fragni, M. Moreira, D. Mattos, L. Costa, and O. Duarte, “Evaluating Xen, VMware,
and OpenVZ Virtualization Platforms for Network Virtualization”, Federal University of
Rio de Janeiro, pp. 1-1, 2010
[67] N. Yaqub, “Comparison of Virtualization Performance: VMWare and KVM”, Master
Thesis, pp. 30-44, 2012
[68] T. Deshane, M. Ben-Yehuda, A. Shah, and B. Rao, “Quantitative Comparison of Xen
and KVM”, in Xen Summit, pp. 1-3, 2008
[69] J. Hwang, S. Wu, and T. Wood, “A Component-Based Performance Comparison of Four
Hypervisors”, George Washington University and IBM T.J. Watson Research Center, pp.
1-8, 2012
[70] A. J. Younge, R. Henschel, J. T. Brown, G. Laszewski, J. Qiu, and G. C. Fox, “Analysis
of Virtualization Technologies for High Performance Computing Environments”,
Pervasive Technology Institute, pp. 1-8, 2012
[71] Q. Jiang, "Open Source IaaS Community Analysis", Eucalyptus Systems Inc., 2012
[72] I. Voras, M. Orlic, and B. Mihaljević, "An Early Comparison of Commercial and Open-Source Cloud Platforms for Scientific Environments", University of Zagreb Faculty of Electrical Engineering and Computing, Zagreb, Croatia, 2012
[73] E. Caron, L. Toch, and J. Rouzaud-Cornabas, “Performance Comparison between
OpenStack and OpenNebula and the multi-Cloud Architecture: Application to
Cosmology”, Research Report N° 8421, 2013
[74] K. Kostantos, A. Kapsalis, D. Kyriazis, M. Themistocleous, and P. Cunha, “Open-Source
IAAS Fit for Purpose: A Comparison between OpenNebula and OpenStack", International
Journal of Electronic Business Management, Vol. 11, No. 3, 2013
[75] O. Sefraoui, M. Aissaoui, and M. Eleuldj, “Comparison of Multiple IaaS Cloud Platform
Solutions”, Mohamed I University, 2012
[76] "Donnie Berkholz's Story of Data": http://redmonk.com/dberkholz/2012/03/26/nosql-database-popularity-according-to-jaspersoft/
[77] E. Dede, M. Govindaraju, D. Gunter, R. Canon, and L. Ramakrishnan, “Performance
Evaluation of a MongoDB and Hadoop Platform for Scientific Data Analysis”, SUNY
Binghamton and Lawrence Berkeley National Lab, 2012
[78] J. H. Lee, “Log Analysis System Using Hadoop and MongoDB”, CUBRID, 2012.
[79] OpenStack: http://www.openstack.org/
[80] “OpenStack Training Guides”, white paper, OpenStack Foundation, 2013
[81] A. Sehgal, “Introduction to OpenStack: Running a Cloud Computing Infrastructure with
Openstack”, in the 6th International Conference on Autonomous Infrastructure,
Management and Security, University of Luxembourg, 2012
[82] K. Pepple, Deploying OpenStack, O'Reilly Media, 2011
[83] OpenStack, “Companies Supporting the OpenStack Foundation”,
http://www.openstack.org/foundation/companies/
[84] G. Sasiniveda and N. Revathi, “Data Analysis using Mapper and Reducer with Optimal
Configuration in Hadoop", International Journal of Computer Trends and Technology,
vol. no. 3, 2013
[85] D. Borthakur, “The Hadoop Distributed File System: Architecture and Design”, The
Apache Software Foundation, 2007
[86] T. White, Hadoop: The Definitive Guide, O'Reilly Media, 2010
[87] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop Distributed File
System”, Sunnyvale, 2010
[88] H. Herodotu, “Hadoop Performance Models”, Computer Science Department at Duke
University, 2011
[89] Blogclub Tworkshops,”Hadoop and MapReduce”,
http://www.alex-hanna.com/tworkshops/lesson-5-hadoop-and-mapreduce/
[90] M. G. Noll, "Benchmarking and Stress Testing an Hadoop Cluster with TeraSort, TestDFSIO & Co.", 2011
[91] Apache Hadoop, “TestDFSIO Apache Hadoop Code Source”,
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-
mapreduce- client-jobclient/0.23.9/org/apache/hadoop/fs/TestDFSIO.java
[92] F. Rahma, T. Adji, Widyawan, "Scalability Analysis of KVM-Based Private Cloud for IaaS", in International Journal of Cloud Computing and Services Science, vol. 2, no. 4, pp. 288-295, 2013
[93] T. Deshane, M. Ben-Yehuda, A. Shah, B. Rao, "Quantitative Comparison of Xen and KVM", in Journal of Physics: Conference Series, 2010
[94] “Virtualizing Resource intensive Applications”, white paper, VMware, 2009
[95] "Scale-up Virtualization with Red Hat Enterprise Linux 5.4 on an HP ProLiant DL785 G6", white paper, Red Hat, 2009
[96] "KVM Virtualized I/O Performance", white paper, IBM & Red Hat, 2013
Appendix A: OpenStack with KVM Configuration
Pre-configuration
1. Update your machine
sudo apt-get update
sudo apt-get upgrade
2. Install bridge-utils
sudo apt-get install bridge-utils
3. NTP Server
3.1. Install the NTP Server
sudo apt-get install ntp
3.2. Open the file /etc/ntp.conf
Add the following lines to make sure that the time on the server stays in sync with an external
server.
server ntp.ubuntu.com
server 127.127.1.0
fudge 127.127.1.0 stratum 10
3.3.Restart NTP Service
sudo service ntp restart
4. Network Configuration
As public IP address changes periodically, you need to set a static IP address that will be used
in OpenStack configuration. In this case, we have two network interfaces eth0 and eth1. Eth0
was chosen as the network management; as a result, this interface was set to static IP address
(in this guide, we used 10.60.62.12 as an IP management).
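On Ubuntu 12.04 this can be done in /etc/network/interfaces; the following is a minimal sketch (the netmask, gateway and DNS values are assumptions that must be adapted to your own network):
auto eth0
iface eth0 inet static
    address 10.60.62.12
    netmask 255.255.255.0
    gateway 10.60.62.1
    dns-nameservers 8.8.8.8
Then apply the change:
sudo /etc/init.d/networking restart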
Hypervisor Configuration
1. KVM Configuration
If you want to install OpenStack with the KVM hypervisor, then you need to follow these steps:
1.1.Check if your machine supports virtualization
ouidad@ouidad:~$ egrep -c '(vmx|svm)' /proc/cpuinfo
8
ouidad@ouidad:~$
If the output is 0, then your machine does not support virtualization; if the output is greater than 0, the machine supports virtualization technology.
1.2. Check if KVM can be supported
ouidad@ouidad:~$ kvm-ok
INFO: /dev/kvm exists
KVM acceleration can be used
ouidad@ouidad:~$
If the output is as shown above, then your machine supports KVM virtualization.
1.3.Install KVM and libvirt
sudo apt-get install kvm libvirt-bin
1.4.KVM configuration
You can check the following website to configure the necessary files for KVM support:
https://help.ubuntu.com/community/KVM/Installation
1.5 Reboot your machine
OpenStack Databases Configuration
1. MySQL
1.1.Install Mysql server and related packages
sudo apt-get install mysql-server python-mysqldb
1.2.Create the root password for MySQL
The password used in this guide is "secret"
1.3.Open /etc/mysql/my.cnf
Change the bind address from bind-address=127.0.0.1 to bind-address = 0.0.0.0
1.4. Restart MySQL server
sudo restart mysql
2. Nova Database
2.1. Create Nova database “nova”
sudo mysql -uroot -psecret -e 'CREATE DATABASE nova;'
2.2.Create nova user named “novadbadmin”
sudo mysql -uroot -psecret -e 'CREATE USER novadbadmin;'
2.3.Grant all privileges for novadbadmin on the database "nova"
sudo mysql -uroot -psecret -e "GRANT ALL PRIVILEGES ON nova.* TO 'novadbadmin'@'%';"
2.4. Create a password for the user "novadbadmin"; the password in this case is "novasecret"
sudo mysql -uroot -psecret -e "SET PASSWORD FOR 'novadbadmin'@'%' = PASSWORD ('novasecret');"
3. Glance Database
3.1.Create glance database named “glance”
sudo mysql -uroot -psecret -e 'CREATE DATABASE glance;'
3.2.Create a user named “glancedbadmin”
sudo mysql -uroot -psecret -e 'CREATE USER glancedbadmin; '
3.3. Grant all privileges for glancedbadmin on the database "glance"
sudo mysql -uroot -psecret -e "GRANT ALL PRIVILEGES ON glance.* TO 'glancedbadmin'@'%';"
3.4. Create a password for the user "glancedbadmin"
sudo mysql -uroot -psecret -e "SET PASSWORD FOR 'glancedbadmin'@'%' = PASSWORD('glancesecret');"
4. Keystone Database
4.1.Create a database named “keystone”
sudo mysql -uroot -psecret -e 'CREATE DATABASE keystone;'
4.2.Create a user named “keystonedbadmin”.
sudo mysql -uroot -psecret -e 'CREATE USER keystonedbadmin;'
4.3. Grant all privileges for keystonedbadmin on the database "keystone".
sudo mysql -uroot -psecret -e "GRANT ALL PRIVILEGES ON keystone.* TO 'keystonedbadmin'@'%';"
4.4.Create a password for the user "keystonedbadmin"
sudo mysql -uroot -psecret -e "SET PASSWORD FOR 'keystonedbadmin'@'%' = PASSWORD('keystonesecret');"
Keystone Configuration
1. Install Keystone
sudo apt-get install keystone python-keystone python-keystoneclient
2. Open /etc/keystone/keystone.conf
Make the following changes:
Change admin_token = ADMIN to admin_token = admin
Change connection = sqlite:////var/lib/keystone/keystone.db
to connection = mysql://keystonedbadmin:[email protected]/keystone
3. Restart keystone
sudo service keystone restart
4. Create the keystone schema in the MySQL database
sudo keystone-manage db_sync
5. Export environment variables
export SERVICE_ENDPOINT="http://localhost:35357/v2.0"
export SERVICE_TOKEN=admin
Note: you can also add these variables to ~/.bashrc so as to avoid exporting them each time.
6. Create tenants
Create admin and service tenants
keystone tenant-create --name admin
keystone tenant-create --name service
7. Create users
Create OpenStack users by executing the following commands. In this case, we are creating
four users - admin, nova, glance and swift
keystone user-create --name admin --pass admin --email [email protected]
keystone user-create --name nova --pass nova --email [email protected]
keystone user-create --name glance --pass glance --email [email protected]
keystone user-create --name swift --pass swift --email [email protected]
8. Create roles
Create the roles by executing the following commands. In this case, we are creating two roles
- admin and Member.
keystone role-create --name admin
keystone role-create --name Member
Sample output:
9. List tenants, users and roles
keystone tenant-list
keystone user-list
keystone role-list
Sample output:
10. Adding roles to users in tenants
10.1. Add the role of "admin" to the user "admin" of the tenant "admin"
keystone user-role-add --user 4e77ea930bf944efadfb79f5fc8789a2 --role 8af19783ac784e0397e0346c7f1ec --tenant_id ee14adbd1ac84445921819cf7a5b7f5f
10.2. Add the role of "admin" to the user "nova" of the tenant "service"
keystone user-role-add --user 5ce6dd40bf2249e5ab35a95da63d7930 --role 8af19783ac784e0397e0346c7f1ec --tenant_id 11824c8169924b098f41dae1fa726c6
10.3. Add the role of "admin" to the user "glance" of the tenant "service"
keystone user-role-add --user 9967843ee4aa421189f3382849700cad --role 8af19783ac784e0397e0346c7f1ec --tenant_id 11824c8169924b098f41dae1fa726c6
10.4. Add the role of "admin" to the user "swift" of the tenant "service"
keystone user-role-add --user 24979d9ac31e4b83a58a89c1ad842ffa --role 8af19783ac784e0397e0346c7f1ec --tenant_id 11824c8169924b098f41dae1fa726c6
10.5. The "Member" role is used by Horizon and Swift, so add the "Member" role accordingly (user: admin, role: Member, tenant: admin)
keystone user-role-add --user 4e77ea930bf944efadfb79f5fc8789a2 --role c2860fd6f3fd4538a07161bdb2691f60 --tenant_id ee14adbd1ac84445921819cf7a5b7f5f
11. Create services
Create the required services with which the users can authenticate; nova-compute, nova-volume, glance, swift, keystone and ec2 are some of the services that we create.
11.1.Nova Compute Service
keystone service-create --name nova --type compute --description 'OpenStack Compute Service'
11.2.Volume Service
keystone service-create --name volume --type volume --description 'OpenStack Volume Service'
11.3.Image Service
keystone service-create --name glance --type image --description 'Openstack Image Service'
11.4. Object Store Service
keystone service-create --name swift --type object_store --description 'Openstack Storage Service'
11.5.Identity Service
keystone service-create --name keystone --type identity --description 'Openstack Identity Service'
11.6.EC2 Service
keystone service-create --name ec2 --type ec2 --description 'EC2 Service'
12. List keystone service list
keystone service-list
Sample output:
13. Create endpoints
Create endpoints for each of the services created above (the service id is displayed using the keystone service-list command).
13.1. Endpoint for the identity service
keystone endpoint-create --region RegionOne --service_id 207bf81ddfe1481aa242148f246d091f --publicurl http://localhost:5000/v2.0 --internalurl http://localhost:5000/v2.0 --adminurl http://localhost:35357/v2.0
13.2. Endpoint for the nova service
keystone endpoint-create --region RegionOne --service_id 72b9d125eaa84aaf9c8ce734027eea21 --publicurl 'http://localhost:8774/v2/%(tenant_id)s' --internalurl 'http://localhost:8774/v2/%(tenant_id)s' --adminurl 'http://localhost:8774/v2/%(tenant_id)s'
13.3. Endpoint for the image service
keystone endpoint-create --region RegionOne --service_id 581f6a8e337642a0a39090ffe6947e2d --publicurl 'http://localhost:9292/v1' --internalurl 'http://localhost:9292/v1' --adminurl 'http://localhost:9292/v1'
13.4. Define the EC2 compatibility service
keystone endpoint-create --region RegionOne --service_id 4b1619d4f9f34cc9aaf473282c2340f0 --publicurl http://localhost:8773/services/Cloud --internalurl http://localhost:8773/services/Cloud --adminurl http://localhost:8773/services/Admin
13.5. Endpoint for the volume service
keystone endpoint-create --region RegionOne --service_id 6afe27a1768b403b9521418a87646ec4 --publicurl 'http://localhost:8776/v1/%(tenant_id)s' --internalurl 'http://localhost:8776/v1/%(tenant_id)s' --adminurl 'http://localhost:8776/v1/%(tenant_id)s'
13.6. Endpoint for the object storage service
keystone endpoint-create --region RegionOne --service_id 2ec242420a114671a4fe15e745b45d3f --publicurl 'http://localhost:8888/v1/AUTH_%(tenant_id)s' --adminurl 'http://localhost:8888/v1' --internalurl 'http://localhost:8888/v1/AUTH_%(tenant_id)s'
Glance Configuration
1. Install Glance packages
sudo apt-get install glance glance-api glance-client glance-common glance-registry python-glance
2. Open /etc/glance/glance-api-paste.ini
Change the following lines:
admin_tenant_name = %SERVICE_TENANT_NAME%
admin_user = %SERVICE_USER%
admin_password = %SERVICE_PASSWORD%
By:
admin_tenant_name = service
admin_user = glance
admin_password = glance
3. Now open /etc/glance/glance-registry-paste.ini
Change the following lines:
admin_tenant_name = %SERVICE_TENANT_NAME%
admin_user = %SERVICE_USER%
admin_password = %SERVICE_PASSWORD%
By:
admin_tenant_name = service
admin_user = glance
admin_password = glance
4. Open the file /etc/glance/glance-registry.conf
Change the line which contains the option "sql_connection =" to this:
sql_connection = mysql://glancedbadmin:[email protected]/glance
Add the following lines at the end of the file as to allow glance to use keystone for
authentication.
[paste_deploy]
flavor = keystone
5. Open /etc/glance/glance-api.conf
Add the following lines at the end of the file.
[paste_deploy]
flavor = keystone
6. Create glance schema in MySQL database
sudo glance-manage version_control 0
sudo glance-manage db_sync
7. Restart glance-api and glance-registry
sudo restart glance-api
sudo restart glance-registry
8. Export the following environment variables.
export SERVICE_TOKEN=admin
export OS_TENANT_NAME=admin
export OS_USERNAME=admin
export OS_PASSWORD=admin
export OS_AUTH_URL="http://localhost:5000/v2.0/"
export SERVICE_ENDPOINT=http://localhost:35357/v2.0
Note: you can add these variables to ~/.bashrc.
9. Check if glance was successfully configured
glance index
The above command displays nothing; if you get an output, check the troubleshooting section.
Nova Configuration
1. Install Nova packages
sudo apt-get install nova-api nova-cert nova-compute nova-compute-kvm nova-doc nova-network nova-objectstore nova-scheduler nova-volume rabbitmq-server novnc nova-consoleauth
2. Edit the /etc/nova/nova.conf file
--dhcpbridge_flagfile=/etc/nova/nova.conf
--dhcpbridge=/usr/bin/nova-dhcpbridge
--logdir=/var/log/nova
--state_path=/var/lib/nova
--lock_path=/run/lock/nova
--allow_admin_api=true
--use_deprecated_auth=false
--auth_strategy=keystone
--scheduler_driver=nova.scheduler.simple.SimpleScheduler
--s3_host=10.60.62.12
--ec2_host=10.60.62.12
--rabbit_host=10.60.62.12
--cc_host=10.60.62.12
--nova_url=http://10.60.62.12:8774/v1.1/
--routing_source_ip=10.60.62.12
--glance_api_servers=10.60.62.12:9292
--image_service=nova.image.glance.GlanceImageService
--iscsi_ip_prefix=192.168.4
--sql_connection=mysql://novadbadmin:[email protected]/nova
--ec2_url=http://10.60.62.12:8773/services/Cloud
--keystone_ec2_url=http://10.60.62.12:5000/v2.0/ec2tokens
--api_paste_config=/etc/nova/api-paste.ini
--libvirt_type=kvm
--libvirt_use_virtio_for_bridges=true
--start_guests_on_host_boot=true
--resume_guests_state_on_host_boot=true
--novnc_enabled=true
--novncproxy_base_url=http://10.60.62.12:6080/vnc_auto.html
--vncserver_proxyclient_address=10.60.62.12
--vncserver_listen=10.60.62.12
--network_manager=nova.network.manager.FlatDHCPManager
--public_interface=eth0
--flat_interface=eth0
--flat_network_bridge=br100
--network_size=32
--flat_injected=False
--force_dhcp_release
--iscsi_helper=tgtadm
--connection_type=libvirt
--root_help
Important Note: "10.60.62.12" has to be replaced by your local machine's public IP address. Moreover, you need to change the "libvirt_type" variable to the hypervisor you are actually using.
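For example, if the machine does not support hardware virtualization (the egrep check above returns 0), plain QEMU emulation can be selected instead; kvm and qemu are both accepted values for this flag:
--libvirt_type=qemu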
3. Change the ownership of the /etc/nova folder and permissions for
/etc/nova/nova.conf
sudo chown -R nova:nova /etc/nova
sudo chmod 644 /etc/nova/nova.conf
4. Open /etc/nova/api-paste.ini
Change the following configuration
admin_tenant_name = %SERVICE_TENANT_NAME%
admin_user = %SERVICE_USER%
admin_password = %SERVICE_PASSWORD%
By:
admin_tenant_name = service
admin_user = nova
admin_password = nova
5. Create nova schema in the MySQL database.
sudo nova-manage db sync
6. Provide a range of IPs to be associated with the instances.
sudo nova-manage network create private --fixed_range_v4=10.60.62.0/27 --bridge=br100 --bridge_interface=eth0 --network_size=32
7. Export the following environment variables.
export OS_TENANT_NAME=admin
export OS_USERNAME=admin
export OS_PASSWORD=admin
export OS_AUTH_URL="http://localhost:5000/v2.0/"
Note: you can add the environment variables at the end of ~/.bashrc file.
8. Manage nova volumes
Create a Physical Volume:
sudo pvcreate /dev/sda3
Create a Volume Group named nova-volumes:
sudo vgcreate nova-volumes /dev/sda3
Note: to create a physical volume, you first need to create a primary partition (in this guide, the partition name is /dev/sda3). In this case, you can follow these steps:
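A minimal sketch of creating such a partition with fdisk (assuming the disk is /dev/sda and partition number 3 is free; adapt both to your machine):
sudo fdisk /dev/sda
# inside fdisk: n (new partition), p (primary), 3 (partition number),
# accept the suggested start/end sectors, then w (write the table and exit)
sudo partprobe /dev/sda    # re-read the partition table without rebooting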
9. Restart nova services
sudo service libvirt-bin restart
sudo service nova-network restart
sudo service nova-compute restart
sudo service nova-api restart
sudo service nova-objectstore restart
sudo service nova-scheduler restart
sudo service nova-volume restart
sudo service nova-consoleauth restart
10. Check if nova services are running
sudo nova-manage service list
Sample output:
Note: if the state of a given service is not :-), then try to run the following commands in separate terminals:
sudo /usr/bin/nova-compute
sudo /usr/bin/nova-network
…
OpenStack Dashboard
1. Install OpenStack Dashboard
sudo apt-get install openstack-dashboard
2. Restart apache service
sudo service apache2 restart
3. Open a browser and enter the IP address of your machine
If you followed this tutorial, then the possible logins are:
Username: admin Password: admin
Username: nova Password: nova
Username: glance Password: glance
Username: swift Password: swift
Figure 1: Dashboard authentication page
Image Configuration
In order to create an image, you can access the following links to download the needed images:
http://smoser.brickies.net/ubuntu/ttylinux-uec/old/
http://uec-images.ubuntu.com/
Example: Ubuntu Precise i386 Image
1. Download Ubuntu Precise Version (12.04 LTS)
Download the Ubuntu Precise version (precise-server-cloudimg-i386.tar.gz) from http://uec-images.ubuntu.com/precise/current/, using the following command:
wget http://uec-images.ubuntu.com/precise/current/precise-server-cloudimg-i386.tar.gz
2. Extract the downloaded package
sudo tar -xvzf precise-server-cloudimg-i386.tar.gz
The extracted files are:
precise-server-cloudimg-i386-vmlinuz-virtual
precise-server-cloudimg-i386-loader
precise-server-cloudimg-i386.img
3. Add the Ubuntu image into glance database
3.1. Add the kernel file
glance add name="precise32-kernel" disk_format=aki container_format=aki < precise-
server-cloudimg-i386-vmlinuz-virtual
3.2. Add the loader file
glance add name="precise32-ramdisk" disk_format=ari container_format=ari < precise-
server-cloudimg-i386-loader
3.3.Add the image file
Get the id of both the kernel and the loader using the glance index command:
glance index
Sample output:
In this case, the id of Ubuntu kernel is 8386c173-cd90-4c7d-8540-da484abd0c1a and the id
of Ubuntu loader is 5e0f8ceb-8fcd-4fc7-9b2b-1bcd3e3d8c9d.
Now, add the image file using the kernel and loader id:
glance add name="Precise32_image" disk_format=ami container_format=ami
kernel_id=8386c173-cd90-4c7d-8540-da484abd0c1a ramdisk_id=5e0f8ceb-8fcd-4fc7-
9b2b-1bcd3e3d8c9d < precise-server-cloudimg-i386.img
4. Using Horizon, you can find the uploaded image (Precise32_image)
Figure 2: List of OpenStack images
Keypair Configuration
1. Generate a key for your local machine
If you did not generate a key for your local machine, then run the following command:
ssh-keygen -t rsa -P ""
2. Create keypair
The following command can be used to either generate a new keypair, or to upload an existing
public key.
cd .ssh
nova keypair-add --pub_key id_rsa.pub mykey
nova keypair-list
3. List keypairs
nova keypair-list
Sample output:
4. Check the created keypair
Confirm that the uploaded keypair matches the local key by checking your key's fingerprint
with the ssh-keygen command:
ssh-keygen -l -f ~/.ssh/id_rsa.pub
Sample output:
Note: You can use OpenStack Dashboard to perform all operations related to keypair
generation.
Security Groups Configuration
1. List default security groups
nova secgroup-list
Sample output:
2. Enable access to TCP port 22
Allow access to port 22 from all IP addresses (specified in CIDR notation as 0.0.0.0/0) with
the following command:
nova secgroup-add-rule default tcp 22 22 0.0.0.0/0
Sample output:
3. Enable pinging to virtual machine instance by allowing ICMP traffic
nova secgroup-add-rule default icmp -1 -1 0.0.0.0/0
Sample output:
Flavors Configuration
1. Flavor overview
Flavors are used to specify the properties of an instance. The following table lists the arguments needed to define a flavor.
ID: A unique numeric id.
Name: A descriptive name. The xx.size_name form is conventional but not required, though some third-party tools may rely on it.
Memory_MB: Virtual machine memory in megabytes.
Disk: Virtual root disk size in gigabytes. This is an ephemeral disk the base image is copied into. When booting from a persistent volume it is not used. The "0" size is a special case which uses the native base image size as the size of the ephemeral root volume.
Ephemeral: Specifies the size of a secondary ephemeral data disk. This is an empty, unformatted disk and exists only for the life of the instance.
Swap: Optional swap space allocation for the instance.
VCPUs: Number of virtual CPUs presented to the instance.
RXTX_Factor: Optional property that allows created servers to have a different bandwidth cap than that defined in the network they are attached to. This factor is multiplied by the rxtx_base property of the network. The default value is 1.0 (that is, the same as the attached network).
Is_Public: Boolean value defining whether the flavor is available to all users or private to the tenant it was created in. Defaults to True.
extra_specs: Additional optional restrictions on which compute nodes the flavor can run on. This is implemented as key/value pairs that must match against the corresponding key/value pairs on compute nodes. Can be used to implement things like special resources (such as flavors that can only run on compute nodes with GPU hardware).
Table 1: Flavor arguments
2. List available flavors
Use nova flavor-list command to view the list of available flavors:
nova flavor-list
3. Create a flavor
Create a flavor with the following suggested specifications:
sudo nova-manage instance_type create --name=m1.cluster --memory=975 --cpu=2 --root_gb=100 --ephemeral_gb=10 --flavor=8
Instances Management
Instances can be created either by using the dashboard interface or using command line.
1. Create instances with no specifications
nova boot --flavor ID --image Image-ID MyInstanceName
2. Create an instance with an associated keypair
To associate a key with an instance on boot, add --key_name Mykey to your command line:
nova boot --image Image-ID --flavor ID --key_name Mykey MyInstanceName
3. Create an instance with a security group
It is also possible to add and remove security groups when an instance is running.
nova add-secgroup MyInstanceName MysecurityGroup
nova remove-secgroup MyInstanceName MysecurityGroup
4. Create an instance with a given keypair and security group
nova boot --flavor ID --image Image-ID --key_name Mykey MyInstanceName
5. Display instance details
nova show MyInstanceName
6. Access an instance
You can connect to an instance console via VNC. The latter can be accessed through the Horizon interface, the command line or other tools such as virt-manager.
Using command line
nova get-vnc-console host_name novnc
Sample output:
The link displayed above can be used to access the instance console.
Using virt-manager
If you cannot connect to the VNC console, you can use virt-manager; in this case, use the following command to download the virt-manager package:
sudo apt-get install virt-manager
To have access to the virt-manager interface, run the following command:
sudo virt-manager
Using local machine terminal
If the instance you created asks for a login name and password, you can access the instance through your local machine. In this case, you need to follow these steps:
ssh-copy-id -i $HOME/.ssh/id_rsa.pub username@Instance_ip_address
For Ubuntu the user name is root or ubuntu.
Example: if you want to access an Ubuntu instance with IP address 10.60.62.8, you can run the following command:
ssh-copy-id -i $HOME/.ssh/id_rsa.pub [email protected]
Sample output:
7. Connecting Instances
The following steps can be followed to connect OpenStack instances (assumption: we need to connect an instance with hostname host1 to another instance with hostname host2):
Generate the keypair on host1 & host2 to run ssh (ssh-keygen -t rsa)
On host2
o Check the "sshd_config" file on that instance (it is located at /etc/ssh/sshd_config)
o Uncomment the following two lines in sshd_config
RSAAuthentication yes
PubkeyAuthentication yes
o Append the contents of the id_rsa.pub file of host1 to the authorized_keys file of host2, as shown below
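The append step can be done in a single command from host1; this is a sketch assuming the same user account exists on both instances and that login to host2 is already possible:
hduser@host1:~$ cat ~/.ssh/id_rsa.pub | ssh hduser@host2 'cat >> ~/.ssh/authorized_keys'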
8. Delete an instance
nova delete MyInstanceName
OpenStack Troubleshooting
Glance Exceptions
1. Exception 1: “glance index “ error
ouidad@ouidad:~$ glance index
Failed to show index. Got error:
There was an error connecting to a server
Details: [Errno 111] Connection refused
Solution
In most cases, the above exception is due to the glance-api service not running. Therefore, you need to check why the glance-api service is not running.
In our case, the error was in glance-api-paste.ini, so you need to open that file to fix it:
ouidad@ouidad:~$ sudo gedit /etc/glance/glance-api-paste.ini
After fixing the error, you need to restart the glance-api service:
ouidad@ouidad:~$ sudo restart glance-api
Nova Exceptions
1. Exception 1: nova services not running in "sudo nova-manage service list"
When running "sudo nova-manage service list", if a service has the "xxx" state, then you need to run that service in a separate terminal.
Solution
For example, if nova-compute has “xxx” state, you need to run the following command:
sudo /usr/bin/nova-compute
The same solution can be applied for other services:
sudo /usr/bin/nova-network
sudo /usr/bin/nova-scheduler
sudo /usr/bin/nova-consoleauth
sudo /usr/bin/nova-cert
sudo /usr/bin/nova-volume
2. Exception 2: “sudo nova-manage service list” doesn’t display the expected output
ouidad@ouidad:~$ sudo nova-manage service list
Command failed, please check log for more info
2013-09-02 19:46:28.050 15999 CRITICAL nova [-] No module named
quantumclient.common
Solution
ouidad@ouidad:~$ sudo apt-get install python-quantumclient
3. Exception 3: Unable to start nova-compute: "libvirtError: operation failed: domain 'instance-…' already exists with uuid …"
Sample output:
Solution
You need to log in to the nova database and delete the instance id from the instances table. Moreover, you need to delete the instance id from related tables such as security_group_instance_association and instance_info_caches.
Example: we want to delete an instance with id = 3.
From the tables displayed above, delete the instance with id = 3 from security_group_instance_association and instance_info_caches, as well as from the virtual_interfaces table.
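A sketch of the corresponding SQL is shown below; the instance_id/id column names follow the Essex-era nova schema and should be verified with DESCRIBE before deleting anything:
mysql -unovadbadmin -pnovasecret nova
DELETE FROM security_group_instance_association WHERE instance_id = 3;
DELETE FROM instance_info_caches WHERE instance_id = 3;
DELETE FROM virtual_interfaces WHERE instance_id = 3;
DELETE FROM instances WHERE id = 3;  -- delete from the main table last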
Dashboard Exceptions
1. Exception 1: “Unable to retrieve images/instances…”
Sample output
Solution
If you get one of these exceptions, the only way we found to solve the problem is to drop the endpoints and re-create them, as sketched below. Then, you need to reboot your local machine.
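A sketch of that recovery (the endpoint id is a placeholder shown by endpoint-list; re-create each endpoint afterwards with the same keystone endpoint-create commands listed in the Keystone Configuration section):
keystone endpoint-list                  # note the id of each endpoint
keystone endpoint-delete <endpoint-id>  # repeat for every endpoint to drop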
References for Appendix A
http://docs.openstack.org/folsom/openstack-ops/content/flavors.html
http://www.hastexo.com/resources/docs/installing-openstack-essex-20121-ubuntu-1204-precise-pangolin
http://docs.openstack.org/essex/openstack-compute/starter/content/Introduction_to_OpenStack_and_its_components-d1e59.html
Appendix B. OpenStack with VMware ESXi
Configuration
1. Downloading VMware ESXi
Download VMware ESXi (vSphere 5.5) from:
https://my.vmware.com/web/vmware/evalcenter?p=vsphere-55
2. Installing VMware ESXi
After burning the VMware ESXi software onto a CD, install it on top of your hardware.
3. Download vSphere Client
To manage your VMware ESXi host:
Install vSphere Client on another machine running Windows OS.
After opening the software, log in to the VMware ESXi machine with your username and password.
After login, you will have access to the VMware ESXi machine's resources. In our case, the VMware ESXi machine has the IP address 10.50.1.166 (Figure 1)
Figure 1: vSphere Client interface: access to VMware ESXi 10.50.1.166
4. Create “Openstack VM”
Create a virtual machine on top of VMware ESXi using vSphere Client. The VM will be used to host OpenStack.
Create the VM with an Ubuntu Precise LTS 12.04 64-bit guest.
5. Download VMware vSphere Web Services SDK
Download the appropriate SDK from: http://www.vmware.com/support/developer/vc-sdk/
Copy the SDK to the /openstack/vmware folder.
Make sure that the WSDL is available by checking that the path /openstack/vmware/SDK/wsdl/vim25/vimService.wsdl exists; this path will be specified in nova.conf.
6. Configure OpenStack on “VMware ESXi”
You need to follow the same steps provided in OpenStack –KVM documentation.
The main difference here is the nova.conf configuration.
7. Nova.conf Configuration
In this case, you need to specify the compute_driver, host_ip (the VMware ESXi machine), host_username, host_password and wsdl_location (for the SDK) as follows:
[vmware]
host_password = 12357890
host_username = root
host_ip = 10.50.1.166
compute_driver = vmwareapi.VMwareESXDriver
wsdl_location = file:///openstack/vmware/SDK/wsdl/vim25/vimService.wsdl
8. Dashboard access
Access OpenStack resources from Horizon using the IP address of the "Openstack VM".
9. Make sure that your OpenStack is installed with VMware ESXi
This is done from the Horizon interface.
Example:
Figure 2: OpenStack with VMware ESXi hypervisor
10. Manage OpenStack with VMware ESXi
After configuring OpenStack, you can now download images and create instances.
Each time you create an instance, it will be displayed in the vSphere Client interface as depicted in Figure 1.
Concerning images, you need to add images with the vmdk extension. You can find them on the following website (you can download them from the free images section):
http://stacklet.com
Figure 3: access to VMs (OpenStack instances) through vSphere Client interface
References
http://docs.openstack.org/trunk/config-reference/content/vmware.html
https://my.vmware.com/web/vmware/evalcenter?p=vsphere-55
https://www.vmware.com/support/developer/vc-sdk/
Appendix C: Hadoop Configuration
Prerequisites for Installing Hadoop
1. Adding a dedicated Hadoop system user (all machines)
Create a Hadoop user account (hduser) for running Hadoop using the following commands:
ouidad@host1:~$ sudo addgroup hadoop
ouidad@host1:~$ sudo adduser --ingroup hadoop hduser
2. Configuring SSH
2.1. To manage the cluster's nodes, Hadoop requires SSH access. In this case, you need to generate an SSH key for the hduser user.
ouidad@host1:~$ su hduser
Password:
hduser@host1:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
44:f5:7b:85:32:f7:69:c7:d7:fc:75:38:63:32:be:d7 hduser@host1
The key's randomart image is:
+--[ RSA 2048]----+
| ... |
| . . . |
| . + o .|
| . = *o|
| S + *oX|
| . =.o*|
| . ..|
| .. E|
| .. |
+-----------------+
2.2. The RSA key pair created above has an empty password so that Hadoop can interact directly with its nodes without prompting. Enable SSH access to your local machine with this newly created key:
hduser@host1:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
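You can then verify that passwordless SSH to the local machine works (accept the host key fingerprint when prompted on the first connection):
hduser@host1:~$ ssh localhost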
3. Install JAVA
3.1. Download jdk-6u45-linux-i586.bin (for 32-bit architectures) from:
http://www.oracle.com/technetwork/java/javasebusiness/downloads/java-archive-downloads-javase6-419409.html
3.2.JDK Installation
chmod +x jdk-6u45-linux-i586.bin
sudo ./jdk-6u45-linux-i586.bin
3.3. Make sure that the JDK is installed
ouidad@host1:~$ java -version
java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
Java HotSpot(TM) Client VM (build 19.1-b02, mixed mode, sharing)
3.4. Move the JDK folder from its current location to /home/hduser:
ouidad@host1:~$ sudo cp -r ~/Downloads/jdk1.6.0_45 /home/hduser
3.5. Change the JDK ownership
ouidad@host1:~$ sudo chown -R hduser:hadoop /home/hduser/jdk1.6.0_45/
Installing Hadoop
1. Download Hadoop version 1.2.1 (hadoop-1.2.1.tar.gz) from
http://www.apache.org/dyn/closer.cgi/hadoop/core
2. Extract the downloaded archive
ouidad@host1:~/Downloads$ tar -zxvf hadoop-1.2.1.tar.gz
3. Move the extracted folder (hadoop-1.2.1) from the Downloads folder to /home/hduser
ouidad@host1:~/Downloads$ sudo cp hadoop-1.2.1 /home/hduser/ -r
4. Change the ownership
ouidad@host1:~/Downloads$ sudo chown -R hduser:hadoop /home/hduser/hadoop-1.2.1
5. Bashrc file configuration (All machines)
First log in to the hduser account, then run the following command:
hduser@host1:~$ sudo gedit ~/.bashrc
At the end of the file, add the following lines:
export JAVA_HOME=~/jdk1.6.0_45
export PATH=$JAVA_HOME/bin:$PATH
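Reload the file and check that the variables are set; a quick sanity check:
hduser@host1:~$ source ~/.bashrc
hduser@host1:~$ echo $JAVA_HOME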
6. Hdfs folder creation (All machines)
First log in to the hduser account, then create the following folder:
hduser@host1:~$ sudo mkdir -p /home/hduser/hdfs/temp
hduser@host1:~$ sudo chown hduser:hadoop /home/hduser/hdfs/temp
hduser@host1:~$ sudo chmod 775 /home/hduser/hdfs/temp/
7. Hadoop Files Configuration (Slave machines)
Move to the ~/hadoop-1.2.1/conf folder to change the following files.
7.1. hadoop-env.sh File
hduser@host1:~/hadoop-1.2.1/conf$ sudo gedit hadoop-env.sh
Replace the following two lines:
# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
with the following (the second line is now uncommented):
# The java implementation to use. Required.
export JAVA_HOME=~/jdk1.6.0_45
Then, add at the end of the file:
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
7.2. core-site.xml File
hduser@host1:~/hadoop-1.2.1/conf$ sudo gedit core-site.xml
Add the following lines between the <configuration> tags:
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hduser/hdfs/temp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://master:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
<property>
<name>dfs.name.dir</name>
<value>/home/hduser/hdfs/temp</value>
</property>
7.3. mapred-site.xml File
hduser@host1:~/hadoop-1.2.1/conf$ sudo gedit mapred-site.xml
Add the following lines between the <configuration> tags:
<property>
<name>mapred.job.tracker</name>
<value>master:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
7.4. hdfs-site.xml File
hduser@host1:~/hadoop-1.2.1/conf$ sudo gedit hdfs-site.xml
Add the following lines between the <configuration> tags:
<property>
<name>dfs.replication</name>
<value>3</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
Note: The number 3 is the default block replication factor. For a cluster of 3-10 nodes, a replication factor of 3 is a reasonable choice.
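As the property description notes, the replication factor can also be changed for a file after its creation; for example, for a hypothetical file already stored in HDFS:
bin/hadoop dfs -setrep -w 2 /home/hduser/somefile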
8. Hadoop Files Configuration (Master)
8.1. core-site.xml File
hduser@master:~/hadoop-1.2.1/conf$ sudo gedit core-site.xml
Add the following lines between the <configuration> tags:
<property>
<name>fs.default.name</name>
<value>hdfs://master:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hduser/hdfs/temp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>dfs.name.dir</name>
<value>/home/hduser/hdfs/temp</value>
</property>
8.2. mapred-site.xml File
hduser@master:~/hadoop-1.2.1/conf$ sudo gedit mapred-site.xml
Add the following lines between the <configuration> tags:
<property>
<name>mapred.job.tracker</name>
<value>master:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
8.3. hdfs-site.xml File
hduser@master:~/hadoop-1.2.1/conf$ sudo gedit hdfs-site.xml
Add the following lines between the <configuration> tags:
<property>
<name>dfs.replication</name>
<value>3</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
8.4. slaves File
hduser@master:~/hadoop-1.2.1/conf$ sudo gedit slaves
Comment out localhost, and add the names of your slaves (you can make your master node act as both master and slave at the same time by adding the master hostname to the slaves file):
master
host1
host2
…
8.5. masters File
hduser@master:~/hadoop-1.2.1/conf$ sudo gedit masters
Comment out localhost, and add the name of your master node:
master
Connecting Nodes
1. IP address configuration (All machines)
1.1. Find out the IP address of each machine
hduser@host1:~$ ifconfig
eth0 Link encap:Ethernet HWaddr 00:23:ae:b0:89:ae
inet addr:10.50.0.170 Bcast:10.50.255.255 Mask:255.255.0.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:198693 errors:0 dropped:0 overruns:0 frame:0
TX packets:9134 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:30871002 (30.8 MB) TX bytes:1334436 (1.3 MB)
Interrupt:21 Memory:fe6e0000-fe700000
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:58 errors:0 dropped:0 overruns:0 frame:0
TX packets:58 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:9306 (9.3 KB) TX bytes:9306 (9.3 KB)
1.2. Find out the host name of each machine
hduser@host1:~$ sudo gedit /etc/hostname
1.3. Open the hosts file (on each machine)
hduser@host1:~$ sudo gedit /etc/hosts
Replace the content of the file with the IP addresses and hostnames of all machines included in the cluster.
10.50.0.197 master
10.50.0.94 slave
…
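To confirm that the hostnames resolve correctly, you can ping each node by name from any machine:
hduser@host1:~$ ping -c 1 master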
2. Connect the master's hduser account with the hduser account on the slaves
Example: For machine with hostname host1
hduser@master:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@host1
Example: For machine with hostname host2
hduser@master:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@host2
3. Test the connection between the master and each slave machine
hduser@master:~$ ssh host1
Welcome to Ubuntu 12.04.2 LTS (GNU/Linux 3.5.0-23-generic i686)
* Documentation: https://help.ubuntu.com/
System information as of Sun Jun 30 19:44:28 WEST 2013
System load: 0.08 Processes: 159
Usage of /: 77.7% of 228.23GB Users logged in: 2
Memory usage: 35% IP address for eth0: 10.50.0.170
Swap usage: 0%
=> There is 1 zombie process.
Graph this data and manage this system at https://landscape.canonical.com/
97 packages can be updated.
66 updates are security updates.
Last login: Sun Jun 30 18:39:15 2013 from ip6-localhost
If the connection is established, close it to continue the installation:
hduser@host5:~$ exit
logout
Connection to host5 closed.
Formatting the HDFS & Starting Multi-node Cluster
1. Format the HDFS filesystem via the NameNode
hduser@master:~/hadoop-1.2.1$ bin/hadoop namenode -format
Here is the output:
13/06/30 20:00:42 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = master/10.50.0.197
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.2.1
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1 -r 1440782; compiled by
'hortonfo' on Thu Jan 31 02:03:24 UTC 2013
************************************************************/
13/06/30 20:00:42 INFO util.GSet: VM type = 32-bit
13/06/30 20:00:42 INFO util.GSet: 2% max memory = 19.33375 MB
13/06/30 20:00:42 INFO util.GSet: capacity = 2^22 = 4194304 entries
13/06/30 20:00:42 INFO util.GSet: recommended=4194304, actual=4194304
13/06/30 20:00:42 INFO namenode.FSNamesystem: fsOwner=hduser
13/06/30 20:00:42 INFO namenode.FSNamesystem: supergroup=supergroup
13/06/30 20:00:42 INFO namenode.FSNamesystem: isPermissionEnabled=true
13/06/30 20:00:42 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
13/06/30 20:00:42 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s),
accessTokenLifetime=0 min(s)
13/06/30 20:00:42 INFO namenode.NameNode: Caching file names occuring more than 10 times
13/06/30 20:00:42 INFO common.Storage: Image file of size 112 saved in 0 seconds.
13/06/30 20:00:42 INFO namenode.FSEditLog: closing edit log: position=4,
editlog=/home/hduser/hdfs/temp/dfs/name/current/edits
13/06/30 20:00:42 INFO namenode.FSEditLog: close success: truncate to 4,
editlog=/home/hduser/hdfs/temp/dfs/name/current/edits
13/06/30 20:00:43 INFO common.Storage: Storage directory /home/hduser/hdfs/temp/dfs/name has been successfully
formatted.
13/06/30 20:00:43 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at master/10.50.0.197
************************************************************/
2. Start the multi-node cluster
hduser@master:~/hadoop-1.2.1$ bin/start-all.sh
Alternatively, start the DFS and MapReduce daemons separately:
hduser@master:~/hadoop-1.2.1$ bin/start-dfs.sh
hduser@master:~/hadoop-1.2.1$ bin/start-mapred.sh
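Once the daemons are running, you can also check the standard Hadoop 1.x web interfaces on their default ports: the NameNode status page at http://master:50070 and the JobTracker at http://master:50030.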
3. On the master machine, check that the following Java processes are running:
hduser@master:~$ jps
5721 SecondaryNameNode
6738 DataNode
5243 NameNode
6047 TaskTracker
8423 Jps
5805 JobTracker
4. On the slave machines, check that the following Java processes are running:
hduser@master:~$ jps
1902 DataNode
4002 Jps
2108 TaskTracker
If you get the following output instead:
hduser@host1:~/hadoop-1.2.1/conf$ jps
The program 'jps' can be found in the following packages:
* openjdk-6-jdk
* openjdk-7-jdk
Ask your administrator to install one of them
Then install one of the suggested packages:
hduser@host1:~/hadoop-1.2.1/conf$ sudo apt-get install openjdk-7-jdk
Note: if you do not see the same processes, follow the suggestions provided in the Hadoop Troubleshooting section below.
Hadoop Troubleshooting
1. Formatting the Namenode Exception: “Cannot lock storage…”
hduser@master:~/hadoop-1.2.1$ bin/hadoop namenode -format
13/06/30 19:57:35 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = master/10.50.0.197
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.2.1
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1 -r 1440782;
compiled by 'hortonfo' on Thu Jan 31 02:03:24 UTC 2013
************************************************************/
…
….
13/06/30 19:57:38 ERROR namenode.NameNode: java.io.IOException: Cannot lock storage
/home/hduser/hdfs/temp/dfs/name. The directory is already locked.
at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:599)
at org.apache.hadoop.hdfs.server.namenode.FSImage.format(FSImage.java:1327)
at org.apache.hadoop.hdfs.server.namenode.FSImage.format(FSImage.java:1345)
at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:1207)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1398)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1419)
13/06/30 19:57:38 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at master/10.50.0.197
************************************************************/
Solution
Step 1: Stop all processes
hduser@master:~/hadoop-1.2.1$ bin/stop-all.sh
Step 2: move to the ~/hdfs/temp folder and run the following command
hduser@master:~/hdfs/temp$ sudo rm -rf *
Step 3: restart your work by formatting the NameNode
hduser@master:~/hadoop-1.2.1$ bin/hadoop namenode -format
2. Formatting the Namenode Exception: “Cannot create directory
/home/hduser/hdfs…”
Solution
In this case, make sure that you set the following permissions when creating the /hdfs/temp folder:
hduser@host1:~$ sudo chmod 750 /home/hduser/hdfs/temp/
3. Exception in log file: hadoop-hduser-datanode-host1.log or when Hadoop DataNode
doesn’t show up in slave nodes
hduser@host1:~/hadoop-1.2.1/logs$ sudo gedit hadoop-hduser-datanode-host1.log
2013-06-30 19:01:09,078 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException:
Incompatible namespaceIDs in /home/hduser/hdfs/temp/dfs/data: namenode namespaceID = 1345454277;
datanode namespaceID = 1875045188
at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:232)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:147)
at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:399)
at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:309)
at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1651)
at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1590)
at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1608)
at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:1734)
at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1751)
Solution 1
1. From master machine, open VERSION file under /hdfs/temp/dfs/name/current folder:
hduser@master:~/hdfs/temp/dfs/name/current$ sudo gedit VERSION
Here is the content of VERSION file:
#Sun Jun 30 20:00:43 WEST 2013
namespaceID=1289101159
cTime=0
storageType=NAME_NODE
layoutVersion=-32
Check the value of the namespaceID variable (in this case it is 1289101159); remember it, as you will need it in the next step.
2. From all slave machines where you found the above exception, open the VERSION file
under /hdfs/temp/dfs/data/current folder:
hduser@host1:~/hdfs/tmp/dfs/data/current$ sudo gedit VERSION
Here is the content of VERSION file:
#Fri Jun 14 09:22:08 WET 2013
namespaceID=176572587
storageID=DS-1900366223-127.0.1.1-50010-1371201728420
cTime=0
storageType=DATA_NODE
layoutVersion=-32
Replace the namespaceID variable with the value you found in the VERSION file of the master.
The content of the VERSION file under the /hdfs/temp/dfs/data/current folder then becomes:
#Fri Jun 14 09:22:08 WET 2013
namespaceID=1289101159
storageID=DS-1900366223-127.0.1.1-50010-1371201728420
cTime=0
storageType=DATA_NODE
layoutVersion=-32
Solution 2
1. Stop the whole cluster
2. Delete the data directory on the problematic DataNode: the directory is specified by dfs.data.dir in conf/hdfs-site.xml; if you followed this tutorial, the relevant directory is /hdfs/temp/dfs/data.
3. Reformat the NameNode.
4. Restart the cluster.
4. Safe mode exception when running MapReduce examples
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.SafeModeException:
Cannot delete /benchmarks/TestDFSIO. Name node is in safe mode.
The reported blocks is only 3601 but the threshold is 0.9990 and the total blocks 3748. Safe mode will be turned
off automatically.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInternal(FSNamesystem.java:2111)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:2088)
at org.apache.hadoop.hdfs.server.namenode.NameNode.delete(NameNode.java:832)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
Solution
hduser@master:~/hadoop-1.2.1$ bin/hadoop dfsadmin -safemode leave
Safe mode is OFF
hduser@master:~/hadoop-1.2.1$ bin/hadoop jar hadoop-*test*.jar TestDFSIO -clean
TestDFSIO.0.0.4
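To check the safe mode status without forcing it off, the following command reports it:
hduser@master:~/hadoop-1.2.1$ bin/hadoop dfsadmin -safemode get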
References for Appendix C
[1] http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
[2] http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/#solution-2-manually-update-the-namespaceid-of-problematic-datanodes
Appendix D: TeraSort and TestDFSIO Execution
1. TeraSort
1.1. Generate the TeraSort input data using TeraGen
TeraGen generates random data that can be conveniently used as input for a subsequent TeraSort run. The command to run TeraGen in order to generate 100 MB of input data is:
bin/hadoop jar hadoop-*examples*.jar teragen 1000000 /home/hduser/terasort-input
1000000 specifies the number of rows of input data to generate, each of which is 100 bytes, giving 100 MB in total.
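The dataset size thus scales linearly with the row count: 1,000,000 rows x 100 bytes = 100 MB. For instance, generating the 1 GB input used in the experiments takes 10,000,000 rows:
bin/hadoop jar hadoop-*examples*.jar teragen 10000000 /home/hduser/terasort-input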
1.2. Run the actual TeraSort benchmark
The syntax to run the TeraSort benchmark is as follows:
bin/hadoop jar hadoop-*examples*.jar terasort /home/hduser/terasort-input /home/hduser/terasort-output
1.3. Validate the sorted output data of TeraSort using TeraValidate
The syntax to run TeraValidate is as follows; the first argument is the TeraSort output directory, and the second is a directory (here called terasort-validate) where the validation report is written:
bin/hadoop jar hadoop-*examples*.jar teravalidate /home/hduser/terasort-output /home/hduser/terasort-validate
1.4. Check the TeraSort analysis
To check the average time taken to generate the 100 MB of input, run the following command:
bin/hadoop job -history /home/hduser/terasort-input
To check the average time taken to sort the 100 MB, run the following command:
bin/hadoop job -history /home/hduser/terasort-output
1.5. Clean up your temporary files
When re-running the TeraSort benchmark, you first need to clean up all files generated by the previous TeraSort test:
bin/hadoop dfs -rmr /home/hduser/terasort-input
bin/hadoop dfs -rmr /home/hduser/terasort-output
2. TestDFSIO
2.1. Write data using TestDFSIO-Write
To generate a 100 MB dataset, you specify an input of 10 files, each of 10 MB. To do so, the following command needs to be executed:
hadoop jar hadoop-*test*.jar TestDFSIO -write -nrFiles 10 -fileSize 10
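Larger datasets are obtained by scaling these two parameters; for example, a 1000 MB (1 GB) dataset can be written as 10 files of 100 MB each:
hadoop jar hadoop-*test*.jar TestDFSIO -write -nrFiles 10 -fileSize 100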
A sample output of the TestDFSIO write operation provides information about the throughput, average I/O rate, I/O rate standard deviation and test execution time.
13/11/07 15:37:27 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
13/11/07 15:37:27 INFO fs.TestDFSIO: Date & time: Thu Nov 07 15:37:27 UTC 2013
13/11/07 15:37:27 INFO fs.TestDFSIO: Number of files: 10
13/11/07 15:37:27 INFO fs.TestDFSIO: Total MBytes processed: 100
13/11/07 15:37:27 INFO fs.TestDFSIO: Throughput mb/sec: 5.680527152919791
13/11/07 15:37:27 INFO fs.TestDFSIO: Average IO rate mb/sec: 9.899490356445312
13/11/07 15:37:27 INFO fs.TestDFSIO: IO rate std deviation: 7.567628183406918
13/11/07 15:37:27 INFO fs.TestDFSIO: Test exec time sec: 17.568
13/11/07 15:37:27 INFO fs.TestDFSIO:
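Note that TestDFSIO also appends these figures to a local log file (TestDFSIO_results.log by default) in the directory from which the benchmark was launched, which is convenient for collecting the measurements reported in Appendix F.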
2.2. Read data using TestDFSIO-Read
After getting the results of the TestDFSIO-write command, the next step is to run the TestDFSIO-read operation. To read the previously generated data, the following command needs to be executed:
hadoop jar hadoop-*test*.jar TestDFSIO -read -nrFiles 10 -fileSize 10
A sample output of the read operation provides information about the throughput, average I/O rate, I/O rate standard deviation and test execution time.
13/11/07 15:38:11 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
13/11/07 15:38:11 INFO fs.TestDFSIO: Date & time: Thu Nov 07 15:38:11 UTC 2013
13/11/07 15:38:11 INFO fs.TestDFSIO: Number of files: 10
13/11/07 15:38:11 INFO fs.TestDFSIO: Total MBytes processed: 100
13/11/07 15:38:11 INFO fs.TestDFSIO: Throughput mb/sec: 70.57163020465772
13/11/07 15:38:11 INFO fs.TestDFSIO: Average IO rate mb/sec: 73.69004821777344
13/11/07 15:38:11 INFO fs.TestDFSIO: IO rate std deviation: 16.249892929638822
13/11/07 15:38:11 INFO fs.TestDFSIO: Test exec time sec: 15.51
13/11/07 15:38:11 INFO fs.TestDFSIO:
2.3. Clean your cluster
The last step is to clean up the generated data using the following command:
bin/hadoop jar hadoop-*test*.jar TestDFSIO -clean
Appendix E: Data Gathering for TeraSort
1. Hadoop Physical Cluster
Number of Machines Dataset Size Phase Test 1 Test 2 Test 3 Mean
3
100 MB Map 6 7 6 6.33
100 MB Shuffling 10 10 11 10.33
100 MB Reduce 5 5 4 4.67
1 GB Map 13 14 16 14.33
1 GB Shuffling 83 81 85 83.00
1 GB Reduce 99 77 93 89.67
10 GB Map 31 22 19 24.00
10 GB Shuffling 1065 921 930 972.00
10 GB Reduce 1511 1841 1679 1677.00
30 GB Map 26 25 28 26.33
30 GB Shuffling 2971 2312 3081 2788.00
30 GB Reduce 9522 7544 8434 8500.00
4
100 MB Map 5 7 7 6.33
100 MB Shuffling 10 10 10 10.00
100 MB Reduce 4 6 4 4.67
1 GB Map 14 16 15 15.00
1 GB Shuffling 81 79 80 80.00
1 GB Reduce 99 87 82 89.33
10 GB Map 19 21 19 19.67
10 GB Shuffling 951 921 881 917.67
10 GB Reduce 1680 1714 1421 1605.00
30 GB Map 21 22 20 21.00
30 GB Shuffling 2860 2912 3120 2964.00
30 GB Reduce 5908 6412 6109 6143.00
5
100 MB Map 5 6 6 5.67
100 MB Shuffling 10 10 10 10.00
100 MB Reduce 5 5 5 5.00
1 GB Map 14 14 13 13.67
1 GB Shuffling 79 87 80 82.00
1 GB Reduce 84 93 83 86.67
10 GB Map 21 19 20 20.00
10 GB Shuffling 937 900 857 898.00
10 GB Reduce 1729 1611 1360 1566.67
30 GB Map 19 23 22 21.33
30 GB Shuffling 2446 2710 2650 2602.00
30 GB Reduce 5437 7118 6821 6458.67
6
100 MB Map 6 6 5 5.67
100 MB Shuffling 10 10 11 10.33
100 MB Reduce 4 4 5 4.33
1 GB Map 16 18 14 16.00
1 GB Shuffling 86 89 83 86.00
1 GB Reduce 91 75 73 79.67
10 GB Map 18 18 16 17.33
10 GB Shuffling 885 929 906 906.67
10 GB Reduce 1147 1515 1097 1253.00
30 GB Map 20 19 20 19.67
30 GB Shuffling 2731 2694 2725 2716.67
30 GB Reduce 6419 6210 5877 6168.67
7
100 MB Map 6 5 6 5.67
100 MB Shuffling 10 10 10 10.00
100 MB Reduce 5 5 4 4.67
1 GB Map 16 18 12 15.33
1 GB Shuffling 83 81 87 83.67
1 GB Reduce 85 83 80 82.67
10 GB Map 23 27 25 25.00
10 GB Shuffling 985 910 979 958.00
10 GB Reduce 1681 1591 1514 1595.33
30 GB Map 37 23 40 33.33
30 GB Shuffling 2983 2796 2882 2887.00
30 GB Reduce 6514 5891 5338 5914.33
8
100 MB Map 5 5 5 5.00
100 MB Shuffling 10 10 10 10.00
100 MB Reduce 5 4 5 4.67
1 GB Map 15 11 10 12.00
1 GB Shuffling 92 91 88 90.33
1 GB Reduce 80 76 75 77.00
10 GB Map 20 25 29 24.67
10 GB Shuffling 925 1020 893 946.00
10 GB Reduce 1043 1679 2092 1604.67
30 GB Map 27 24 30 27.00
30 GB Shuffling 2812 2777 2834 2807.67
30 GB Reduce 5319 6317 5395 5677.00
2. Hadoop Virtualized Cluster - KVM
Number of KVM VMs Dataset Size Phase Test 1 Test 2 Test 3 Mean
3
100 MB Map 4 6 5 5
100 MB Shuffling 7 7 7 7
100 MB Reduce 3 3 3 3
1 GB Map 12 14 12 12.67
1 GB Shuffling 37 37 38 37.33
1 GB Reduce 41 41 40 40.67
10 GB Map 24 20 23 22.33
10 GB Shuffling 781 737 718 745.33
10 GB Reduce 336 345 392 357.67
30 GB Map 24 24 23 23.67
30 GB Shuffling 2150 2220 2172 2180.67
30 GB Reduce 1559 1542 1539 1546.67
4
100 MB Map 5 5 5 5.00
100 MB Shuffling 6 7 7 6.67
100 MB Reduce 3 3 3 3.00
1 GB Map 12 15 16 14.33
1 GB Shuffling 28 34 38 33.33
1 GB Reduce 38 40 40 39.33
10 GB Map 28 29 23 26.67
10 GB Shuffling 657 672 657 662.00
10 GB Reduce 438 442 419 433.00
100 GB Map 25 28 25 26.00
100 GB Shuffling 1952 2046 1887 1961.67
100 GB Reduce 1616 1517 1605 1579.33
5
100 MB Map 5 5 5 5.00
100 MB Shuffling 6 7 6 6.33
100 MB Reduce 3 3 3 3.00
1 GB Map 61 64 85 70.00
1 GB Shuffling 113 109 139 120.33
1 GB Reduce 51 41 42 44.67
10 GB Map 33 29 32 31.33
10 GB Shuffling 746 632 877 751.67
10 GB Reduce 445 477 358 426.67
100 GB Map 37 66 51 51.33
100 GB Shuffling 3446 3332 2816 3198.00
100 GB Reduce 1413 1597 1788 1599.33
6
100 MB Map 5 5 4 4.67
100 MB Shuffling 6 6 6 6.00
100 MB Reduce 3 4 4 3.67
1 GB Map 224 343 266 277.67
1 GB Shuffling 511 464 492 489.00
1 GB Reduce 56 48 63 55.67
10 GB Map 45 37 42 41.33
10 GB Shuffling 1652 1387 1745 1594.67
10 GB Reduce 404 412 532 449.33
100 GB Map 140 180 50 123.33
100 GB Shuffling 7402 10197 5710 7769.67
100 GB Reduce 1717 1565 1206 1496.00
7
100 MB Map 5 5 5 5.00
100 MB Shuffling 6 6 6 6.00
100 MB Reduce 4 3 3 3.33
1 GB Map 124 245 365 244.67
1 GB Shuffling 1083 958 1344 1128.33
1 GB Reduce 102 121 81 101.33
10 GB Map 61 63 58 60.67
10 GB Shuffling 1024 1984 2062 1690.00
10 GB Reduce 985 1101 1024 1036.67
100 GB Map 185 163 154 167.33
100 GB Shuffling 12112 10197 12024 11444.33
100 GB Reduce 1987 1851 2106 1981.33
8
100 MB Map 5 5 5 5.00
100 MB Shuffling 6 6 6 6.00
100 MB Reduce 4 3 3 3.33
1 GB Map 162 193 167 174.00
1 GB Shuffling 1201 1320 1259 1260.00
1 GB Reduce 545.4 244.42 163.62 317.81
10 GB Map 104 121 97 107.33
10 GB Shuffling 2489.52 2440.32 2536.26 2488.70
10 GB Reduce 1211.55 1354.23 2283.52 1616.43
100 GB Map 201 195 168 188
100 GB Shuffling 11087 14587 13214 12962.667
100 GB Reduce 3088 3145 2906 3046.3333
3. Hadoop Virtualized Cluster - VMware ESXi
Number of VMware VMs Dataset Size Phase Test 1 Test 2 Test 3 Mean
3
100 MB Map 5 5 5 5
100 MB Shuffling 8 7 7 7
100 MB Reduce 4 4 4 4
1 GB Map 18 16 16 17
1 GB Shuffling 42 49 41 44
1 GB Reduce 40 38 39 39
10 GB Map 24 22 23 23
10 GB Shuffling 660 636 645 647
10 GB Reduce 492 483 493 489
30 GB Map 44 44 43 44
30 GB Shuffling 4108 3952 3891 3984
30 GB Reduce 2278 2315 2101 2231
4
100 MB Map 5 5 5 5
100 MB Shuffling 7 7 8 7
100 MB Reduce 4 4 4 4
1 GB Map 19 15 15 16
1 GB Shuffling 38 39 42 40
1 GB Reduce 42 41 40 41
10 GB Map 25 24 25 24.66667
10 GB Shuffling 672 691 682 682
10 GB Reduce 486 425 411 440.6667
30 GB Map 35 51 43 43
30 GB Shuffling 2657 3257 3214 3042.667
30 GB Reduce 1985 1852 1865 1900.667
5
100 MB Map 7 5 5 6
100 MB Shuffling 8 7 7 7
100 MB Reduce 4 3 3 3
1 GB Map 19 21 18 19
1 GB Shuffling 35 30 32 32
1 GB Reduce 39 35 37 37
10 GB Map 31 26 28 28
10 GB Shuffling 553 514 503 523
10 GB Reduce 418 432 421 424
30 GB Map 39 36 45 40
30 GB Shuffling 2540 2412 2286 2413
30 GB Reduce 2310 2245 2101 2219
6
100 MB Map 5 6 5 5
100 MB Shuffling 7 7 6 7
100 MB Reduce 5 4 4 4
1 GB Map 18 18 17 18
1 GB Shuffling 28 29 27 28
1 GB Reduce 32 29 34 32
10 GB Map 59 42 41 47
10 GB Shuffling 536 552 529 539
10 GB Reduce 369 385 336 363
30 GB Map 30 32 28 30
30 GB Shuffling 2412 2254 2114 2260
30 GB Reduce 2098 1671 1658 1809
7
100 MB Map 10 10 8 9
100 MB Shuffling 12 11 8 10
100 MB Reduce 4 4 4 4
1 GB Map 24 29 26 26
1 GB Shuffling 35 32 39 35
1 GB Reduce 26 34 25 28
10 GB Map 52 56 52 53
10 GB Shuffling 536 520 511 522
10 GB Reduce 298 290 302 297
30 GB Map 84 76 87 82
30 GB Shuffling 3210 2687 2968 2955
30 GB Reduce 1743 1523 1621 1629
8
100 MB Map 17 16 11 15
100 MB Shuffling 15 16 14 15
100 MB Reduce 4 4 4 4
1 GB Map 81 79 81 80
1 GB Shuffling 92 93 82 89
1 GB Reduce 36 36 37 36
10 GB Map 128 102 127 119
10 GB Shuffling 1340 1102 1021 1154
10 GB Reduce 509 562 554 542
30 GB Map 144 137 142 141
30 GB Shuffling 4481 4251 4012 4248
30 GB Reduce 1753 1578 1697 1676
Appendix F: Data Gathering for TestDFSIO
1. Hadoop Physical Cluster
Number of Nodes = 3
Dataset Size Operation Criteria Test1 Test2 Test3 Mean
100 MB
Write
Throughput (mb/sec) 2.867 2.861 2.421 2.72
Average IO rate (mb/sec) 2.903 2.957 2.517 2.79
IO rate standard deviation 0.363 0.505 0.486 0.45
Execution time (sec) 17.786 16.717 18.8 17.77
Read
Throughput (mb/sec) 7.645 6.309 6.558 6.84
Average IO rate (mb/sec) 19.509 11.442 31.255 20.74
IO rate standard deviation 26.167 14.595 40.655 27.14
Execution time (sec) 14.72 16.721 14.705 15.38
1 GB
Write
Throughput (mb/sec) 2.507 2.713 2.204 2.47
Average IO rate (mb/sec) 2.889 2.866 2.498 2.75
IO rate standard deviation 1.2632 0.765 0.929 0.99
Execution time (sec) 77.129 74.47 83.658 78.42
Read
Throughput (mb/sec) 6.037 7.297 5.068 6.13
Average IO rate (mb/sec) 10.231 31.235 8.779 16.75
IO rate standard deviation 10.784 39.149 9.712 19.88
Execution time (sec) 43.468 35.947 42.779 40.73
10 GB
Write
Throughput (mb/sec) 2.503 2.589 3.288 2.79
Average IO rate (mb/sec) 2.671 2.761 3.318 2.92
IO rate standard deviation 0.796 0.817 0.317 0.64
Execution time (sec) 674.535 641.232 363.144 559.64
Read
Throughput (mb/sec) 7.956 7.799 5.458 7.07
Average IO rate (mb/sec) 11.289 12.452 5.786 9.84
IO rate standard deviation 6.421 12.916 1.485 6.94
Execution time (sec) 241.896 296.722 257.708 265.44
100 GB
Write
Throughput (mb/sec) 3.544 3.275 3.275 3.36
Average IO rate (mb/sec) 3.546 3.284 3.282 3.37
IO rate standard deviation 0.089 0.165 0.148 0.13
Execution time (sec) 3315.61 3343.122 3338.37 3332.37
Read
Throughput (mb/sec) 4.746 5.109 5.603 5.15
Average IO rate (mb/sec) 4.791 5.238 12.875 7.63
IO rate standard deviation 0.478 0.852 18.333 6.55
Execution time (sec) 2387.659 2467.634 2734.46 2529.92
Number of Nodes = 4
Dataset Size Operation Criteria Test1 Test2 Test3 Mean
100 MB
Write
Throughput (mb/sec) 2.65 3.303 3.639 3.20
Average IO rate (mb/sec) 2.661 3.543 3.932 3.38
IO rate standard deviation 0.173 0.796 1.212 0.73
Execution time (sec) 17.665 17.039 15.674 16.79
Read
Throughput (mb/sec) 6.405 9.038 5.827 7.09
Average IO rate (mb/sec) 19.631 31.433 19.547 23.54
IO rate standard deviation 28.056 37.351 29.466 31.62
Execution time (sec) 15.35 13.684 14.676 14.57
1 GB
Write
Throughput (mb/sec) 2.556 2.79 2.786 2.71
Average IO rate (mb/sec) 2.669 2.954 2.885 2.84
IO rate standard deviation 0.582 0.747 0.536 0.62
Execution time (sec) 59.677 58.02 61.536 59.74
Read
Throughput (mb/sec) 12.133 6.031 8.264 8.81
Average IO rate (mb/sec) 27.419 7.751 25.182 20.12
IO rate standard deviation 34.001 4.998 35.168 24.72
Execution time (sec) 33.087 40.004 32.861 35.32
10 GB
Write
Throughput (mb/sec) 3.713 3.325 3.201 3.41
Average IO rate (mb/sec) 3.735 3.341 3.22 3.43
IO rate standard deviation 0.283 0.236 0.248 0.26
Execution time (sec) 315.636 347.593 367.294 343.51
Read
Throughput (mb/sec) 5.045 5.738 5.006 5.26
Average IO rate (mb/sec) 5.205 11.437 5.138 7.26
IO rate standard deviation 1.779 15.779 0.884 6.15
Execution time (sec) 258.009 261.24 276.283 265.18
100 GB
Write
Throughput (mb/sec) 3.533 3.354 3.366 3.42
Average IO rate (mb/sec) 3.538 3.356 3.37 3.42
IO rate standard deviation 0.136 0.085 0.111 0.11
Execution time (sec) 3557.813 3507.078 3184.76 3416.5
Read
Throughput (mb/sec) 7.009 6.716 4.349 6.02
Average IO rate (mb/sec) 7.6966 12.229 10.179 10.03
IO rate standard deviation 5.129 9.796 12.546 9.16
Execution time (sec) 2098.422 2700.035 3046.01 2614.8
Number of Nodes = 5
Dataset Size Operation Criteria Test1 Test2 Test3 Mean
100 MB
Write
Throughput (mb/sec) 2.597 2.791 3.804 3.06
Average IO rate (mb/sec) 2.623 2.841 3.941 3.14
IO rate standard deviation 0.28 0.406 0.772 0.49
Execution time (sec) 16.672 15.708 16.679 16.35
Read
Throughput (mb/sec) 8.019 7.213 10.68 8.64
Average IO rate (mb/sec) 32.097 35.821 46.053 37.99
IO rate standard deviation 40.452 47.846 40.287 42.86
Execution time (sec) 14.584 14.501 14.579 14.55
1 GB
Write
Throughput (mb/sec) 2.477 2.896 2.676 2.68
Average IO rate (mb/sec) 2.572 3.032 2.757 2.79
IO rate standard deviation 0.533 0.64 0.533 0.57
Execution time (sec) 59.372 56.319 54.271 56.65
Read
Throughput (mb/sec) 7.659 5.617 8.738 7.34
Average IO rate (mb/sec) 11.029 8.651 25.868 15.18
IO rate standard deviation 7.049 8.984 41.954 19.33
Execution time (sec) 36.214 35.04 30.18 33.81
10 GB
Write
Throughput (mb/sec) 3.309 3.337 3.382 3.34
Average IO rate (mb/sec) 3.329 3.367 3.415 3.37
IO rate standard deviation 0.264 0.335 0.331 0.31
Execution time (sec) 346.239 340.622 361.257 349.37
Read
Throughput (mb/sec) 6.309 5.741 4.771 5.61
Average IO rate (mb/sec) 9.178 13.109 4.839 9.04
IO rate standard deviation 9.178 23.064 0.609 10.95
Execution time (sec) 263.224 256.89 254.85 258.32
100 GB
Write
Throughput (mb/sec) 3.103 3.081 3.343 3.18
Average IO rate (mb/sec) 3.115 3.092 3.349 3.19
IO rate standard deviation 0.191 0.183 0.1386 0.17
Execution time (sec) 3552.118 3478.991 3177.001 3402.70
Read
Throughput (mb/sec) 4.737 5.198 4.478 4.80
Average IO rate (mb/sec) 6.078 5.292 4.512 5.29
IO rate standard deviation
Execution time (sec) 2462.739 2243.086 2558.051 2421.29
Number of Nodes = 6
Dataset Size Operation Criteria Test1 Test2 Test3 Mean
100 MB
Write
Throughput (mb/sec) 3.603 4.19 3.536 3.78
Average IO rate (mb/sec) 3.726 4.329 3.949 4.00
IO rate standard deviation 0.714 0.865 1.337 0.97
Execution time (sec) 16.679 15.703 15.792 16.06
Read
Throughput (mb/sec) 7.017 6.681 8.877 7.53
Average IO rate (mb/sec) 37.162 24.101 33.456 31.57
IO rate standard deviation 49.103 36.636 41.63 42.46
Execution time (sec) 14.165 13.942 14.833 14.31
1 GB
Write
Throughput (mb/sec) 3.089 3.155 3.0178 3.09
Average IO rate (mb/sec) 3.369 3.239 3.088 3.23
IO rate standard deviation 1.13 0.595 0.522 0.75
Execution time (sec) 55.472 51.491 51.749 52.90
Read
Throughput (mb/sec) 7.809 7.593 5.651 7.02
Average IO rate (mb/sec) 8.239 20.751 6.391 11.79
IO rate standard deviation 1.988 34.499 2.169 12.89
Execution time (sec) 33.23 34.396 32.392 33.34
10 GB
Write
Throughput (mb/sec) 3.366 3.133 3.782 3.43
Average IO rate (mb/sec) 3.386 3.139 3.796 3.44
IO rate standard deviation 0.267 0.14 0.229 0.21
Execution time (sec) 347.497 353.804 297.105 332.80
Read
Throughput (mb/sec) 5.681 6.327 14.756 8.92
Average IO rate (mb/sec) 10.302 14.173 27.573 17.35
IO rate standard deviation 13.222 18.079 22.233 17.84
Execution time (sec) 269.214 270.797 176.225 238.75
100 GB
Write
Throughput (mb/sec) 3.343 3.252 3.268 3.29
Average IO rate (mb/sec) 3.352 3.26 3.275 3.30
IO rate standard deviation 0.178 0.173 6.127 2.16
Execution time (sec) 3254.674 3329.312 3313.773 3299.25
Read
Throughput (mb/sec) 5.435 5.169 6.126 5.58
Average IO rate (mb/sec) 7.827 5.465 11.738 8.34
IO rate standard deviation 8.045 3.505 13.987 8.51
Execution time (sec) 2369.118 2481.531 2168.304 2339.65
Number of Nodes = 7
Dataset Size Operation Criteria Test1 Test2 Test3 Mean
100 MB
Write
Throughput (mb/sec) 3.475 3.028 3.475 3.33
Average IO rate (mb/sec) 3.928 3.263 3.928 3.71
IO rate standard deviation 1.605 0.905 1.605 1.37
Execution time (sec) 15.793 15.679 15.793 15.76
Read
Throughput (mb/sec) 9.034 6.669 9.034 8.25
Average IO rate (mb/sec) 29.731 14.642 29.731 24.70
IO rate standard deviation 35.058 25.436 35.058 31.85
Execution time (sec) 14.071 14.688 14.071 14.28
1 GB
Write
Throughput (mb/sec) 3.771 3.837 3.203 3.60
Average IO rate (mb/sec) 3.814 3.887 3.509 3.74
IO rate standard deviation 0.402 0.441 1.118 0.65
Execution time (sec) 44.285 41.408 52.404 46.03
Read
Throughput (mb/sec) 6.069 6.664 6.644 6.46
Average IO rate (mb/sec) 13.227 19.04 7.797 13.35
IO rate standard deviation 19.929 38.007 3.689 20.54
Execution time (sec) 42.883 37.004 32.181 37.36
10 GB
Write
Throughput (mb/sec) 3.377 3.548 3.636 3.52
Average IO rate (mb/sec) 3.395 3.568 3.646 3.54
IO rate standard deviation 0.248 0.28 0.194 0.24
Execution time (sec) 342.034 313.38 311.647 322.35
Read
Throughput (mb/sec) 5.909 7.832 7.661 7.13
Average IO rate (mb/sec) 8.364 18.227 14.755 13.78
IO rate standard deviation 6.168 22.808 17.955 15.64
Execution time (sec) 273.925 238.699 242.805 251.81
100 GB
Write
Throughput (mb/sec) 2.698 3.49 3.609 3.27
Average IO rate (mb/sec) 2.77 3.493 3.611 3.29
IO rate standard deviation 0.508 0.083 0.075 0.22
Execution time (sec) 2987.432 2972.533 2849.33 2936.43
Read
Throughput (mb/sec) 3.9676 4.499 4.804 4.42
Average IO rate (mb/sec) 6.569 9.992 6.072 7.54
IO rate standard deviation 8.425 14.613 3.837 8.96
Execution time (sec) 1846.735 1952.279 2653.414 2150.81
Number of Nodes = 8
Dataset Size Operation Criteria Test1 Test2 Test3 Mean
100 MB
Write
Throughput (mb/sec) 3.229 2.197 1.828 3.51
Average IO rate (mb/sec) 3.485 2.235 2.198 3.78
IO rate standard deviation 1.161 0.324 1.507 1.2115
Execution time (sec) 15.854 16.137 16.931 15.7645
Read
Throughput (mb/sec) 5.521 5.721 5.361 7.767
Average IO rate (mb/sec) 18.754 18.857 16.446 30.20
IO rate standard deviation 40.071 40.038 32.486 41.78
Execution time (sec) 14.75 15.846 15.278 14.72
1 GB
Write
Throughput (mb/sec) 3.977 3.701 4.010 3.82
Average IO rate (mb/sec) 4.067 3.826 4.061 3.91
IO rate standard deviation 0.614 0.758 0.473 0.63
Execution time (sec) 38.578 46.594 43.285 43.11
Read
Throughput (mb/sec) 6.079 6.579 5.672 6.49
Average IO rate (mb/sec) 14.054 36.377 14.545 20.03
IO rate standard deviation 24.771 67.854 26.687 35.06
Execution time (sec) 40.959 40.594 42.45 39.94
10 GB
Write
Throughput (mb/sec) 3.718 3.441 3.432 3.52
Average IO rate (mb/sec) 3.733 3.466 3.460 3.54
IO rate standard deviation 0.244 0.306 0.320 0.23
Execution time (sec) 305.018 332.807 337.531 323.16
Read
Throughput (mb/sec) 7.692 8.208 6.057 6.43
Average IO rate (mb/sec) 14.645 16.074 13.983 11.94
IO rate standard deviation 12.945 16.939 19.665 12.82
Execution time (sec) 292.329 230.154 289.889 286.00
100 GB
Write
Throughput (mb/sec) 3.208 3.568 3.592 3.54
Average IO rate (mb/sec) 3.216 3.571 3.595 3.55
IO rate standard deviation 0.154 0.086 0.095 0.13
Execution time (sec) 3386.666 2908.484 2918.7 2988.4
Read
Throughput (mb/sec) 4.959 4.648 5.303 5.23
Average IO rate (mb/sec) 4.941 4.694 6.544 7.66
IO rate standard deviation 1.292 0.495 4.823 6.73
Execution time (sec) 2508.348 2343.769 2358.37 2471.1
300 GB
Write
Throughput (mb/sec) 2.641 2.629 2.638 2.63
Average IO rate (mb/sec) 2.657 2.644 2.654 2.65
IO rate standard deviation 0.190 0.178 0.196 0.19
Execution time (sec) 7882.599 7862.87 8000.19 7917.8
Read
Throughput (mb/sec) 4.202 4.111 4.307 4.28
Average IO rate (mb/sec) 5.188 5.044 8.211 6.04
IO rate standard deviation 4.432 3.660 12.171 6.15
Execution time (sec) 5386.197 5796.385 5551.46 5546.3
2. Hadoop Virtualized Cluster - KVM
Number of KVM VMs = 3
Dataset Size Operation Criteria Test1 Test2 Test3 Mean
100 MB
Write
Throughput (mb/sec) 6.804 9.417 6.152 7.46
Average IO rate (mb/sec) 11.989 20.359 15.808 16.05
IO rate standard deviation 9.399 15.769 16.049 13.74
Execution time (sec) 15.439 15.405 13.405 14.75
Read
Throughput (mb/sec) 101.833 96.618 104.275 100.91
Average IO rate (mb/sec) 102.428 102.200 105.770 103.47
IO rate standard deviation 7.777 22.154 12.986 14.31
Execution time (sec) 13.479 14.881 13.399 13.92
1 GB
Write
Throughput (mb/sec) 7.764 7.231 8.201 7.73
Average IO rate (mb/sec) 9.173 7.397 11.249 9.27
IO rate standard deviation 3.808 1.115 6.539 3.82
Execution time (sec) 40.681 40.126 38.515 39.77
Read
Throughput (mb/sec) 22.912 19.046 25.187 22.38
Average IO rate (mb/sec) 30.131 19.969 40.464 30.19
IO rate standard deviation 16.609 4.422 34.112 18.38
Execution time (sec) 20.441 20.518 19.44 20.13
10 GB
Write
Throughput (mb/sec) 7.409 7.429 7.323 7.39
Average IO rate (mb/sec) 7.837 7.68 7.616 7.71
IO rate standard deviation 1.917 1.43 1.55 1.63
Execution time (sec) 283.681 283.554 288.894 285.38
Read
Throughput (mb/sec) 15.179 16.526 16.663 16.12
Average IO rate (mb/sec) 15.23 17.456 17.96 16.88
IO rate standard deviation 0.899 5.7934 5.753 4.15
Execution time (sec) 148.455 133.574 128.987 137.01
100 GB
Write
Throughput (mb/sec) 6.704 7.621 7.621 7.32
Average IO rate (mb/sec) 6.883 7.557 7.247 7.23
IO rate standard deviation 1.147 1.554 1.512 1.40
Execution time (sec) 2929.379 2666.541 2812.221 2802.71
Read
Throughput (mb/sec) 15.959 15.79 15.79 15.85
Average IO rate (mb/sec) 16.413 15.831 15.831 16.03
IO rate standard deviation 0.717 0.818 0.724 0.75
Execution time (sec) 1316.845 1486.787 1438.554 1414.06
Number of KVM VMs = 4
Dataset Size Operation Criteria Test1 Test2 Test3 Mean
100 MB
Write
Throughput (mb/sec) 4.842 5.13 4.721 4.90
Average IO rate (mb/sec) 8.131 12.451 12.213 10.93
IO rate standard deviation 5.865 13.164 16.688 11.91
Execution time (sec) 14.37 15.692 15.473 15.18
Read
Throughput (mb/sec) 95.419 77.042 96.061 89.51
Average IO rate (mb/sec) 100.145 84.183 97.716 94.01
IO rate standard deviation 21.111 23.277 12.857 19.08
Execution time (sec) 14.387 14.364 15.534 14.76
1 GB
Write
Throughput (mb/sec) 5.825 5.557 5.677 5.69
Average IO rate (mb/sec) 7.556 7.236 7.601 7.46
IO rate standard deviation 5.199 4.868 5.323 5.13
Execution time (sec) 40.198 38.079 40.489 39.59
Read
Throughput (mb/sec) 26.314 33.061 23.697 27.69
Average IO rate (mb/sec) 45.562 52.421 31.314 43.10
IO rate standard deviation 42.962 41.111 18.684 34.25
Execution time (sec) 19.474 15.461 19.475 18.14
10 GB
Write
Throughput (mb/sec) 5.817 5.188 5.182 5.40
Average IO rate (mb/sec) 7.263 6.567 6.457 6.76
IO rate standard deviation 4.535 4.212 3.955 4.23
Execution time (sec) 270.133 296.114 301.458 289.24
Read
Throughput (mb/sec) 14.008 11.722 11.517 12.42
Average IO rate (mb/sec) 15.293 12.759 18.052 15.37
IO rate standard deviation 4.447 3.54 16.12 8.04
Execution time (sec) 118.331 144.184 130.603 131.04
100 GB
Write
Throughput (mb/sec) 5.149 5.339 5.259 5.25
Average IO rate (mb/sec) 6.361 6.886 6.886 6.71
IO rate standard deviation 3.833 4.625 4.78 4.41
Execution time (sec) 2778.663 2780.868 2824.785 2794.77
Read
Throughput (mb/sec) 11.655 11.181 11.269 11.37
Average IO rate (mb/sec) 12.193 12.002 11.724 11.97
IO rate standard deviation 2.891 3.319 2.549 2.92
Execution time (sec) 1369.266 1318.89 1520.755 1402.97
Number of KVM VMs = 5
Dataset Size Operation Criteria Test1 Test2 Test3 Mean
100 MB
Write
Throughput (mb/sec) 5.796 4.807 2.949 4.52
Average IO rate (mb/sec) 6.55 5.447 3.682 5.23
IO rate standard deviation 2.342 2.171 2.446 2.32
Execution time (sec) 14.444 14.696 14.398 14.51
Read
Throughput (mb/sec) 42.481 54.171 54.083 50.25
Average IO rate (mb/sec) 52.311 65.455 63.057 60.27
IO rate standard deviation 21.799 23.039 20.053 21.63
Execution time (sec) 14.39 14.466 14.534 14.46
1 GB
Write
Throughput (mb/sec) 3.962 2.168 2.552 2.89
Average IO rate (mb/sec) 4.422 2.215 2.626 3.09
IO rate standard deviation 1.699 0.375 0.527 0.87
Execution time (sec) 42.716 37.287 37.65 39.22
Read
Throughput (mb/sec) 4.883 7.708 5.135 5.91
Average IO rate (mb/sec) 6.698 9.251 5.884 7.28
IO rate standard deviation 4.412 4.452 2.42 3.76
Execution time (sec) 18.364 17.669 18.061 18.03
10 GB
Write
Throughput (mb/sec) 3.369 3.495 3.421 3.43
Average IO rate (mb/sec) 3.374 3.497 3.294 3.39
IO rate standard deviation 0.123 0.081 0.057 0.09
Execution time (sec) 262.581 287.531 291.531 280.55
Read
Throughput (mb/sec) 8.792 7.17 8.27 8.08
Average IO rate (mb/sec) 8.558 7.3 7.211 7.69
IO rate standard deviation 1.058 0.906 0.906 0.96
Execution time (sec) 128.409 125.356 133.347 129.04
100 GB
Write
Throughput (mb/sec) 5.149 6.847 5.2 5.73
Average IO rate (mb/sec) 6.361 6.121 5.677 6.05
IO rate standard deviation 4.811 5.255 5.75 5.27
Execution time (sec) 2679.211 2850.744 2824.512 2784.82
Read
Throughput (mb/sec) 11.655 11.181 11.269 11.37
Average IO rate (mb/sec) 12.193 12.002 11.724 11.97
IO rate standard deviation 2.891 3.319 2.549 2.92
Execution time (sec) 1475.121 1214.15 1420.575 1369.95
Number of KVM VMs = 6
Dataset Size Operation Criteria Test1 Test2 Test3 Mean
100 MB
Write
Throughput (mb/sec) 5.054 4.318 3.31 4.23
Average IO rate (mb/sec) 6.463 4.769 3.984 5.07
IO rate standard deviation 2.986 2.163 2.446 2.53
Execution time (sec) 14.474 14.44 14.42 14.44
Read
Throughput (mb/sec) 62.035 56.085 24.337 47.49
Average IO rate (mb/sec) 69.145 65.138 62.606 65.63
IO rate standard deviation 22.529 23.831 38.026 28.13
Execution time (sec) 14.468 14.441 15.151 14.69
1 GB
Write
Throughput (mb/sec) 3.089 3.155 2.982 3.08
Average IO rate (mb/sec) 3.369 3.239 3.262 3.29
IO rate standard deviation 1.13 0.595 1.115 0.95
Execution time (sec) 55.472 51.491 57.861 54.94
Read
Throughput (mb/sec) 7.809 7.593 9.488 8.30
Average IO rate (mb/sec) 8.239 20.751 29.577 19.52
IO rate standard deviation 1.988 34.499 36.679 24.39
Execution time (sec) 34.23 34.396 31.08 33.24
10 GB
Write
Throughput (mb/sec) 1.138 0.393 0.862 0.80
Average IO rate (mb/sec) 1.326 0.393 0.875 0.86
IO rate standard deviation 0.105 0.015 0.112 0.08
Execution time (sec) 310.523 372.186 359.615 347.44
Read
Throughput (mb/sec) 0.881 1.437 1.568 1.30
Average IO rate (mb/sec) 3.091 1.639 1.721 2.15
IO rate standard deviation 5.442 0.666 0.645 2.25
Execution time (sec) 144.278 115.98 155.58 138.61
100 GB
Write
Throughput (mb/sec) 2.597 2.898 2.581 2.69
Average IO rate (mb/sec) 2.516 2.606 2.625 2.58
IO rate standard deviation 0.155 0.157 0.21 0.17
Execution time (sec) 4130.984 4322.184 4179.124 4210.76
Read
Throughput (mb/sec) 4.365 4.125 4.335 4.28
Average IO rate (mb/sec) 4.744 4.994 3.951 4.56
IO rate standard deviation 1.235 1.352 1.228 1.27
Execution time (sec) 3115.787 3411.599 3954.8 3494.06
Number of KVM VMs = 7
Dataset Size Operation Criteria Test1 Test2 Test3 Mean
100 MB
Write
Throughput (mb/sec) 2.81 2.419 2.604 2.61
Average IO rate (mb/sec) 3.285 2.562 2.68 2.84
IO rate standard deviation 1.731 0.719 0.477 0.98
Execution time (sec) 16.788 17.535 19.524 17.95
Read
Throughput (mb/sec) 36.311 39.541 40.404 38.75
Average IO rate (mb/sec) 42.668 52.131 52.211 49.00
IO rate standard deviation 16.107 22.127 24.376 20.87
Execution time (sec) 15.563 15.498 15.573 15.54
1 GB
Write
Throughput (mb/sec) 2.027 2.263 2.234 2.17
Average IO rate (mb/sec) 2.072 2.342 2.273 2.23
IO rate standard deviation 0.357 0.484 2.273 1.04
Execution time (sec) 69.088 66.069 70.888 68.68
Read
Throughput (mb/sec) 9.77 24.537 11.26 15.19
Average IO rate (mb/sec) 20.969 28.656 25.211 24.95
IO rate standard deviation 26.933 13.725 23.079 21.25
Execution time (sec) 32.326 38.813 35.21 35.45
10 GB
Write
Throughput (mb/sec) 3.052 3.505 3.727 3.43
Average IO rate (mb/sec) 3.073 3.519 3.239 3.28
IO rate standard deviation 0.267 0.226 0.216 0.24
Execution time (sec) 390.563 318.818 301.551 336.98
Read
Throughput (mb/sec) 7.934 8.232 10.567 8.91
Average IO rate (mb/sec) 18.501 8.33 11.837 12.89
IO rate standard deviation 6.414 0.9 4.159 3.82
Execution time (sec) 167.311 153.892 163.909 161.70
100 GB
Write
Throughput (mb/sec) 2.214 2.632 2.325 2.39
Average IO rate (mb/sec) 2.421 2.412 2.514 2.45
IO rate standard deviation 0.195 0.157 0.21 0.19
Execution time (sec) 8303.277 8990.14 8776.16 8689.86
Read
Throughput (mb/sec) 3.218 3.625 5.024 3.96
Average IO rate (mb/sec) 4.421 5.114 2.125 3.89
IO rate standard deviation 1.521 1.235 1.095 1.28
Execution time (sec) 7820.6253 7573.74 9175.136 8189.84
Number of KVM VMs = 8
Dataset Size Operation Criteria Test1 Test2 Test3 Mean
100 MB
Write
Throughput (mb/sec) 7.45965 2.58245 3.80997 4.62
Average IO rate (mb/sec) 7.00618 3.8086 4.14562 4.99
IO rate standard deviation 2.206 2.365 2.323 2.30
Execution time (sec) 26.14782 43.34817 29.21525 32.90
Read
Throughput (mb/sec) 49.132 46.789 52.311 49.41
Average IO rate (mb/sec) 34.014 43.115 66.94 48.02
IO rate standard deviation 13.245 13.154 14.774 13.72
Execution time (sec) 24.96277 31.83195 31.4689 29.42
1 GB
Write
Throughput (mb/sec) 2.002 2.365 2.004 2.12
Average IO rate (mb/sec) 2.105 2.211 2.106 2.14
IO rate standard deviation 0.311 0.12 2.185 0.87
Execution time (sec) 93.2688 83.90763 150.2826 109.15
Read
Throughput (mb/sec) 9.02 11.417 11.352 10.60
Average IO rate (mb/sec) 20.123 19.296 20.011 19.81
IO rate standard deviation 20.923 18.665 23.001 20.86
Execution time (sec) 38.7912 22.5756 77.462 46.28
10 GB
Write
Throughput (mb/sec) 3.052 3.505 3.727 3.43
Average IO rate (mb/sec) 3.009 3.157 3.562 3.24
IO rate standard deviation 0.213 0.215 0.2 0.21
Execution time (sec) 515.5432 420.8398 729.7534 555.38
Read
Throughput (mb/sec) 7.934 8.232 10.567 8.91
Average IO rate (mb/sec) 18.501 8.33 11.837 12.89
IO rate standard deviation 5.621 6.211 1.529 4.45
Execution time (sec) 830.37 249.305 368.7953 482.82
100 GB
Write
Throughput (mb/sec)
Average IO rate (mb/sec)
IO rate standard deviation
Execution time (sec)
Read
Throughput (mb/sec)
Average IO rate (mb/sec)
IO rate standard deviation
Execution time (sec)
3. Hadoop Virtualized Cluster - VMware ESXi
Number of VMware ESXi VMs = 3
Dataset Size Operation Criteria Test1 Test2 Test3 Mean
100 MB
Write
Throughput (mb/sec) 1.534 4.382 5.396 3.77
Average IO rate (mb/sec) 5.854 6.476 8.613 6.98
IO rate standard deviation 5.995 4.094 4.688 4.93
Execution time (sec) 32.586 39.459 33.961 35.34
Throughput (mb/sec) 15.813 15.489 11.664 14.32
Read
Average IO rate (mb/sec) 36.691 40.070 35.419 37.39
IO rate standard deviation 17.121 18.054 17.575 17.58
Execution time (sec) 27.836 29.189 31.256 29.43
1 GB
Write
Throughput (mb/sec) 2.796 2.843 2.267 2.64
Average IO rate (mb/sec) 3.25 3.036 2.63 2.97
IO rate standard deviation 1.284 0.661 0.925 0.96
Execution time (sec) 98.707 99.748 105.382 101.28
Read
Throughput (mb/sec) 14.873 16.918 15.707 15.83
Average IO rate (mb/sec) 17.528 18.735 17.787 18.02
IO rate standard deviation 5.826 5.519 5.377 5.57
Execution time (sec) 45.231 45.825 44.245 45.10
10 GB
Write
Throughput (mb/sec) 16.154 17.254 16.259 16.56
Average IO rate (mb/sec) 29.400 29.484 28.354 29.08
IO rate standard deviation 0.002 0.003 0.03 0.01
Execution time (sec) 477.380 467.431 467.431 470.75
Read
Throughput (mb/sec) 17.214 16.213 16.254 16.56
Average IO rate (mb/sec) 90.557 87.254 90.264 89.36
IO rate standard deviation 0.0219 0.0211 0.003 0.02
Execution time (sec) 138.808 153.864 162.121 151.60
100 GB
Write
Throughput (mb/sec) 8.215 8.255 6.923 7.80
Average IO rate (mb/sec) 6.874 6.254 7.257 6.80
IO rate standard deviation 0.952 0.961 1.021 0.98
Execution time (sec) 4630.131 4766.423 4621.21 4672.59
Read
Throughput (mb/sec) 12.214 12.214 12.214 12.21
Average IO rate (mb/sec) 15.24 15.24 15.24 15.24
IO rate standard deviation 2.721 2.745 2.847 2.77
Execution time (sec) 1621.001 1569.541 1642.21 1610.92
Number of VMware ESXi VMs = 4
Dataset Size Operation Criteria Test1 Test2 Test3 Mean
100 MB
Write
Throughput (mb/sec) 3.621 3.256 4 3.63
Average IO rate (mb/sec) 6.094 6.509 5.994 6.20
IO rate standard deviation 3.863 4.442 4.69 4.33
Execution time (sec) 32.953 38.188 33.315 34.82
Read
Throughput (mb/sec) 22.207 13.139 11.465 15.60
Average IO rate (mb/sec) 35.199 33.706 30.311 33.07
IO rate standard deviation 11.124 17.761 22.259 17.05
Execution time (sec) 27.896 29.868 29.815 29.19
1 GB
Write
Throughput (mb/sec) 2.652 3.716 4.593 3.65
Average IO rate (mb/sec) 2.808 4.021 5.233 4.02
IO rate standard deviation 0.718 1.157 2.367 1.41
Execution time (sec) 91.877 90.642 82.769 88.43
Read
Throughput (mb/sec) 18.332 19.887 12.121 16.78
Average IO rate (mb/sec) 24.537 37.693 21.528 27.92
IO rate standard deviation 17.041 40.033 20.724 25.93
Execution time (sec) 43.546 43.877 41.49 42.97
10 GB
Write
Throughput (mb/sec) 16.211 16.001 15.251 15.82
Average IO rate (mb/sec) 24.756 29.481 25.328 26.52
IO rate standard deviation 0.004 0.002 0.002 0.00
Execution time (sec) 474.891 457.717 415.126 449.24
Read
Throughput (mb/sec) 13.254 12.354 16.321 13.98
Average IO rate (mb/sec) 22.644 21.14 23.214 22.33
IO rate standard deviation 0.001 0.014 0.003 0.01
Execution time (sec) 151.35 120.893 139.212 137.15
100 GB
Write
Throughput (mb/sec) 4.215 4.101 4.259 4.19
Average IO rate (mb/sec) 6.214 5.214 6.254 5.89
IO rate standard deviation 0.617 1.002 0.658 0.76
Execution time (sec) 4384.964 4514.001 3913.98 4270.98
Read
Throughput (mb/sec) 12.214 13.12 12.542 12.63
Average IO rate (mb/sec) 16.241 15.214 15.24 15.57
IO rate standard deviation 2.125 2.155 2.314 2.20
Execution time (sec) 1573.197 1203.144 1503.98 1426.78
Number of VMware ESXi VMs = 5
Dataset Size Operation Criteria Test1 Test2 Test3 Mean
100 MB
Write
Throughput (mb/sec) 6.787 5.783 5.797 6.12
Average IO rate (mb/sec) 7.487 6.311 6.085 6.63
IO rate standard deviation 2.357 1.683 1.291 1.78
Execution time (sec) 21.832 19.816 22.687 21.45
Read
Throughput (mb/sec) 32.553 30.599 31.699 31.62
Average IO rate (mb/sec) 33.458 33.203 19.672 28.78
IO rate standard deviation 9.214 8.962 9.374 9.18
Execution time (sec) 20.689 17.884 18.865 19.15
1 GB
Write
Throughput (mb/sec) 1.926 2.032 2.497 2.15
Average IO rate (mb/sec) 2.031 2.133 2.648 2.27
IO rate standard deviation 0.526 0.569 0.793 0.63
Execution time (sec) 76.754 79.927 68.664 75.12
Read
Throughput (mb/sec) 18.093 9.005 14.98 14.03
Average IO rate (mb/sec) 23.024 9.845 17.178 16.68
IO rate standard deviation 11.61 3.008 7.886 7.50
Execution time (sec) 26.606 34.991 32.247 31.28
10 GB
Write
Throughput (mb/sec) 3.065 3.131 3.03 3.08
Average IO rate (mb/sec) 3.079 3.143 3.048 3.09
IO rate standard deviation 0.217 0.203 0.238 0.22
Execution time (sec) 421.213 417.535 427.938 422.23
Read
Throughput (mb/sec) 10.624 10.144 10.566 10.44
Average IO rate (mb/sec) 10.701 10.24 10.709 10.50
IO rate standard deviation 0.912 1.023 1.236 1.06
Execution time (sec) 124.088 137.512 131.46 131.02
100 GB
Write
Throughput (mb/sec) 3.202 3.147 3.144 3.16
Average IO rate (mb/sec) 3.298 3.182 3.249 3.24
IO rate standard deviation 0.617 0.335 0.778 0.58
Execution time (sec) 3584.964 3607.653 3595.375 3596.00
Read
Throughput (mb/sec) 11.951 11.709 12.024 11.89
Average IO rate (mb/sec) 12.24 11.877 2.255 8.79
IO rate standard deviation 2.372 2.201 1.667 2.08
Execution time (sec) 1163.197 1085.013 1054.679 1100.96
Number of VMware ESXi VMs = 6
Dataset Size Operation Criteria Test1 Test2 Test3 Mean
100 MB
Write
Throughput (mb/sec) 2.528 4.331 1.161 2.67
Average IO rate (mb/sec) 2.767 4.794 1.924 3.16
IO rate standard deviation 0.803 1.448 1.256 1.17
Execution time (sec) 30.602 23.713 30.648 28.32
Read
Throughput (mb/sec) 18.91 24.085 20.934 21.31
Average IO rate (mb/sec) 23.593 29.105 30.76 27.82
IO rate standard deviation 10.874 12.232 13.055 12.05
Execution time (sec) 20.554 18.235 18.076 18.96
1 GB
Write
Throughput (mb/sec) 3.035 1.663 1.578 2.09
Average IO rate (mb/sec) 4.024 1.683 1.655 2.45
IO rate standard deviation 2.173 0.185 0.389 0.92
Execution time (sec) 58.777 86.469 75.434 73.56
Read
Throughput (mb/sec) 3.201 7.975 8.673 6.62
Average IO rate (mb/sec) 5.086 7.729 9.918 7.58
IO rate standard deviation 4.142 1.629 3.371 3.05
Execution time (sec) 32.292 44.313 44.413 40.34
10 GB
Write
Throughput (mb/sec) 3.132 3.012 2.693 2.95
Average IO rate (mb/sec) 3.163 3.053 2.718 2.98
IO rate standard deviation 0.329 0.366 0.261 0.32
Execution time (sec) 375.74 408.223 462.919 415.63
Read
Throughput (mb/sec) 8.422 9.66 9.21 9.10
Average IO rate (mb/sec) 8.489 9.26 9.32 9.02
IO rate standard deviation 0.774 2.301 1.254 1.44
Execution time (sec) 122.793 125.499 133.985 127.43
100 GB
Write
Throughput (mb/sec) 27.459 26.086 26.888 26.81
Average IO rate (mb/sec) 27.459 26.086 26.888 26.81
IO rate standard deviation 7.555 0.005 0.002 2.52
Execution time (sec) 3669.984 3881.374 3752.234 3767.86
Read
Throughput (mb/sec) 92.256 98.351 96.926 95.84
Average IO rate (mb/sec) 92.256 98.351 96.926 95.84
IO rate standard deviation 0.009 0.0156 0.017 0.01
Execution time (sec) 1106.825 1042.597 1049.995 1066.47
Number of VMware ESXi VMs = 7
Dataset Size Operation Criteria Test1 Test2 Test3 Mean
100 MB
Write
Throughput (mb/sec) 16.815 16.87 22.75 18.81
Average IO rate (mb/sec) 16.835 15.22 22.18 18.08
IO rate standard deviation 0.021 0.003 0.005 0.01
Execution time (sec) 23.757 21.82 20.727 22.10
Read
Throughput (mb/sec) 112.524 137.741 124.069 124.78
Average IO rate (mb/sec) 103.235 135.542 131.096 123.29
IO rate standard deviation 0.019 0.027 0.015 0.02
Execution time (sec) 18.002 16.993 16.189 17.06
1 GB
Write
Throughput (mb/sec) 21.989 22.135 18.241 20.79
Average IO rate (mb/sec) 21.989 22.187 18.215 20.80
IO rate standard deviation 0.003 0.004 1.526 0.51
Execution time (sec) 66.387 69.104 78.357 71.28
Read
Throughput (mb/sec) 66.366 72.849 76.694 71.97
Average IO rate (mb/sec) 66.366 62.325 79.241 69.31
IO rate standard deviation 0.011 0.007 0.012 0.01
Execution time (sec) 44.061 31.754 42.002 39.27
10 GB
Write
Throughput (mb/sec) 25.951 21.215 22.587 23.25
Average IO rate (mb/sec) 23.465 26.124 25.638 25.08
IO rate standard deviation 0.005 0.003 0.005 0.00
Execution time (sec) 412.5 400.975 417.404 410.29
Read
Throughput (mb/sec) 92.125 86.671 80.214 86.34
Average IO rate (mb/sec) 98.851 97.256 92.541 96.22
IO rate standard deviation 0.006 0.004 0.004 0.005
Execution time (sec) 121.16 132.544 132.544 128.75
100 GB
Write
Throughput (mb/sec) 19.274 25.261 23.574 22.70
Average IO rate (mb/sec) 26.332 28.315 27.036 27.23
IO rate standard deviation 0.002 0.005 0.004 0.004
Execution time (sec) 3826.909 3645.909 3727.394 3733.40
Read
Throughput (mb/sec) 45.215 55.547 65.963 55.58
Average IO rate (mb/sec) 95.214 94.686 84.254 91.38
IO rate standard deviation 0.019 0.023 0.014 0.02
Execution time (sec) 1074.606 994.225 980.919 1016.58
Number of VMware ESXi VMs = 8
Dataset Size Operation Criteria Test1 Test2 Test3 Mean
100 MB
Write
Throughput (mb/sec) 6.352 6.214 5.214 5.93
Average IO rate (mb/sec) 8.359 16.072 9.325 11.25
IO rate standard deviation 0.001 0.035 0.001 0.01
Execution time (sec) 42.097 22.322 34.282 32.90
Read
Throughput (mb/sec) 69.215 82.254 68.325 73.26
Average IO rate (mb/sec) 93.721 146.511 95.328 111.85
IO rate standard deviation 0.018 0.02 0.014 0.02
Execution time (sec) 27.748 27.962 26.957 27.56
1 GB
Write
Throughput (mb/sec) 7.652 6.521 6.241 6.80
Average IO rate (mb/sec) 13.067 16.873 17.89 15.94
IO rate standard deviation 0.003 4.231 0.002 1.41
Execution time (sec) 94.415 94.711 79.465 89.53
Read
Throughput (mb/sec) 36.678 60.214 62.124 53.01
Average IO rate (mb/sec) 28.352 99.265 78.019 68.55
IO rate standard deviation 0.003 0.021 0.006 0.01
Execution time (sec) 55.273 30.137 62.601 49.34
10 GB
Write
Throughput (mb/sec) 18.124 19.254 18.625 18.67
Average IO rate (mb/sec) 26.557 24.477 24.955 25.33
IO rate standard deviation 0.004 0.004 0.004 0.00
Execution time (sec) 400.564 438.626 432.917 424.04
Read
Throughput (mb/sec) 89.268 119.348 78.019 95.55
Average IO rate (mb/sec) 69.361 80.541 59.013 69.64
IO rate standard deviation 0.008 0.007 0.007 0.0073
Execution time (sec) 130.975 101.048 171.601 134.54
100 GB
Write
Throughput (mb/sec) 18.214 19.421 19.566 19.07
Average IO rate (mb/sec) 27.006 24.451 25.324 25.59
IO rate standard deviation 0.001 0.002 0.002 0.00
Execution time (sec) 3737.64 4138.379 3981.254 3952.43
Read
Throughput (mb/sec) 93.514 90.291 91.157 91.65
Average IO rate (mb/sec) 68.325 78.245 65.247 70.61
IO rate standard deviation 0.143 0.012 0.102 0.09
Execution time (sec) 1090.37 1130.456 1105.645 1108.82