UPTEC IT 19013
Examensarbete 30 hp
Augusti 2019

A Service for Provisioning Compute Infrastructure in the Cloud

Tony Wang

Institutionen för informationsteknologi
Department of Information Technology



Teknisk-naturvetenskaplig fakultet, UTH-enheten
Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0
Postal address: Box 536, 751 21 Uppsala
Phone: 018 – 471 30 03
Fax: 018 – 471 30 00
Web: http://www.teknat.uu.se/student

Abstract

A Service for Provisioning Compute Infrastructure in the Cloud

Tony Wang

The amount of data has grown tremendously over the last decade. Cloud computing is a solution for handling large-scale computations and immense data sets. However, cloud computing comes with a multitude of challenges that the scientists using the data have to tackle. Provisioning and orchestrating cloud infrastructure is a challenge in itself, given the wide variety of applications and cloud providers that are available. This thesis explores the idea of simplifying the provisioning of compute applications in the cloud. The result of this work is a service which can seamlessly provision and execute cloud computations using different applications and cloud providers.

Printed by: Reprocentralen ITC
UPTEC IT 19013
Examiner: Lars-Åke Nordén
Subject reader: Sverker Holmgren
Supervisor: Salman Toor


Contents

1 Introduction

2 Background
   2.1 Cloud Computing Concepts and Obstacles
   2.2 Scientific Computing
   2.3 HASTE Project
   2.4 Motivation
   2.5 Purpose

3 Related Work

4 System Implementation
   4.1 System Overview
   4.2 Terraform
   4.3 REST Service
   4.4 Message Queue
   4.5 Data Aware Functionality
   4.6 Negotiator Module
      4.6.1 Resource Availability
      4.6.2 Terraform Configuration Generation
      4.6.3 Executing Terraform Scripts
   4.7 Tracing
   4.8 Infrastructure Implementations
      4.8.1 Spark Standalone Cluster
      4.8.2 HarmonicIO Cluster
      4.8.3 Loading Microscopy Images
      4.8.4 Single Container Application
   4.9 Simple Web User Interface

5 Results
   5.1 Spark Standalone Cluster
   5.2 HarmonicIO Cluster
   5.3 Image Loader
   5.4 Running a Trivial Container

6 Discussion & Evaluation
   6.1 Comparison Against Other Methods
      6.1.1 SparkNow
      6.1.2 KubeSpray
      6.1.3 Manual Provisioning
   6.2 Future Development Complexity
   6.3 Tracing
   6.4 Data Aware Function
   6.5 Security Issues
   6.6 Limitations of This Service

7 Future Work

8 Conclusion



1 Introduction

There has been a tremendous growth in data over the past decade. This trend can be observed in almost every field. The Large Hadron Collider experiment at CERN [2] and the Square Kilometre Array project [7] are examples of scientific experiments dealing with data beyond the petascale. This requires efficient, scalable and resilient platforms for the management of large datasets. Furthermore, to continue with the analysis, these large datasets must be made available to the computational resources. Recently, together with cloud infrastructures, a new concept has emerged: Infrastructure as Code (IaC). IaC enables run-time orchestration, contextualization and high availability of resources using programmable interfaces [4]. The concept allows mobility and high availability of customized computational environments. AWS Cloud Foundry, OpenStack HOT and Google AppEngine are platforms aligned with the concept of IaC. However, it is still overwhelming and time-consuming to capitalize on this concept. In order to satisfy researchers and to provide seamless access to customized computational environments for analysis, a level of abstraction is required that hides the platform-specific details and intelligently places the computational environment close to the datasets required for the analysis.

This thesis proposes a software service that aims to support the researchers in the Hierarchical Analysis of Spatial and Temporal Data (HASTE [3]) project in seamlessly running compute applications on different cloud services. The main capabilities of the software are its cloud-agnostic ability, its tracing of the build process of the compute infrastructure, and its ability to be data aware, meaning that it can locate the data resource that is used in the proposed computation.

2 Background

Cloud computing is emerging as a new trend in the ICT sector due to the wide array of services the cloud can provide. Many companies, such as Google and Amazon, offer different kinds of cloud services, for example Google App Engine¹ and Amazon Web Services (AWS)², respectively. Each provider manages its own infrastructure in its own fashion. The cloud providers control large pools of computers and profit from the cloud by renting out user-requested resources. Users are billed either on a subscription basis, paying per month, or on a usage basis, paying depending on the workload of the rented resources. Beyond commercial use, cloud computing is also expanding in scientific research, using platforms such as OpenStack³ to provide computation. However, cloud computing comes with many challenges, tackled both by businesses using the cloud commercially and by scientists looking to the cloud to run scientific computations.

¹ https://cloud.google.com/appengine/
² https://aws.amazon.com/

2.1 Cloud Computing Concepts and Obstacles

The term cloud computing has existed since the 1960s; however, the concept gained popularity in 2006. There is no single agreed definition of the term. The National Institute of Standards and Technology (NIST) [13], however, describes cloud computing as a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal effort.

Generally speaking, the cloud can be divided into four architectural layers. Zhang et al. [24] describe the layers in the following way. The lowest level is the hardware layer, where the bare metal resides: routers, switches, power and cooling systems. Next is the infrastructure layer, which creates a pool of resources configured on the hardware through virtualization technologies. Above the infrastructure layer is the platform layer, where operating systems and application frameworks lie. The final layer is the application layer, where software applications are deployed.

The business model of the cloud can be categorized into different services derived from the architectural layers. NIST defines the services as follows. Infrastructure as a Service (IaaS) provides processing, storage and networks; the user has the ability to deploy and run software, such as operating systems and applications, on the infrastructure. Examples of IaaS providers are Google and Amazon. Platform as a Service (PaaS) allows the user to use the cloud infrastructure through provided tools; the user does not control the underlying networks, operating systems or storage, but only the self-deployed applications. Software as a Service (SaaS) is the highest level of user abstraction, where the user is only capable of accessing the cloud through the provider's interface, commonly in the form of a thin client interface or a web browser.

As mentioned in the introduction, the newly coined concept of Infrastructure as Code (IaC) is on the rise. The principle of IaC is to treat the infrastructure as code, and thereafter to use that code to provision and configure the infrastructure, most importantly for provisioning virtual machines (VMs) in IaaS. The code represents and produces the desired state of the infrastructure without having to walk through manual steps and previous configurations [16]. This concept makes it possible to apply software engineering techniques from programming and software development when building one's infrastructure: a blueprint or state of the infrastructure can be version controlled, shared and re-used. The end purpose of IaC is to improve the quality of one's infrastructure [21].

³ https://www.openstack.org/

Another important concept is the container, which is growing in the cloud computing field; containers are most often used at the application level to replace virtual machines. There are many advantages to using containers: they are more lightweight than VMs, and start time and resource usage are decreased [14]. Docker⁴ is one of the most well-known and widely used tools for containerizing applications. Docker provides services that build and assemble applications. Each Docker container is based on a system image, a static snapshot of a system configuration. A Docker container is run by the client; when a container is ready to be run, Docker looks for the image on the machine or downloads it from a remote registry. Once the image is ready, Docker creates a container, allocates a file system with a read and write layer, and creates a network to interact with the host machine [19]. A main principle of using Docker containers is to avoid conflicting dependencies; for example, if two websites need to run two different versions of a framework, each version can be installed in a separate container. Also, all dependencies are bound to a container, which means that the need to re-install dependencies disappears if the application is to be re-deployed. Furthermore, Docker containers are largely platform independent; the only requirement is that the operating system runs Docker [14].

As a consequence of there being multiple providers, many individuals who use cloud services face the issue of adapting to each and every cloud provider. One of the main obstacles is vendor lock-in [11], meaning that the cost of changing vendor is too high to justify the change, which leads to being locked into one vendor. The lack of standards makes it more difficult to manage interfaces to different cloud vendors. Multiple recent works have tackled the problem of vendor lock-in by developing APIs that interface to various types of cloud. Developing standards could be a good solution; however, the larger cloud vendors who lead the cloud business do not seem to agree on proposed standards.

⁴ https://www.docker.com/


2.2 Scientific Computing

Scientific computing is a research field that uses computer science to solve problems. The research is often related to large-scale computing and simulation, which requires large amounts of computer resources. Recently, scientific computing has progressively required more and more computing power to cope with the immense amount of data that is generated. The amount of data produced by massive-scale simulations, sensor deployments, high-throughput lab equipment and so on has increased in recent years; already in 2012, it was predicted that the amount of data generated would pass 7 trillion gigabytes [17]. When the amount of data used for computing exceeds the power of an individual computer, distributed computing systems are used to counteract the problem. Cloud computing proposes an alternative solution which can be specifically beneficial for scientific computations: researchers can take advantage of the potentially lower cost of running cloud computations by reducing administration costs and exploiting flexible cloud scalability. Cloud computing also gives researchers located in different areas an opportunity to ease the collaboration process. Compare this to running computations on personal computers or campus-exclusive resources, where there may be limited resources, security issues and difficulties in sharing data.

In the large-scale scientific computing field, one of the most popular frameworks is Apache Spark⁵ (Spark) [22]. Spark is one of the largest open source projects for unified programming and big data. Spark has a programming model built around Resilient Distributed Datasets (RDDs), which can support a wide range of processing techniques including SQL, machine learning and graph processing. The key point of RDDs is that they are datasets partitioned across a compute cluster, over which functions can be run in parallel. This of course requires that the user has access to cloud infrastructure. Users operate on RDDs by applying specific functions, for example map, filter and groupBy. The main speedup of Spark comes from its data sharing capabilities: instead of storing its data on disk, it stores the data in memory to allow faster sharing. Spark was developed as a tool for users, including scientists.

⁵ https://spark.apache.org/
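The RDD operations named above (map, filter, groupBy) can be illustrated with a short sketch. Since running an actual Spark cluster is out of scope here, the following uses plain Python built-ins as a stand-in for the PySpark RDD API; the dataset and threshold are invented for the example. In real Spark, the same chain would be expressed against an RDD, e.g. sc.parallelize(data).map(...).filter(...).

```python
from collections import defaultdict

# Toy dataset: (sample_id, measurement) pairs, standing in for a partitioned RDD.
data = [(1, 4.0), (2, 9.5), (1, 6.5), (3, 2.0), (2, 0.5)]

# map: transform each element (here, square the measurement).
mapped = [(k, v * v) for k, v in data]

# filter: keep only measurements above a threshold.
filtered = [(k, v) for k, v in mapped if v > 10.0]

# groupBy key: collect values per sample, as Spark's groupByKey would.
grouped = defaultdict(list)
for k, v in filtered:
    grouped[k].append(v)

print(dict(grouped))  # {1: [16.0, 42.25], 2: [90.25]}
```

In Spark these transformations are lazy and executed in parallel across the cluster's partitions; the local version above only mimics the dataflow style.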

2.3 HASTE Project

Part of the work of this thesis is to assist the HASTE project, whose aim is to intelligently process and manage microscopy image data. The HASTE project is funded by the Swedish Foundation for Strategic Research [1]. Its main objectives are to discover interestingness in image data using data mining and machine learning techniques, and to develop intelligent and efficient cloud systems. To find the interestingness of a microscopic image, different machine learning techniques are used, with the processing done in the cloud. The scientists in the HASTE project have different academic backgrounds, and the whole project consists of smaller, related projects, several of which use the cloud to store data and execute computations.

SNIC Science Cloud⁶ is a community cloud run by the Swedish National Infrastructure for Computing, with the purpose of providing large-scale computing and storage for research in Sweden. SNIC mainly provides IaaS along with higher-level PaaS. The HASTE project runs its computations on the SNIC Science Cloud, and the work in this thesis exclusively uses the SNIC Science Cloud to provision infrastructure for the users.

2.4 Motivation

In order to demonstrate the demand for this service, a motivational example is presented. As of now, researchers in the HASTE project, and potentially other individuals who have to run scientific cloud computations, have their own procedures for provisioning and clustering their own infrastructure, using self-written scripts and different command line and graphical interfaces. This can be rather time consuming, and a researcher would perhaps rather spend time doing actual research than provisioning infrastructure. The problem of vendor lock-in comes in here, since scripts and interfaces may vary a lot depending on the cloud vendor: a scientist may have successfully executed their experiments on one cloud infrastructure, but when the need to change cloud provider arises, the whole process has to be repeated. Another important factor to consider is the physical placement of the data; a scientist may have to manually look up the metadata inside a potentially hidden or difficult-to-access file. Yet another difficulty in cloud orchestration is the error-prone nature of the long provisioning process: various errors can arise from the orchestration process that are difficult to find and debug. An example scenario of a HASTE researcher who would like to run a machine learning computation can play out in the following fashion. The researcher has to locate the credentials and other metadata regarding the cloud provider before starting a machine. The next step is to provision the requested data to the machine, and to execute the code the researcher has to install all the required packages. This is a lengthy process that is preferable to avoid repeating. Another example is the orchestration of multiple instances: if a researcher wants to run a compute cluster, the researcher has to start multiple machines and then, with much effort, connect them together into a cluster. Doing this can arguably be even more time consuming than the previous example.

⁶ https://cloud.snic.se/


2.5 Purpose

The purpose of this thesis is to support the researchers in the HASTE project by building a general software service for automatic provisioning of cloud infrastructure with intelligent data-aware aspects. The data-aware aspects come from supplying pre-provisioned metadata to the service, allowing users to skip setting cloud metadata variables. The main purpose is to provide options for the HASTE researchers to seamlessly run HASTE-relevant software on the SNIC Science Cloud through the service, simplifying the provisioning process compared to running HASTE cloud projects through manual provisioning. A general use case of provisioning a Spark compute cluster, and a case where a container application is run, are also provided to exemplify processes from which a non-HASTE scientist can benefit. The researchers should have the ability to provision compute infrastructure through easily accessible command line and graphical interfaces. The service includes the potential to provision not only to the SNIC Science Cloud but also to other OpenStack cloud projects and other, non-OpenStack cloud providers. Additionally, a tracing mechanism is implemented to provide transparency and feedback on the underlying orchestration process, giving the user more insight into any potential errors during the process. Furthermore, a conceptual web interface is created for the purpose of granting the user a simple graphical interface for creating their infrastructure.

3 Related Work

There exist several other cloud computing frameworks for the purpose of abstracting the cloud orchestration layer and counteracting the problem of vendor lock-in, where it is too burdensome and difficult to deploy applications on different cloud providers while keeping important aspects such as security and quality of service consistent. A few have used model-driven design as their main method for designing the framework. Model-driven design is not the focus of this work; however, it is an interesting take on development that one can draw inspiration from. Other frameworks and software applications have also been developed and published as open source to help the developer community deploy infrastructure.

Specifically, Ardagna et al. [8] present the idea of cloud deployment with MODACLOUDS, which uses model-driven design to provide a framework and IDE for developing and deploying applications on multiple cloud providers. MODACLOUDS is supposed to offer a run-time environment for observing the system during execution, allowing developers to proactively determine the performance of the system. Their ambition is to run MODACLOUDS as a platform for deployment, development, monitoring and adaptation of applications in the cloud.

Chen et al. present MORE [10], a framework that uses model-driven design to ease the challenges of deployment and configuration of a system. MORE provides the user with a tool to model the topology of a system structure without demanding much domain knowledge. The model is then transformed into executable code that abstracts the orchestration of the system, and the user eventually gains access to the cloud infrastructure.

Tools not built with model-driven design also exist. For example, Sandobalin, Insfran & Abrahao present an infrastructure modelling tool for cloud provisioning called ARGON [15]. The tool is intended to solve the management of Infrastructure as Code (IaC); their goal is to take the DevOps concept and apply it to IaC. Through a domain-specific language, ARGON reduces the workload for operations personnel. With ARGON, developers have the opportunity to version control and manage their infrastructure without needing to consider the interoperability of different cloud providers.

To further investigate cloud interoperability and approaches to avoiding vendor lock-in, Repschlaeger, Wind, Zarnekow & Turowski [23] implemented a classification framework for comparing different cloud providers. Their purpose was to help e-governments with the problem of selecting an appropriate cloud vendor with regard to prices, security and other important features. Their method of development was to investigate through literature surveys and expert interviews.

Furthermore, Capuccini, Larsson, Toor & Spjuth developed KubeNow [9], a framework for rapid deployment of cloud infrastructure using the concept of IaC through the Kubernetes framework. The goal of KubeNow is to deliver cloud infrastructure for on-demand scientific applications. Specifically, KubeNow offers deployment on Amazon Web Services, OpenStack and Google Compute Engine.

Additionally, there are other, more well-known frameworks that bring the benefits of IaC. Some examples include Ansible⁷, Puppet⁸, AWS OpsWorks⁹ (which uses Chef and Puppet) and Terraform¹⁰.

Unruh, Bardas, Zhuang, Ou & DeLoach present ANCOR [20], a prototype of a system built from their specification. The specification is designed to separate user requirements from the underlying infrastructure and to be cloud agnostic. ANCOR uses Puppet as its configuration management tool; however, it supports other configuration management tools such as Chef, SaltStack, bcfg2 and CFEngine. ANCOR mainly targets OpenStack, though there is a possibility of using AWS as well. ANCOR was developed with a domain-specific language based on YAML. Their conclusions show that ANCOR can improve manageability and maintainability and enable dynamic cloud configuration under deployment without performance loss.

⁷ https://www.ansible.com/
⁸ https://puppet.com/
⁹ https://aws.amazon.com/opsworks/
¹⁰ https://terraform.io

SparkNow¹¹ is provisioning software that focuses on rapid deployment and teardown of clusters on OpenStack. It simplifies the provisioning process by providing pre-written provisioning scripts; through user arguments it can provision the requested infrastructure without requiring the user to learn the orchestration process. KubeSpray¹² is similar to SparkNow in the sense that it simplifies infrastructure provisioning; however, its focus is on rapid deployment of Kubernetes¹³ clusters, instead of Spark clusters, on OpenStack and AWS clouds.

¹¹ https://github.com/mcapuccini/SparkNow
¹² https://github.com/kubernetes-incubator/kubespray
¹³ https://kubernetes.io/

The related works mentioned in this section focus on developing standalone tools and different domain-specific languages for creating infrastructure. They put a lot of effort into deploying cloud applications with ease through their tools, reducing the complexity of creating cloud infrastructure. The IaC concept is again explored and used efficiently to provision infrastructure. Cloud vendor lock-in is discussed as well, concerning the ability to deploy applications on different providers, which is important for users. This thesis proposes the ability for a user to request cloud infrastructure with less domain knowledge, options for choosing which provider to deploy infrastructure on, and monitoring of the provisioning process. Another proposal is to explore further infrastructure abstraction requiring even less knowledge, adding another abstraction layer over existing software and using data-aware aspects that take advantage of metadata to pre-provision the orchestration service. Furthermore, this work implements a tracer for the orchestration process to track the orchestration flow.

4 System Implementation

To develop the provisioning software, numerous technologies were used. The service is split into modular parts which communicate with each other. The user communicates with a server through interfaces, which in turn communicates with another module that uses external frameworks to provision infrastructure. Overall, the system can be seen as a client-server application. The whole system is also traced, using external libraries that are integrated throughout the system.

4.1 System Overview

To start off, there is a conceptual graphical user interface built as a web interface using the common scripting languages HTML, CSS and JavaScript. Moreover, the React¹⁴ library is used as the main library for writing and structuring the interface. Using React, the business logic and markup are split into components, which allows for more flexibility and re-usability.

The service which handles the requests and provisions the infrastructure is called the negotiator. The REST service lies between the user and the negotiator; it is written in Python 2.7 with the Flask¹⁵ library, and its functionality is exposed as a Representational State Transfer (REST) service that can be called for communication. The middleman, or broker, between the REST server and the negotiator is RabbitMQ¹⁶, a message broker that takes the requests from the client and sends them to the negotiator. The negotiator is designed so that calls to new cloud providers can be integrated into the module by constructing a new class for each provider. Applying a REST service grants the system an interface between the user and the negotiator, which makes it possible to seamlessly alter the communication with the negotiator. By defining the REST endpoints, the module can consistently accept the expected arguments for creating infrastructure.
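The per-provider class design described above can be sketched as follows. This is a minimal illustration, not the thesis's actual code: the names (CloudProvider, negotiate, the provider keys) are invented for the example, and provision merely returns an identifier where the real negotiator would drive Terraform.

```python
from abc import ABC, abstractmethod

class CloudProvider(ABC):
    """Base class for provider integrations; one subclass per cloud."""

    @abstractmethod
    def provision(self, spec: dict) -> str:
        """Provision the infrastructure described by `spec`; return an identifier."""

class OpenStackProvider(CloudProvider):
    def provision(self, spec: dict) -> str:
        # Real code would generate and apply a Terraform configuration here.
        return "openstack:{}".format(spec["name"])

class AWSProvider(CloudProvider):
    def provision(self, spec: dict) -> str:
        return "aws:{}".format(spec["name"])

# The negotiator picks a provider class based on the user's request;
# supporting a new cloud means registering one new subclass.
PROVIDERS = {"openstack": OpenStackProvider, "aws": AWSProvider}

def negotiate(request: dict) -> str:
    provider = PROVIDERS[request["provider"]]()
    return provider.provision(request["spec"])

print(negotiate({"provider": "openstack", "spec": {"name": "spark-cluster"}}))
# openstack:spark-cluster
```

The design choice here is the one the text describes: the REST layer and broker never need to change when a provider is added, only the provider registry.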

SNIC Science Cloud implements OpenStack, a platform for orchestrating and provisioning infrastructure. This project uses Terraform in conjunction with OpenStack to provision the compute infrastructure on the SNIC Science Cloud, where Terraform serves as the framework providing IaC. Tracing is performed with OpenTracing¹⁷, an open source tracing library that is available in multiple languages.

The general step-by-step process of the system, from the perspective of a user requesting infrastructure, can be described as follows:

(a) The user creates a POST request to the REST service from any interface, which can be a web interface or a command line interface.

¹⁴ https://reactjs.org/
¹⁵ http://flask.pocoo.org/
¹⁶ https://www.rabbitmq.com/
¹⁷ https://opentracing.io/

Figure 1: A high level overview of the system.

(b) The request arrives at the REST server, which forwards it to the message broker and returns the web URL of the tracing interface.

(c) The message broker receives the request and puts it, now a message, in the queue for consumption.

(d) The consumer forwards the request to the negotiator module, which handles the request and provisions the infrastructure.

(e) After orchestration, the user is sent feedback regarding the infrastructure.

(f) The process can be traced during and after each request.
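The steps above can be sketched in miniature. The following is an illustrative in-process mock-up, making no claims about the thesis's actual code: Python's stdlib queue.Queue stands in for RabbitMQ, the function names are invented, and the negotiator is a stub where Terraform would really run.

```python
import json
import queue

broker = queue.Queue()  # stand-in for the RabbitMQ message queue

def rest_endpoint(body: str) -> str:
    """Steps (a)-(b): accept a POST body, enqueue it, return a tracing URL."""
    request = json.loads(body)
    broker.put(request)                       # forward to the message broker
    return "/trace/{}".format(request["id"])  # URL of the tracing interface

def negotiator(request: dict) -> str:
    """Stub for steps (d)-(e): would generate and execute Terraform here."""
    return "provisioned cluster '{}'".format(request["name"])

def consume_one() -> str:
    """Step (c): pop a message from the queue and hand it to the negotiator."""
    request = broker.get()
    return negotiator(request)

trace_url = rest_endpoint('{"id": 7, "name": "spark-cluster"}')
print(trace_url)      # /trace/7
print(consume_one())  # provisioned cluster 'spark-cluster'
```

Decoupling the REST server from the negotiator through a queue, as in the real system, lets slow provisioning work proceed asynchronously while the server immediately returns the tracing URL.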

A high-level overview of the system is shown in Figure 1, to give a better abstract understanding of how the system communicates. It shows the user, who interacts with the system through a REST service implemented as a Flask server. The request is forwarded to the negotiator, which then, depending on the request, provisions infrastructure in the cloud provider requested by the user.

The system from a user or a scientist perspective can be seen in Figure 2. The user may interact with the system by requesting or deleting infrastructure. The user may also access the trace of the requested process.
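The POST request in step (a) carries a JSON body describing the desired infrastructure. As an illustration, a client could build such a request as sketched below; the helper function, the endpoint path and the parameter names other than provider are assumptions for illustration, not the service's actual API.

```python
import json

def build_provision_request(provider, **params):
    """Build the JSON body for a provisioning POST request.

    The only mandatory field is 'provider'; any extra keyword
    arguments become provider-specific parameters.
    """
    body = {"provider": provider}
    body.update(params)
    return body

# Hypothetical parameters for an OpenStack request.
payload = build_provision_request(
    "openstack", worker_count=2, flavor_name="small")
data = json.dumps(payload)
# The serialized body would then be POSTed to the REST service,
# e.g. (assumed path): POST http://<service>/infrastructure
```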


Figure 2: User perspective of the system


4.2 Terraform

Terraform is a tool that applies IaC for provisioning infrastructure. Terraform can be used to build, change and version control cloud infrastructure. The state of the cloud infrastructure is described in Terraform configuration files written by the users; after successfully executing the configuration using the Terraform binary, the infrastructure requested in the configuration file is provisioned. The main motivation for using Terraform is the simplicity of changing and adding new infrastructure for different providers, together with the power of IaC to dynamically provision infrastructure. Furthermore, using Terraform may avoid the problem of vendor lock-in because of the multitude of providers Terraform supports. Software such as Heat18 works similarly, but only targets one platform (OpenStack), while Terraform can perform the same tasks for multiple providers; for example, it can orchestrate an AWS and an OpenStack cluster at the same time. Terraform is cloud agnostic in the sense that the same software can be used with various providers. One may think that a single configuration can be used across different providers, but that is not the case: to create a configuration that provisions an equivalent copy of an infrastructure on two different providers, one has to write two different configurations, although some parts, such as variables, can be shared. Still, it is simple to change provider, since the syntax, functions and thought process for writing the code stay the same. The configuration files are written in HCL (HashiCorp Configuration Language)19, a configuration language built by HashiCorp20, the founders of Terraform. The same language is used for all the providers that Terraform supports. HCL can also be used in conjunction with the JSON format to allow for more flexibility.
An example configuration can be seen in Listing 1, which shows a configuration for an AWS cloud. When executed, Terraform creates one instance of type t2.micro in the us-east-1 region using the user's access and secret key. The provider block determines the provider and the resource block describes which resources are provisioned. Additionally, Terraform can provision more than just compute instances: storage, networking, DNS entries, SaaS features and much more.

Listing 1:
provider "aws" {
  access_key = "ACCESS_KEY_HERE"
  secret_key = "SECRET_KEY_HERE"
  region     = "us-east-1"
}

resource "aws_instance" "example" {
  ami           = "ami-2757f631"
  instance_type = "t2.micro"
  ...
  #Additional blocks
}

18 https://docs.openstack.org/heat/latest/
19 https://github.com/hashicorp/hcl
20 https://www.hashicorp.com/

This work uses Terraform's OpenStack provider to provision the OpenStack based infrastructure that is used by SNIC Science Cloud. The basic configuration for Terraform's OpenStack provider consists of a provider block, similar to the AWS example above, which determines the OpenStack provider, and resource blocks that describe the provisioned resources. Below is an example of an OpenStack configuration where a single instance is created under a specific user. Additional connection variables auth_url, tenant_name, tenant_id and user_domain_name are given to connect to the specific cloud. The single instance is created using parameters that specify the image name, the flavor, the key pair and the security groups. In this example, variables are used as input parameters instead of static strings: each parameter references a variable that stores the argument. Using this method, variables can be set from exterior methods, for example from the command line interface, environment variables or external files.

Listing 2:
provider "openstack" {
  user_name        = "${var.user_name}"
  password         = "${var.password}"
  tenant_id        = "${var.tenant_id}"
  tenant_name      = "${var.project_name}"
  auth_url         = "${var.auth_url}"
  user_domain_name = "${var.user_domain_name}"
}

resource "openstack_compute_instance_v2" "example" {
  name            = "example"
  image_name      = "${var.image_name}"
  flavor_name     = "${var.flavor_name}"
  key_pair        = "${var.key_pair_name}"
  security_groups = ["default"]
  count           = 1
  ...
  #Additional instance variables
}
...
#Additional blocks

4.3 REST Service

A REST service is an architectural design pattern for machine to machine communication [12]. Applying a REST architecture enforces the separation of concerns principle, that is, the separation of the user interface from the system back-end. The result is that the portability and scalability of the system are improved, and the REST service and the rest of the system can be developed independently. A REST service requires that the client makes requests to the service. A request contains an HTTP verb, which defines the operation to perform, and a header containing the data to pass to an endpoint. The four basic verbs are POST, PUT, DELETE and GET. The negotiator REST service has two callable endpoints: POST and DELETE.

Using the POST request, the endpoint accepts the user arguments for provisioning the infrastructure. The DELETE endpoint is then used to delete existing infrastructure. The endpoints themselves use the functions of the negotiator module when called upon. This allows a flexible REST implementation where changes, such as new endpoints, can be made to the REST service without affecting the negotiator module. The REST service expects the data in the header to be in JSON format and replies with data in JSON format. The JSON format is human readable, simple to use and supported by most languages for easier integration. The REST server must be asynchronous, otherwise the user has to make a request and then wait for the result, considering that provisioning a cluster may take several minutes. To solve this problem, the user is returned an id of the request. The id is bound to the request, and any future calls on the requested infrastructure are made in conjunction with the id.
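The asynchronous pattern described above can be sketched as a framework-agnostic handler: validate, generate an id, hand off to the queue and return immediately. The function names, status codes and payload shape below are illustrative assumptions, not the thesis code.

```python
import uuid

def handle_post(request_json, enqueue):
    """Validate a provisioning request and return (status, body).

    'enqueue' is whatever callable forwards the message to the
    broker; the handler never waits for provisioning to finish.
    """
    if not isinstance(request_json, dict) or "provider" not in request_json:
        return 400, {"error": "field 'provider' is required"}
    request_id = str(uuid.uuid4())               # id bound to this request
    enqueue({"id": request_id, **request_json})  # hand off, don't block
    return 202, {"id": request_id}               # future calls use this id

queued = []
status, body = handle_post({"provider": "openstack"}, queued.append)
```

A DELETE handler would follow the same shape, looking up the queued request by its id.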

Calling the REST server with a valid JSON object will eventually trigger the negotiator module. However, for everything to execute without errors, the JSON request must include valid input arguments. There is only one main requirement for the REST service: a JSON object with the field provider. Listing 3 describes the minimum requirement for a valid REST call. The provider field describes which provider the negotiator module is supposed to call to continue the process. Additional parameters are individually unique depending on the implementation of the infrastructure configuration.


Listing 3:
{
  provider: 'some_provider (openstack, aws, google etc.)'
}

4.4 Message Queue

The system uses RabbitMQ, which implements the Advanced Message Queuing Protocol (AMQP)21. The message queue is placed between the REST service and the negotiator. The idea behind using a message queue is to avoid having long running and resource intensive tasks block the service; the tasks are instead scheduled to be executed when ready. To summarize the process, tasks are turned into messages and put in a queue until they are ready to be executed. RabbitMQ itself is the broker, which receives and delivers messages. A queue lives inside RabbitMQ; producers are programs that send messages, which the broker stores in its queue, and a consumer program consumes the messages in the queue to handle the producers' messages. Figure 3 depicts the workflow of the message queue. The producer, which in this work is the REST service, puts new messages (requests from the users) into the queue. The consumer side of the system is then ready to execute the requests from the queue.

The benefit of a message broker is that it can accept messages and thus reduce the load on the other programs, such as the REST service. Consider the fact that the provisioning process takes several minutes. A synchronously implemented service would be on hold for the whole process and therefore lock other clients from connecting. The message queue avoids this problem: a request to the service can be enqueued and the requesting process freed for something else. Another important benefit of message queues is modularity; the queue is developed to be separate from the rest of the system, can be written in any language, and can be started and run separately from the REST server and the negotiator [5].

The system's message queue is the middleman between the REST service and the negotiator. The REST service sends the parsed POST or DELETE request from the user as a message to the broker, which stores the message in the queue and waits for it to be consumed. After consuming the message, the receiving part of the message queue calls the negotiator to start the requested provisioning process.
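The producer/consumer decoupling in Figure 3 can be illustrated in miniature with the standard library. The real system uses RabbitMQ and a separate consumer process; this in-process sketch only demonstrates the pattern, and the message contents are invented for illustration.

```python
import queue
import threading

task_queue = queue.Queue()
results = []

def consumer():
    # Consume messages until the producer signals shutdown with None.
    while True:
        message = task_queue.get()
        if message is None:
            break
        # Stand-in for calling the negotiator with the request.
        results.append(f"provisioned for {message['provider']}")
        task_queue.task_done()

worker = threading.Thread(target=consumer)
worker.start()

# The producer (the REST service in this work) enqueues requests
# and returns immediately instead of waiting for provisioning.
task_queue.put({"provider": "openstack"})
task_queue.put({"provider": "aws"})
task_queue.put(None)
worker.join()
```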

21 https://www.amqp.org/


Figure 3: Producers add tasks to the queue, which consumers consume [6]

4.5 Data Aware Functionality

The data aware aspect is one of the main characteristics of the negotiator. The purpose is to pre-store metadata regarding the cloud provider to avoid having the user configure infrastructure metadata arguments. During run-time, the metadata is fetched from a metadata store which has pre-stored values from the user or another user. The metadata store uses key-value based data storage where the key is the name of the data and the value contains the relevant metadata that the negotiator module needs to locate the data. Since different providers require different metadata, the metadata is stored under a provider key, which could be aws or openstack.

Listing 4 shows an example of metadata for an OpenStack provider. The variables are required to start an instance on an OpenStack cloud. These variables are tedious, mostly kept the same and rarely changed. The external_network_id and tenant_id are, for example, two variables that a user most probably does not want the responsibility to control. By pre-storing these variables, the users of this cloud do not have to keep track of them, which reduces the number of input parameters on the user side. When a metadata parameter changes, a user has to change it manually, but this also has a positive effect for multiple users of the system: one change means that the other users do not have to change the same variable. Compare this to the variables being stored locally on each user's machine, where one change means that all the users of the system have to change that variable.
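A minimal sketch of the data aware lookup follows; the store layout mirrors Listing 4, while the merge rule (explicit user arguments overriding pre-stored metadata) and the function name are assumptions made for illustration.

```python
# Pre-stored metadata, keyed by provider and then by data name
# (layout follows Listing 4; values shortened for the example).
METADATA_STORE = {
    "openstack": {
        "example_data": {
            "auth_url": "https://cloud.se:5000/v3",
            "region": "Region One",
            "image_name": "Ubuntu",
        }
    }
}

def resolve_arguments(provider, data_name, user_args):
    """Merge pre-stored metadata with the user's request arguments."""
    stored = METADATA_STORE.get(provider, {}).get(data_name, {})
    merged = dict(stored)      # start from the pre-stored values
    merged.update(user_args)   # explicit user input takes precedence
    return merged

args = resolve_arguments("openstack", "example_data",
                         {"flavor_name": "small"})
```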

Listing 4:
"openstack": {
  "example_data": {
    "external_network_id": "b8eigkt4-w0g84-bkeog-93833-shb029biskv",
    "floating_ip_pool": "Network pool",
    "image_name": "Ubuntu",
    "auth_url": "https://cloud.se:5000/v3",
    "user_domain_name": "cloud",
    "region": "Region One",
    "tenant_id": "r2039rsovbobsaboeeubocacce",
    "project_name": "tenant"
  }
}

4.6 Negotiator Module

The negotiator module accepts the user arguments from the message broker, which received the request from the user via the REST service, and then provisions the requested infrastructure according to the arguments. To start, the module finds the provider argument mentioned in section 4.3. By looking at the provider argument, the module can find the implementation corresponding to the provider. Using the provider value, the provisioning can begin.

4.6.1 Resource Availability

The first part of provisioning is pre-determining whether the provisioning is possible with regard to the available resources. The meaning of available resources depends on the provider and the implementation. For cost based providers, which have virtually unlimited computation, the module can check whether there is enough balance to provision the resources, while for non cost based providers the general determining factor is the amount of computing that is available. By pre-determining the available resources, the negotiator can detect whether the process would otherwise be stopped later by an insufficient resources error.

Using the provider value, the negotiator finds the file that corresponds to the provider. The file must contain a function check_resources(resources) that determines if the resources are available. This is similar to how interfaces are built in object oriented design: each file must implement the check_resources(resources) function. As an example, if the user requests resources with OpenStack as the provider value, the module will look for the file called OpenStack. This step can be skipped if there is no implementation of the check_resources(resources) function or if the file does not exist.

The end result after determining the resources is a boolean value: the negotiator exits the process if the resources are not enough, or continues if there are resources available.
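The interface-like convention above can be sketched as follows: the negotiator looks up the provider implementation and calls check_resources if it exists, otherwise the check is skipped. The registry dictionary stands in for the module lookup on disk, and the class names and the hard-coded core count are assumptions of this sketch.

```python
class OpenStackProvider:
    # Example implementation of the per-provider resource check.
    def check_resources(self, resources):
        available_cores = 16  # would be queried from the cloud
        return resources.get("cores", 0) <= available_cores

class BareProvider:
    # No check_resources: the availability step is skipped.
    pass

# Stand-in for locating the provider file by its name.
PROVIDERS = {"openstack": OpenStackProvider(), "bare": BareProvider()}

def resources_available(provider_name, resources):
    """Return False only if an implemented check explicitly fails."""
    provider = PROVIDERS.get(provider_name)
    if provider is None or not hasattr(provider, "check_resources"):
        return True  # no implementation: skip the check
    return provider.check_resources(resources)
```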

4.6.2 Terraform Configuration Generation

The second step of the provisioning process is to generate a Terraform configuration file that represents the infrastructure the user requested. Similar to the first step where resources are checked, the module looks for the folder and file which both have the same name as the provider and calls a function in the file. The requirement for the implementation is that the file must be under a folder with the same name as the provider and the file must have a function called orchestrate_resources(request) that accepts the request as a parameter. The function must return a valid Terraform JSON configuration. The Terraform configuration file is generated programmatically depending on the user's request; different implementations of specific infrastructure use the user input differently.

One of the core powers of the module is to use metadata that is bound to certain data blobs, that is, the data awareness function. The metadata is collected from the name of the data that the user has requested, in the cases where the user passes the name of the data. Using the potential metadata and user data, a Python dictionary that corresponds to a valid Terraform JSON configuration is generated and returned to the module, which later converts it into a Terraform JSON file. The Terraform configuration implementations are described in section 4.8.
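Generating the configuration amounts to building a nested Python dictionary mirroring Terraform's JSON syntax and serializing it. The sketch below is a minimal illustrative orchestrate_resources; the resource values are assumptions, and a real implementation would fill them from the request and the metadata store.

```python
import json

def orchestrate_resources(request):
    """Return a Terraform-JSON-style dict for a single instance."""
    return {
        "provider": {"openstack": {"auth_url": request["auth_url"]}},
        "resource": {
            "openstack_compute_instance_v2": {
                "example": {
                    "name": "example",
                    "flavor_name": request["flavor_name"],
                    "count": 1,
                }
            }
        },
    }

config = orchestrate_resources(
    {"auth_url": "https://cloud.se:5000/v3", "flavor_name": "small"})
config_json = json.dumps(config, indent=2)  # later written to a .tf.json file
```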

4.6.3 Executing Terraform Scripts

The previous steps create the Terraform configuration file, while the last step is to execute it. The command terraform apply is used to execute a Terraform configuration to provision the infrastructure. However, the command has to be executed in the same location as the configuration files. Since each configuration is id specific, the module moves the files for each configuration into a folder with the corresponding id; any files under the provider folder are moved into the folder with the corresponding id. The terraform apply command is then executed to start the provisioning. When the execution is finished, the infrastructure is created and the user may be notified by different means.
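The move-then-apply step can be sketched with the standard library. The command invocation is injectable so the sketch is testable without Terraform installed; the folder layout and the -auto-approve flag are assumptions of this sketch, and the files are copied rather than moved to keep the example side-effect free.

```python
import shutil
import subprocess
from pathlib import Path

def apply_configuration(request_id, provider_dir, work_root,
                        run=subprocess.run):
    """Copy a provider's files into an id-specific folder and apply."""
    target = Path(work_root) / str(request_id)
    target.mkdir(parents=True, exist_ok=True)
    for f in Path(provider_dir).iterdir():   # gather configuration files
        if f.is_file():
            shutil.copy(f, target / f.name)
    # terraform must run where the configuration files live.
    return run(["terraform", "apply", "-auto-approve"], cwd=str(target))
```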


4.7 Tracing

Considering that there can be a multitude of errors in the provisioning process, everything from name errors to network errors, debugging and finding errors is a time consuming process. Integrating a tracing system can assist in locating the errors in the process. This work implements a tracer to trace the provisioning process from top to bottom. Each unique request is traced, starting from the user request until the process is complete in the cloud. Tracing through the REST server and the negotiator is the same, however the tracing is implemented differently for each type of orchestration. There are different methods to track the process. This project uses OpenTracing in combination with Jaeger22, using the Jaeger bindings for Python23, to trace the whole process in a web interface.

OpenTracing is an open source tracing API used to trace distributed systems, and Jaeger is a library implementing the API. A trace describes the flow of a process as a whole: a trace propagates through a system and creates a tree-like graph of spans, each representing a segment of processing or work. Using a tracing framework, one can then trace the error prone or time consuming processes. The trace is implemented so that spans are created for each part of the system.

Listing 5 shows the initialization of a tracer. A tracer object is created from the jaeger_client Python library, which is Jaeger's Python implementation. The important parameter to look at is reporting_host, which is the address of the machine that is hosting the Jaeger server. The tracer object will forward the traces to the server.

Listing 5:
from jaeger_client import Config

def init_tracer(service):
    config = Config(
        config={
            'sampler': {
                'type': 'const',
                'param': 1,
            },
            'local_agent': {
                'reporting_host': 'x.x.x.x',
                'reporting_port': 5775
            },
            'logging': True,
            'reporter_batch_size': 1,
        },
        service_name=service,
    )
    return config.new_tracer()

tracer = init_tracer('Trace')

22 https://github.com/jaegertracing/jaeger
23 https://github.com/jaegertracing/jaeger-client-python

To trace a distributed system, each part of the system must be bound to the same trace. A span object, which is a key-value pair, is sent through the process starting from where the trace is created. The trace begins when the REST server accepts the user request. The span object is forwarded through the system: when the REST server reaches the RabbitMQ sender, the span object is forwarded to the receiver through the message's header properties. The span continues after the message is received, until Terraform provisions the infrastructure. The span object is sent to the requested infrastructure by writing its value to a Terraform variable. The negotiator can then use this Terraform variable to send the span over to the infrastructure and continue the trace there. By continuing to run Python scripts inside the machines in the infrastructure, the trace is continued.
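Conceptually, continuing a trace across the broker means serializing the span context into the message headers and deserializing it on the consumer side; OpenTracing provides tracer.inject and tracer.extract for exactly this. The dictionary round-trip below is a plain-Python stand-in for that mechanism, and the header key name is an assumption.

```python
import json

def inject_span_context(span_context, headers):
    """Write the span context into AMQP-style message headers."""
    headers["trace-context"] = json.dumps(span_context)
    return headers

def extract_span_context(headers):
    """Recover the span context on the consumer side, or None."""
    raw = headers.get("trace-context")
    return json.loads(raw) if raw else None

# Producer side: attach the current span's context to the message.
headers = inject_span_context({"trace_id": "ab12", "span_id": "cd34"}, {})
# Consumer side: start a child span from the recovered context.
parent = extract_span_context(headers)
```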

4.8 Infrastructure Implementations

Four different configurations are implemented. However, as mentioned in previous sections, there is the possibility of creating more implementations, as long as the implementation rules are followed and the requested configuration is supported by Terraform. The configurations implemented in this work are the following, which will be described in the next sections:

• A general Spark Standalone cluster

• Haste specific HarmonicIO cluster

• Haste specific application to load microscopy images

• A configuration to run a container application

The method for deploying infrastructure that requires installed software, such as Spark and HarmonicIO, is the use of Docker containers. By tying applications inside Docker containers, the difficulty of deploying the applications on different operating systems is solved. The only limitation for Docker containers is that the machine must be able to run Docker; however, most common Linux distributions can run Docker. This allows ease of deployment on different operating systems and versions of operating systems; the same cluster can for example be run on Ubuntu and CentOS. Using Docker containers in combination with Docker Compose24, the deployment is eased down to configuring the compose file to run the container. After Terraform is complete, the negotiator sends an email to notify the user of the completion; the email address is given in the request to the service.

Figure 4: Example figure of the orchestration of a compute cluster.

Figure 4 shows a typical example of how to create a compute cluster using Docker containers. The system accesses the machines in the cloud and starts the orchestration by communicating with the machines, which then use Docker images from a remote Docker repository to download the containers which contain the programs to deploy distributed applications. The Spark standalone cluster, the HarmonicIO cluster and the data image loading application (only with one machine) use the same method.

24 https://docs.docker.com/compose/


The Terraform technique used to execute scripts inside virtual machines is provisioner blocks, which include methods to upload files and to execute commands through ssh. Additionally, the data block is used for rendering run-time variables onto script files.

4.8.1 Spark Standalone Cluster

The main idea behind a Spark cluster is to run functions on a distributed compute cluster, meaning a cluster that spans several machines to increase the processing power. A Spark Standalone cluster25 is a Spark cluster that does not use additional tools such as YARN26 or Mesos27. The required parameters for this configuration are the worker count, which is the number of Spark workers the cluster is running; the data that is used to determine in which region the cluster is to be placed; the public key of the user, to later access the machines; and lastly, the name of the flavor to be used for the virtual machines.

To provide a Spark cluster on the infrastructure, Docker containers were used. By tying the Spark application inside a Docker container, the difficulty of deploying the Spark cluster is reduced. To fetch and start containers, two separate Docker Compose files were used: one to start the Spark master and the other to start the Spark workers.

The Terraform configuration file is pre-written for the Spark cluster, meaning the configuration file representing the Spark infrastructure is already written, except for some variables that are left to interpolate to adjust the cluster to the user's request. To configure the cluster according to the user's request, the variables in the pre-written configuration file are set through variable interpolation.

The first step of creating the Spark cluster is spawning the master, giving it a floating IP for outer access, and spawning the number of worker machines that was requested; this number is interpolated through a variable that is set from the user request. After the machines are spawned, the scripts that are used in the machines are uploaded to the master through Terraform's file provisioner. A snippet of how the scripts are uploaded can be seen in Listing 6. All files in the scripts folder are uploaded to the host machine in the connection block, using a private key to ssh to the machine. The master machine is given multiple scripts: one bash script to run its commands, the two previously mentioned Docker Compose files for master and slave, and one script to start the worker machines.

25 https://spark.apache.org/docs/latest/spark-standalone.html
26 http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html
27 http://mesos.apache.org/

Listing 6:
provisioner "file" {
  connection {
    host        = "${openstack_networking_floatingip_v2.floating_ip.address}"
    type        = "ssh"
    user        = "ubuntu"
    private_key = "${file("${var.ssh_key_file}")}"
  }

  source      = "./scripts/"
  destination = "scripts"
}

After the master machine is instantiated, it executes one of the bash scripts to download Docker and Docker Compose and starts the Spark master container using the Spark master Docker Compose file. It then transfers the worker Docker Compose file and the worker script to the worker machines using the scp command. Finally, it executes the worker script inside the worker machines through the ssh command to start the Spark worker containers. The worker script downloads Docker and Docker Compose like the master, however it uses the worker Docker Compose file to start the Spark worker, which connects to the master to form a cluster.

The script used for the Spark master contains a comma separated string of the private IP addresses of the worker machines. The Terraform method template_file is used to render the private IP addresses onto the master script. The method of rendering variables can be seen in Listings 7 and 8. The worker instances' IP addresses are joined together and rendered onto the slaves variable in the master script. To connect to the master from a worker, the master IP address is required in the worker Docker Compose file, so in the same fashion, the master IP address is rendered into the worker Docker Compose file using the same Terraform method template_file. When the worker container starts, it connects to the master. The cluster is complete when all the workers have finished executing. The end result is one Spark master running in a Docker container, and one or multiple Spark workers on separate machines running containers that connect to the master. Figure 4, which was previously shown, describes the final result and part of the orchestration process: the system communicates with the machines, which in turn communicate with a remote Docker repository to form a cluster.

Listing 7:
data "template_file" "master_template" {
  template = "${file("scripts/master_script.sh")}"

  vars {
    slave_adresses = "${join(",", openstack_compute_instance_v2.slave.*.access_ip_v4)}"
  }
}

Listing 8:
#master_script.sh
slaves=${slave_adresses}

4.8.2 HarmonicIO cluster

HarmonicIO [18] is a streaming framework developed through the HASTE project. To summarize, HarmonicIO is a peer-to-peer distributed processing framework. Its purpose is to let users stream any data to HarmonicIO, have the data directly processed in HarmonicIO worker nodes and then store the processed data in data repositories, which also lets the users preview the data before the process is complete. A HarmonicIO cluster also operates with the master-worker architecture, similar to a Spark cluster: there is an individual master machine and one or multiple worker machines that handle the data processing. Manual orchestration of a HarmonicIO cluster is similar to orchestrating a Spark cluster. The important steps are the following, starting with the master node:

1. Instantiate a master node machine.

2. Download the HarmonicIO remote repository

3. Run the bash script to install dependencies

4. Change the IP address in the configuration file for the master node

5. Finally run the master script to start the master node


Then, for each worker:

1. Instantiate a worker machine

2. Download the HarmonicIO remote repository

3. Install Docker

4. Change the master address and the internal address in the configuration file

5. Finally run the worker script to start the worker node

The implementation to deploy a HarmonicIO cluster is similar to the Spark cluster. The user parameters for the HarmonicIO cluster are the number of workers, the flavor of the instances, which region to place the cluster in and, lastly, the public key of the user to later gain access to the master node.

The Terraform configuration file is again pre-written, to fit a HarmonicIO cluster, with dynamic variables that are input from the user. The configuration is similar to the Spark cluster's: one resource block is used for the master and another for the workers. A count variable in the worker block determines how many workers are to be deployed. Listings 9 and 10 show the configuration for the nodes. The master node is a single machine, while the number of workers is determined by the count variable. Other variables are shown as well: the flavor, image and key pair for the instances.

Listing 9:
resource "openstack_compute_instance_v2" "master" {
  name            = "HIO Master"
  image_name      = "${var.image_name}"
  flavor_name     = "${var.flavor_name}"
  key_pair        = "${var.key_pair_name}"
  security_groups = ["default", "Tony"]

  network {
    name = "${var.network_name}"
  }
}

Listing 10:
resource "openstack_compute_instance_v2" "worker" {
  count           = "${var.instance_count}"
  name            = "${format("HIO-Worker-%02d", count.index + 1)}"
  image_name      = "${var.image_name}"
  flavor_name     = "${var.flavor_name}"
  key_pair        = "${var.key_pair_name}"
  security_groups = ["default"]

  network {
    name = "${var.network_name}"
  }
}

Once the machines are created, the next step is to connect them into a HarmonicIO cluster. A Python script is provided for the master and another for the worker machines. The master Python script follows steps (2) and onwards in section 4.8.2 to download HarmonicIO from a remote repository, install the dependencies, set the configuration file and execute the script to start the master. It then uses the scp command to transfer the Python script and a worker bash script to the workers, one at a time. The bash script for the worker installs the Python dependencies, and the Python script is used for running steps (2) and onwards. Both the master and worker Python scripts initiate a Jaeger tracer object which continues the trace, and each step is wrapped inside a trace span to inform the trace server and the user that the steps are running. To ensure that the trace is a continuation of the request trace from the negotiator, the Terraform variable that contains the span object value is interpolated into the Python scripts.

4.8.3 Loading Microscopy Images

This configuration implements the deployment and execution of a HASTE-specific program that loads a certain set of microscopy images from a larger set of images.

The input parameters are slightly different in this case, since this configuration is not a cluster running the master-worker architecture. The required input parameters are the source and destination object store containers, that is, where to read the input and where to store the output. The data aware function is used here to locate which region the process should be executed in: given the name of the object store container to download from, the files are downloaded to the machine. The public key and flavor are given as in the previously implemented provisioners. A return address is also given, to let the user know when the execution is complete. Lastly, the user can choose, using a boolean value, whether the machine should be destroyed after it has finished running.

Considering that this infrastructure is not a cluster but a single machine, the Terraform configuration becomes simpler. The pre-written Terraform configuration file contains one resource block that creates a single machine. A Python script, used for downloading dependencies, downloading object store files, tracing and running a container, is provided along with a Docker Compose file used to download and run the container.

After the machine is created, the Python script is transferred to it and installs its own dependencies. The script then runs as follows. It starts a tracer with a span continuing from the negotiator, and every following step is wrapped in a span to create a trace around it. The script downloads Docker and Docker Compose, which are used for the container, and downloads the object store files from the given container. It then starts the container image using Docker Compose, with the object store files mounted as container volumes.
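The last step, starting the container with the downloaded files mounted as volumes, can be sketched as below. This is a hypothetical reconstruction: the helper names, the `/data` mount point and the file paths are assumptions, not taken from the thesis; only the general shape (a `docker-compose up` invocation plus bind mounts) follows the text.

```python
def compose_command(compose_file, detach=True):
    """Build the `docker-compose up` invocation the script would execute."""
    cmd = ["docker-compose", "-f", compose_file, "up"]
    if detach:
        cmd.append("-d")
    return cmd

def volume_mounts(downloaded_files, container_dir="/data"):
    """Map each downloaded object-store file to a bind mount inside the container."""
    return ["%s:%s/%s" % (path, container_dir, path.split("/")[-1])
            for path in downloaded_files]

files = ["/tmp/images/img_001.tif", "/tmp/images/img_002.tif"]
print(compose_command("docker-compose.yml"))
print(volume_mounts(files))
```

In the real deployment the mount strings would be written into the Docker Compose file rather than printed, but the mapping from object store objects to container paths is the same.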

The application inside the Docker container is written to read files from the mounted directory, execute its program and store the result in a result file. The host machine can then take the result file and upload it to the object store. After this step, the host machine has finished executing, and the negotiator checks whether the user requested that the VM be terminated when finished. If so, the VM is destroyed.

4.8.4 Single Container Application

Lastly, there is a configuration that deploys a given container application from Docker Hub28 on OpenStack. The purpose of this configuration is to provide a general configuration for deploying containers, and it is similar to the previous one. The parameters are the URL of the Docker container on Docker Hub, the commands that the user wants to execute, the region, the return address and the public key. This configuration creates a single machine in the requested region, using the metadata store to fetch the region data. The machine runs a Python script that traces and installs the required dependencies, downloads Docker and then executes all the requested commands.

28 https://hub.docker.com/


4.9 Simple Web User Interface

To improve the user experience, a graphical web user interface was developed using the React library as a simple single-page application. The web interface is essentially a proof of concept and guides the user through the following steps. Figure 5 shows three images, one for each of the first three steps: the first lets the user choose between deleting and creating infrastructure, in the second the user chooses the configuration, and in the last the input parameters for the configuration are given before pressing create, which sends a REST request to start the process.

1. Select a choice to create new infrastructure or delete existing infrastructure.

2. Select which configuration to request.

3. Fill in the parameters for the chosen configuration and create the infrastructure.

4. The trace URL is returned to the user which gives the user access to the trace ofthe requested infrastructure.
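The steps above end in a REST request to the service. A sketch of the request body the interface might assemble is shown below; the field names (`action`, `configuration`, `parameters`) and the example parameter values are hypothetical, since the thesis does not specify the exact wire format.

```python
import json

def build_request(action, configuration, parameters):
    """Assemble the JSON body the web interface would POST to the REST server.

    The field names here are assumptions for illustration; only the three-step
    structure (action, configuration, parameters) mirrors the UI flow."""
    if action not in ("create", "delete"):
        raise ValueError("unknown action: %s" % action)
    return json.dumps({
        "action": action,                 # step 1: create or delete infrastructure
        "configuration": configuration,   # step 2: which configuration to request
        "parameters": parameters,         # step 3: configuration-specific input
    }, sort_keys=True)

body = build_request("create", "spark-standalone",
                     {"worker_count": 2, "flavor": "ssc.small",
                      "email": "user@example.com"})
print(body)
```

The REST server would answer such a request with the trace id and URL described in step 4.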

5 Results

The result of this work is a service that allows users, with a few clicks or a simple REST request, to create infrastructure for scientific computing in the cloud. The benefit of the service is the ability to theoretically create any type of infrastructure on any provider that is supported by Terraform; that is, the system can be further extended with more provisioning configurations than the four mentioned in the method section. This work implements four different options for infrastructure using the OpenStack provider: an option to create a Spark cluster, an option to create a cluster for the HASTE-specific HarmonicIO application, an option to run a single container, and lastly an option to run a green channel application used for image processing on a single machine, with the extra features of automatic execution, storing of the files and automatic tear-down of the machine.

The service includes two key features. The data aware feature grants the system the ability to use pre-provided metadata, allowing the user to omit providing it. Tracing improves the transparency between the service and the user, granting the user an overview of the state of the process.


Figure 5: Example web interface.


Figure 6: Example trace of the Spark Standalone configuration

5.1 Spark Standalone Cluster

A request containing one worker with a trivial data set in a region inside SNIC Science Cloud was sent to the service with the user's public key, a trivial flavor and the user's email address. The trace id and the URL to the web interface are returned immediately after the request, and the service starts orchestrating the cluster through Terraform using the pre-configured scripts. The result is three created machines: one master with a floating IP attached and two worker machines. The trace in Figure 6 shows how a master is created, which in turn starts a worker.

5.2 HarmonicIO Cluster

A HarmonicIO cluster was requested with two workers. The request includes the worker count, data, flavor, the public key and the user's email address. After sending the request, the trace id and the URL to the web interface containing the trace are returned. The REST server accepts the request and starts orchestration of a cluster with three machines, where one machine is the master with a floating IP attached and two are worker machines.

The trace can be seen in Figure 7 and is split into two sub-figures.

Figure 7: Spans of the HarmonicIO trace.

The first sub-figure, 7(a), shows the starting point of the trace. The negotiator gets the request and executes the Terraform configuration to start the orchestration. The master machine accepts the continuation of the trace and starts its own scripts to start the HarmonicIO master. The second sub-figure, 7(b), shows how the workers are started. The first worker receives the continuation of the trace and uses its script to start the HarmonicIO worker process. When the first worker has finished, the second worker starts the same process, and the orchestration is complete.

5.3 Image Loader

A request was passed to the image loader configuration to load a set of images from a trivial container. A couple of containers in different SNIC regions were pre-provisioned in the metadata store with data related to the region and the configuration, for example the network id and project name used by Terraform. The request was sent to the system with the name of a container in the UPPMAX region as the source container to read images from, along with the other parameters, most importantly the return address, set to the sender's email address. The request also specifies that the running VM is destroyed when the Docker container has finished processing. After sending the request, the trace id and a URL are returned to the user.

Figure 8: Trace including spans and time of the image loader configuration.

The negotiator accepts the request and creates a VM in the UPPMAX region, because it understands from the metadata that the process should be executed in UPPMAX. The execution ends with a set of images loaded into the same container and a notification to the email address explaining that the process is finished.

A full trace of the whole process is available in the Jaeger client interface, which can be accessed with the id. The trace can be seen in Figure 8. It shows that the REST server receives the request and pushes it to the receiver of the message broker. The negotiator then handles the request and begins creating the infrastructure. The generated Terraform configuration is executed to provision the infrastructure, and the trace is passed to the VM, where the execution of the script can be seen: the dependencies and container objects are downloaded, and the Docker container is started and run. Finally, a notification is sent to the user through email.

5.4 Running a Trivial Container

To run the single container configuration, a request with a Docker Hub URL to a trivial container application was sent to the service. The request also included the public key, the floating IP address and a few trivial commands. The service returns the trace URL and starts creating the infrastructure. A single machine then downloads the dependencies and the container from the given URL, the commands are executed and the process is finished. The spans of the trace can be seen in Figure 9: Figure 9(a) contains the negotiator trace and Figure 9(b) contains the machine trace, in which the container is downloaded and the commands are executed.

Figure 9: Spans of the container application.

6 Discussion & Evaluation

This section evaluates the service produced in this work. Comparisons are made against other software that uses similar methods, the selling points of this work, the tracing and the data aware function, are evaluated, and some of the system's drawbacks and weaknesses are reviewed.

6.1 Comparison Against Other Methods

To compare this work with others, a general overview of the process for provisioning different types of infrastructure using similar methods is presented: two different applications for creating compute clusters, and a discussion of manual infrastructure provisioning using no or few additional tools.


6.1.1 SparkNow

SparkNow, as mentioned in the related work section, is an open source project used to deploy a Spark cluster on OpenStack. In summary, the workflow to deploy a Spark cluster with SparkNow is to download the repository, export a set of environment variables that come from OpenStack metadata, use the source Linux command on the OpenStack RC file to set additional environment variables, use Packer (an image building tool) to build an image, configure additional metadata and variables for the cluster architecture, and finally orchestrate the cluster using Terraform.

SparkNow is perhaps more difficult to deploy for the average user. It requires a considerable amount of OpenStack knowledge to set the environment variables, as well as knowing exactly which variables should be used and where. The user also has to install multiple binaries, including Terraform, Packer and Git, and some Linux familiarity is required to deploy with SparkNow.

The work of this thesis also provides a Spark cluster, but with additional features and the ability to skip most of the required deployment steps. The main differences are the data aware function, which lets the user pre-provide the metadata or have the variables pre-provided by someone else, and the tracing mechanism. Also, zero installations are required, because this work provides a REST service accessible from the web or from the command line. On the other hand, SparkNow provides many options for configuring the Spark cluster differently depending on the user's needs, whereas the only configuration this work provides is the worker count and flavor.

6.1.2 KubeSpray

KubeSpray is also an open source project that similarly uses Terraform configuration to provision a Kubernetes cluster. It offers different deployment methods, allowing for more options: one uses Terraform and another uses Ansible. Using Terraform, it is possible to deploy on both AWS and OpenStack.

To create a cluster, KubeSpray requires that multiple applications be installed and that many variables be set in different files. The software and variables also differ between the OpenStack and AWS deployments. Still, KubeSpray makes a considerable effort to ease the deployment of Kubernetes. From this work's point of view, however, the deployment can be made even simpler by this work's REST service: installing software is only required once, and providing the metadata is also only required once. Just like SparkNow, KubeSpray offers much more configuration.

6.1.3 Manual Provisioning

There are multiple ways to manually deploy any type of cluster involving multiple machines, or to run computations inside a single VM. The manual process is, however, laborious compared to the multitude of solutions that have been developed so far, and compared to this and most other works it requires much more extensive knowledge of the deployment process. Not only is it required to know how to deploy a Spark cluster, that is, installing dependencies and the required software on both master and worker machines, but also how to use the cloud provider, which could be OpenStack, AWS, Google App Engine or any other provider. Deploying the HarmonicIO or Spark cluster manually is no easy feat either. A few HASTE members know how to deploy it; otherwise there are manual instructions29. For a new member who does not know how to deploy HarmonicIO, doing it manually would be difficult and perhaps troublesome for other HASTE members. The work of this thesis could thus reduce the workload of the members of HASTE.

6.2 Future Development Complexity

The main selling point of this service is the potential of being cloud agnostic. It is already theoretically possible for the service to provide infrastructure for different providers, as long as Terraform supports the provider. To realize this potential, however, the service needs to be developed further, adding more configurations than the four mentioned before, and adding the same configurations for different providers. Adding a configuration is not necessarily easy. As of now, the minimum requirement is to add a folder with a file that returns a valid Terraform configuration, or a folder that has a Terraform configuration ready. To develop a configuration that is actually useful, however, sufficient knowledge about Terraform and the Terraform language is required.

The diagram in Figure 10 describes the current configurations and how to add a new one. The requirement for adding a new configuration is a new class that implements the negotiator interface with its orchestrate_resources function, which returns a Python dictionary. Since Python does not explicitly have interfaces, the system is programmed in a way that simulates them.

29 https://github.com/HASTE-project/HarmonicIO/blob/master/Readme.md

Figure 10: New configurations are created by interfacing.

The difficulty here is that the dictionary has to be a valid Terraform configuration, equivalent to a Terraform configuration in JSON format. Programming the Terraform configuration is perhaps not easy, and the programmer must have sufficient knowledge about the provider and Terraform to create a configuration. A simple Terraform configuration is, however, most often not enough to create a full infrastructure: it is often necessary to provide external scripts along with the Terraform configuration, to execute commands or install dependencies inside the machines of the infrastructure, and most importantly a Python script to support the tracing mechanism during the orchestration process. Also, because the Python script runs with Jaeger tracing, the machines are required to have its dependencies installed.
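A minimal sketch of adding such a configuration is shown below. The orchestrate_resources name comes from the text; the class names, constructor parameters and the exact dictionary layout are assumptions, simplified from Terraform's JSON configuration syntax rather than copied from the actual system.

```python
class Negotiator:
    """Simulated interface: Python has no interface keyword, so the base
    class raises unless a subclass implements orchestrate_resources."""
    def orchestrate_resources(self):
        raise NotImplementedError

class WorkerClusterConfig(Negotiator):
    """Hypothetical configuration class returning a Terraform-JSON-style dict."""
    def __init__(self, instance_count, image_name, flavor_name):
        self.instance_count = instance_count
        self.image_name = image_name
        self.flavor_name = flavor_name

    def orchestrate_resources(self):
        # The returned dict must be equivalent to a valid Terraform
        # configuration in JSON format (here: one worker resource block).
        return {
            "resource": {
                "openstack_compute_instance_v2": {
                    "worker": {
                        "count": self.instance_count,
                        "image_name": self.image_name,
                        "flavor_name": self.flavor_name,
                    }
                }
            }
        }

config = WorkerClusterConfig(2, "Ubuntu 18.04", "ssc.small").orchestrate_resources()
print(config["resource"]["openstack_compute_instance_v2"]["worker"]["count"])
```

The negotiator can serialize such a dictionary to JSON and hand it directly to Terraform, which is what makes the returned value effectively a Terraform configuration.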

6.3 Tracing

The tracing implementation gives the user more transparency regarding the process than they would otherwise have, and problems that occur become more understandable. For example, if fewer workers than expected are created for the Spark cluster, it could be possible to see which code blocks were not executed and perhaps detect what the problem was. There are, however, some issues with the implementation. This work places traces around code blocks, which means that it is not possible to trace inside imported functions, and since the purpose of tracing is to see what is wrong, it remains difficult to trace problems in those functions. The main issue concerns the longer running commands, such as terraform apply. This is one of the longest commands to run, since it provisions the actual infrastructure. It can generate different errors, and with the current tracing it is only possible to see that there was an error, not which one. If the tracing could be injected into the function, there would be more possibilities to detect the types of errors that may occur.

Adding tracing further increases the code development complexity. Since each provisioning configuration has a different type of implementation and scripts, each script needs to include tracing differently. It is, however, possible to skip the tracing part for future configurations. Adding a trace that is actually useful is time consuming as well.

It is interesting to discuss the usefulness of the trace to different types of users. To the scientist, the trace might be incomprehensible and essentially useless. On the other hand, someone who understands the provisioning process well could certainly use the trace to understand eventual issues during the process.

6.4 Data Aware Function

The data aware function reduces the metadata required from the user. The issue with the function is that the required metadata has to be pre-provided. This does not defeat the whole purpose of the function, but someone has to add the required metadata at some point in time, and doing so demands knowledge about the negotiator and the configuration implementation, specifically how the metadata is to be stored. Any change to the implementation of a configuration may require a change in the metadata as well, which could cause problems where a change in the configuration breaks the service. This means that the users must rely on someone, or themselves, to provide the metadata; the service works well in the scenario where someone provides the metadata for the user, but otherwise the point of the data aware function is lost.
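The pre-provided metadata lookup can be sketched as a simple keyed store. The region names, field names and values below are illustrative assumptions; only the idea, that the negotiator resolves region-specific Terraform variables so the user does not have to supply them, comes from the text.

```python
# Hypothetical pre-provided metadata, keyed by region; fields are illustrative.
METADATA_STORE = {
    "UPPMAX": {"network_id": "net-1234", "project_name": "haste-uppmax"},
    "HPC2N":  {"network_id": "net-5678", "project_name": "haste-hpc2n"},
}

def resolve_region(container_region):
    """Return the Terraform variables pre-provided for the region holding
    the user's data, sparing the user from supplying them in the request."""
    try:
        return METADATA_STORE[container_region]
    except KeyError:
        raise ValueError("no metadata pre-provided for region %r" % container_region)

print(resolve_region("UPPMAX")["project_name"])
```

The failure branch illustrates the drawback discussed above: if nobody has pre-provided metadata for a region, the data aware function cannot help and the request fails.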

6.5 Security Issues

Users may be reluctant to use the service because of security issues. The issues depend on the configuration, but one of them is that, for Terraform to execute scripts inside the OpenStack virtual machines, Terraform requires a private key that is stored on the server. If the key is acquired, the user's machines might be compromised. An issue specific to the Spark and HarmonicIO configurations is that the private key is uploaded to the master machine, because the key is required for the master to connect to the workers. Also, for the image loader, the user credentials are relocated to the machine to authenticate and download the object store containers. There is certainly a security issue in sending sensitive data to the REST service, due to the risk of man-in-the-middle attacks. That is another trade-off to consider: because of the REST service implementation, users do not require any installations, but to interact with the service a connection over the Internet must be established, and the resulting security issues might repel users.


6.6 Limitations of This Service

The main limitations of this service lie in the implementations of the four different configurations. For each configuration, the user is locked into a certain set of parameters with a specific infrastructure configuration. It is important to note that the main thing the user can change in the Spark and HarmonicIO clusters is the number of workers. If the user requires other functionality or changes to the cluster, it is not possible unless they configure it manually after the service has completed the initial creation, for example if the user wants to use another Spark version, another Spark configuration, or external tools such as YARN or Mesos. This limitation exists for the image loader as well; it has only one purpose, which is to run the Docker container.

This can be solved by changing the configurations to allow for more parameters. With such changes, however, the complexity of the configuration grows, because more code has to be added. For each added parameter, more code has to be written in the form of Terraform configuration, and depending on the change there might be changes in the Python code, which means that more tracing code has to be added as well.

Because the foundation of this work is built on Terraform, the complete service is limited by Terraform; however, Terraform is an open source tool that is continuously updated. There is of course the problem that, if Terraform becomes obsolete, this work cannot progress anymore unless it is extended to implement additional tools. It is not easily possible for someone to extend the software outside Terraform, and adding configurations outside Terraform's scope would be increasingly difficult.

It has been mentioned that with Docker containers, applications can be deployed with ease across multiple operating systems. There is, however, a limitation in the current implementation. Because a Python script has to run along with bash scripts on the machines during orchestration of the configurations, the operating system must be able to run these scripts. For Python to run, a number of dependencies have to be installed, and a major issue is that Jaeger requires a Python 2.x version. This is the main reason for including a Python script, and it also means that the operating system must support a Python 2.x version.


7 Future Work

Because this work focused on OpenStack deployment, it would be interesting to implement clusters for different providers. Even though it is theoretically possible to run on, for example, Amazon, because Terraform supports the Amazon provider, actually implementing it and seeing it work is desired. This would mean that a user could deploy a single cluster configuration on two different cloud providers, have data on different cloud providers, or eventually create a cluster on the most appropriate provider depending on parameters such as cost or availability.

When adding more configurations, and more parameter options to each configuration, it is important to keep the design simple. One of the main points of discussion is how difficult it is to further develop the system. From a design perspective, it is possible to think of this work like any other software system and implement more design principles, such as interfaces, to add to the system's longevity.

There are other tools and frameworks besides Terraform that this thesis has not explored. Another future implementation would be to write an abstraction layer over multiple different infrastructure provisioning tools, for example combining the capabilities of Terraform and Ansible in one layer so that the user can interact with both tools instead of just one, or using the previously mentioned SparkNow and KubeNow in combination with this system to create Spark and Kubernetes clusters.

The previously mentioned limitation of requiring the machines to run a Python 2.x version could be solved with Docker containers that include all the required dependencies, such as the Python libraries and a Docker installation. This, however, adds development complexity: the initiation of the machine would require a Docker container, and the application inside the container would itself have to start Docker containers. For example, a Spark cluster would have to run a Docker container which itself starts the Docker master container.

8 Conclusion

This work implements a service which eases the infrastructure provisioning process for users who want to create a Spark cluster, and for users inside the HASTE project who need to run a HarmonicIO cluster or an image filtering application. It is also possible to further improve the service and add more configurations, not only for OpenStack, which this work explores, but for other cloud providers as well, to create a service more in line with the cloud agnostic philosophy. The service's tracing function adds more transparency for the user, and the data aware function potentially requires fewer parameters from the user to deploy infrastructure.

The main difference between this work and the other similar works that have been discussed is the trade-off between changeability of the infrastructure and effortless deployment. Adding more options and configurations makes the infrastructure more complicated to deploy. The other deployment methods compared against do come with more power to change the infrastructure depending on the user's needs, while this work provides multiple static infrastructures that are difficult to configure and change, or offer few possibilities for change, but with a compensating ease of deployment. For users who are not well informed regarding cloud concepts and application deployment, but who would like to use cloud infrastructure to run computations, this service is a good fit, while users who are well informed and require a specific cluster would perhaps prefer other deployment methods or manual deployment.

Applying another layer of abstraction over already existing software works quite well, but adding more layers comes with limitations for the users. Reducing those limitations requires further code complexity, and the implementation difficulties increase. For the service to fully reach the cloud agnostic ideology, it would have to provide configurations for all providers and all possible types of infrastructure, which leads to an immensely large project that is challenging and demanding to maintain.

40

Page 47: A Service for Provisioning Compute Infrastructure in the Clouduu.diva-portal.org/smash/get/diva2:1352789/FULLTEXT01.pdf · 2019-09-19 · 4.9 Simple Web User Interface ... puting

References

References

[1] https://strategiska.se/pressmeddelande/200-miljoner-till-big-data-och-berakningsvetenskap/.Accessed: 2018-06-13.

[2] Cern data centre passes the 200-petabyte milestone. https://www.home.cern/about/updates/2017/07/cern-data-centre-passes-200-petabyte-milestone. Accessed:2018-04-22.

[3] Haste: Hierarchical analysis of spatial and temporal data. http://haste.research.it.uu.se. Accessed: 2018-04-07.

[4] Iaac for devops: Infrastructure automation using aws cloudforma-tion. https://community.toadworld.com/platforms/oracle/w/wiki/11715.iaac-for-devops-infrastructure-automation-using-aws-cloudformation. Accessed:2018-04-22.

[5] Rabbitmq. https://www.cloudamqp.com/blog/2015-05-18-part1-rabbitmq-for-beginners-what-is-rabbitmq.html. Accessed:2018-07-12.

[6] Rabbitmq. https://www.rabbitmq.com/tutorials/tutorial-one-python.html. Ac-cessed: 2018-08-13.

[7] Ska project. https://www.skatelescope.org/project/. Accessed: 2018-04-22.

[8] Danilo Ardagna, Elisabetta Di Nitto, Parastoo Mohagheghi, Sebastien Mosser,Cyril Ballagny, Francesco D’Andria, Giuliano Casale, Peter Matthews, Cosmin-Septimiu Nechifor, Dana Petcu, Anke Gericke, and Craig Sheridan. Modaclouds:A model-driven approach for the design and execution of applications on multipleclouds. pages 50–56, 06 2012.

[9] Marco Capuccini, Anders Larsson, Salman Toor, and Ola Spjuth. KubeNow:A Cloud Agnostic Platform for Microservice-Oriented Applications. In Fer-gus Leahy and Juliana Franco, editors, 2017 Imperial College Computing Stu-dent Workshop (ICCSW 2017), volume 60 of OpenAccess Series in Informatics(OASIcs), pages 9:1–9:2, Dagstuhl, Germany, 2018. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.

[10] W. Chen, C. Liang, Y. Wan, C. Gao, G. Wu, J. Wei, and T. Huang. More: A model-driven operation service for cloud-based it systems. In 2016 IEEE InternationalConference on Services Computing (SCC), pages 633–640, June 2016.

41

Page 48: A Service for Provisioning Compute Infrastructure in the Clouduu.diva-portal.org/smash/get/diva2:1352789/FULLTEXT01.pdf · 2019-09-19 · 4.9 Simple Web User Interface ... puting


[11] T. Dillon, C. Wu, and E. Chang. Cloud computing: Issues and challenges. In 2010 24th IEEE International Conference on Advanced Information Networking and Applications, pages 27–33, April 2010.

[12] Roy T. Fielding and Richard N. Taylor. Architectural styles and the design of network-based software architectures. Doctoral dissertation, University of California, Irvine, 2000.

[13] Peter M. Mell and Timothy Grance. SP 800-145: The NIST definition of cloud computing. Technical report, Gaithersburg, MD, United States, 2011.

[14] Dirk Merkel. Docker: Lightweight Linux containers for consistent development and deployment. Linux Journal, 2014(239):2, 2014.

[15] J. Sandobalin, E. Insfran, and S. Abrahao. An infrastructure modelling tool for cloud provisioning. In 2017 IEEE International Conference on Services Computing (SCC), pages 354–361, June 2017.

[16] J. Scheuner, P. Leitner, J. Cito, and H. Gall. Cloud WorkBench: Infrastructure-as-code based cloud benchmarking. In 2014 IEEE 6th International Conference on Cloud Computing Technology and Science, pages 246–253, December 2014.

[17] Yassine Tabaa and Abdellatif Medouri. Towards a next generation of scientific computing in the cloud. International Journal of Computer Science, 9(6):177–183, 2012.

[18] Preechakorn Torruangwatthana. S3DA: A Stream-based Solution for Scalable Data Analysis. Master's thesis, Uppsala University, 2017.

[19] Andrea Tosatto, Pietro Ruiu, and Antonio Attanasio. Container-based orchestration in cloud: State of the art and challenges. In Complex, Intelligent, and Software Intensive Systems (CISIS), 2015 Ninth International Conference on, pages 70–75. IEEE, 2015.

[20] Ian Unruh, Alexandru G. Bardas, Rui Zhuang, Xinming Ou, and Scott A. DeLoach. Compiling abstract specifications into concrete systems: Bringing order to the cloud. In 28th Large Installation System Administration Conference (LISA14), pages 26–42, Seattle, WA, 2014. USENIX Association.

[21] Andreas Wittig and Michael Wittig. Amazon Web Services in Action. Manning Publications Co., Greenwich, CT, USA, 1st edition, 2015.

[22] Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, and Ion Stoica. Apache Spark: A unified engine for big data processing. Communications of the ACM, 59(11):56–65, October 2016.

[23] R. Zarnekow, S. Wind, K. Turowski, and J. Repschlaeger. A reference guide to cloud computing dimensions: Infrastructure as a service classification framework. In 2012 45th Hawaii International Conference on System Sciences (HICSS), pages 2178–2188, January 2012.

[24] Qi Zhang, Lu Cheng, and Raouf Boutaba. Cloud computing: state-of-the-art andresearch challenges. Journal of Internet Services and Applications, 1(1):7–18,May 2010.
