
    Large-scale Elastic Data Processing in Micro-cloud

    Environments based on StreamMine3G

    Pawel Skorupinski

    June 16, 2013

    Abstract

Large-scale processing of data in own data centers as well as cloud computing are two well-established approaches; however, both have drawbacks when a concept of data as a service is to be introduced. Therefore, a novel data as a service model, based on Micro-clouds, was presented [lea]. It offers the possibility of querying public data resources to companies that cannot afford to extract, store, or process all the data in house.

A characteristic of a Micro-cloud environment is that data is distributed over small, geographically distributed, inhomogeneous data centers. That imposes a need for a new approach to operator placement that is highly aware of execution costs, which may be influenced by transfer costs over WAN, low bandwidths over links, and the limited safety of Micro-cloud locations.

In the thesis, different algorithms searching for an optimal operator placement are presented. It is shown that a relaxation of the problem to a linear programming (LP) problem with an awareness of data processing topologies leads to good placements. Furthermore, ideas for defining the problem more precisely with an extension to a mixed integer linear programming (MILP) problem, as well as possibilities of using metaheuristics, are tentatively analyzed and described.

Contents

1 Introduction
2 Background
  2.1 Concepts about a Data Processing Framework
    2.1.1 Web Crawling
    2.1.2 Storing Data
    2.1.3 Data Processing
  2.2 Background of Micro-cloud Environment Design
    2.2.1 Micro-clouds and Data Centers Concept
    2.2.2 Micro-clouds and Cloud Computing Concept
3 Algorithms Description
  3.1 General Factors to Be Considered
  3.2 General Formulation of a Price-aware Operator Placement Problem (OPP) for the Micro-cloud Environment
    3.2.1 Additional Constraint Regarding Operator Placement Problem
  3.3 Possible Topologies for Data Processing
  3.4 Greedy Approach for an Operator Placement Problem
  3.5 All in One Micro-cloud Approach for Solving an Operator Placement Problem
  3.6 Approach Based on the Simplex Algorithm for Operator Placement Problem Solving
    3.6.1 Transportation Problem [Chu]
    3.6.2 Connections between Operator Placement Problem, Transportation Problem and Linear Programming
    3.6.3 Reducing an Operator Placement Problem to a Transportation Problem
    3.6.4 Constraints
  3.7 Approach for Choosing Hosts for Processing in Destination Micro-clouds
4 System Design
  4.1 Persistent Model of the System
    4.1.1 Model of the Physical System
    4.1.2 Pricing Profiles
    4.1.3 Data Sources
    4.1.4 Profiles of Worker Algorithms
    4.1.5 Queries Waiting for an Execution and Information on the System State
  4.2 Data Sources
    4.2.1 mongoDB - a Technology for Historic Data Sources
    4.2.2 Live Data Sources
  4.3 StreamMine3G Platform - a Technology for Event Processing [sm3]
    4.3.1 Accessop Implementation
    4.3.2 Mapper, Workerop and Partitioner Implementation
    4.3.3 Implementation of the Manager
  4.4 Design and Implementation of the Tasks Scheduler
    4.4.1 General Architectural Approach
    4.4.2 Component Model of the Scheduler
    4.4.3 Data Flow within the System
  4.5 Implementation of the Placement Algorithms
    4.5.1 Implementation of the All in One Micro-cloud Approach
    4.5.2 Implementation of an Approach Based on the Simplex Algorithm
    4.5.3 Implementation of Algorithms for Solution Normalization and Choosing Hosts for Processing
5 Evaluation
  5.1 A Simulation Environment
    5.1.1 Design of a Simulation Environment
    5.1.2 Simulation of the Designed System
  5.2 On Approximating the Price and the Time of the Solution Execution
    5.2.1 Analysis of Sources
    5.2.2 Analysis of Connections
    5.2.3 Analysis of Destinations
    5.2.4 Calculations of the Solution's Time
    5.2.5 Calculations of the Solution's Price
  5.3 Evaluations of Positioning Algorithms
    5.3.1 Analysis of Simplex Algorithm Constraints in a Simple Mathematical Model
    5.3.2 Tests of the System Implementation
    5.3.3 Summary of the Measurements
    5.3.4 Evaluations of Algorithms based on Measurements
6 Future Work and Conclusion
  6.1 Future Work
    6.1.1 Using Metaheuristics for Complex Constraints
    6.1.2 Extended Mathematical Model with Mixed Integer Linear Programming
    6.1.3 Modeling Solutions for Queries with More Levels of Worker Operators
    6.1.4 Even Transfer Distribution over Connections
  6.2 Conclusion


    1 Introduction

A framework that gives constant access to the public resources of the World Wide Web is a very powerful tool. It makes it possible to analyze information shared on a daily basis by more than one third of the global population [int13]. Nowadays, this wealth is available only to computer science giants, like Google or Yahoo. Since the main goal of those companies is to provide World Wide Web search engines and the technologies around them, that power is the essence of their existence. There are, though, companies that would like to be able to analyze the petabytes of data accessible in the World Wide Web without putting effort and money into building and maintaining an extremely expensive processing environment. They would like to have simple access to all that data based on a data as a service (DaaS) paradigm.

Data as a service is a concept of providing data on demand to the user regardless of the geographic or organizational separation of provider and consumer [daa]. The biggest advantage of the paradigm is that the costs of maintaining and processing data are distributed between all of the customers. The data demanded by a customer can be historic data kept in persistent storage or data retrieved live from external sources.

Such a paradigm would be a great opportunity for companies that sell worldwide. They have to put a big effort into choosing the right marketing strategies in order to derive benefit. There is no better resource of data on how those strategies work than the World Wide Web. For example, sport companies like Adidas need to follow how their sponsored athletes are perceived by potential clients. Internet resources could give instant feedback from all around the world in such a domain.

Data as a service makes it possible for single companies to avoid the big maintenance costs. To further reduce the costs of data processing in general, a novel approach that reduces the total cost of system maintenance is considered. Data is to be stored and processed in small, geographically distributed data centers, called Micro-clouds. The new paradigm comes together with new challenges such as dealing with inhomogeneity, the high distribution of the system, and low bandwidths. Therefore, fault-tolerant, time- and price-aware components need to be used in the system to provide access to data and to execute computations on it.

The focus of the thesis is to present algorithms that find optimal solutions for operator placements over the system nodes and that are aware of the specifics of the Micro-cloud environment. Those specifics were analyzed during the prototypic implementation of a data processing framework that could become the core of the software in Micro-cloud environments.

The thesis is structured as follows. In Chapter 2, the background of the topic is given and the most important concepts and technologies are explained. In Chapter 3, algorithms to solve the problem of operator placement inside the Micro-cloud environment are described. In Chapter 4, the essential elements of the system design and implementation are explained. In Chapter 5, the simulation environment setup and the algorithms evaluating solutions are explained; then, the measurements on the algorithms are described. Chapter 6 contains ideas for further development of the placement algorithms and concludes the thesis.

    2 Background

    This Chapter gives a background of the topic of the thesis.

    2.1 Concepts about a Data Processing Framework

There are many concepts and technologies that need to be considered when building a large-scale elastic live and historic data processing framework. The questions that need to be answered are:

- How to retrieve World Wide Web data constantly into the system?
- How to store data inside of the system?
- How to process live as well as historic data?

    The possible answers to those questions are analyzed below.


    2.1.1 Web Crawling

Web crawling is a concept of systematic and automatic browsing of World Wide Web data [web]. Web crawlers would provide the streams of live data that could be accessed from inside the system.

In principle, Web crawlers work as follows. To start their work, they require a list of URLs to visit. They recognize the hyperlinks in every visited page, thereby finding the paths to further sites of the Web. Web crawlers need to be aware of the facts that a lot of WWW data gets updated and removed very fast, and that the same content is often represented by many URLs. Therefore, policies define their behavior. That includes rules on which pages to download, which pages to revisit, and how to coordinate the work of distributed Web crawlers.

There are many projects providing Web crawling functionality. One of them, available under the Apache license, is Apache Nutch. It is highly scalable (up to 100 machines) and feature-rich [nut].
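The crawling loop described above can be sketched in a few lines. The `fetch` and `extract_links` callbacks are assumptions standing in for real downloading and HTML parsing; revisit and politeness policies are deliberately left out:

```python
from collections import deque

def crawl(seed_urls, fetch, extract_links, max_pages=100):
    """Minimal sketch of a crawling loop, not a production crawler."""
    frontier = deque(seed_urls)   # list of URLs still to visit
    visited = set()               # avoid re-downloading the same URL
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue              # many URLs may point to the same content
        visited.add(url)
        content = fetch(url)
        pages[url] = content
        for link in extract_links(content):   # recognize hyperlinks
            if link not in visited:
                frontier.append(link)         # paths to further sites
    return pages
```

Real crawlers such as Apache Nutch additionally implement the policies mentioned above (download, revisit, and coordination rules).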

    2.1.2 Storing Data

In the created data processing framework there is a need for convenient retrieval of historic data. It is made possible by a distributed storage technology. There are several approaches for storing vast amounts of data in a distributed manner. They all should provide strategies obeying fault tolerance policies, like replication or equal distribution.

The first big group are the so-called distributed document-oriented databases. They belong to the family of NoSQL solutions. These systems are designed around the notion of a schema-free, self-contained document. Every document in the system is identified by a unique key and fully describes itself. Although no schema is defined on their content, documents can be looked up based on it. That differentiates this model from a key-value store, where only the key gives access to its values (the difference fades when key-value stores enable secondary indexing). Documents are often encoded in formats like JSON or XML that can represent structured data [dod] [dod10] [int].
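The difference can be illustrated with a toy in-memory store: documents are addressable by key, but, unlike in a plain key-value store, they can also be queried by their content. The field names below are made up for the example:

```python
# Schema-free, self-contained documents, each identified by a unique key.
store = {
    "doc-1": {"type": "tweet", "author": "alice", "text": "hello"},
    "doc-2": {"type": "article", "author": "bob", "title": "Micro-clouds"},
}

def find(predicate):
    """Content-based lookup: no schema is imposed on the documents."""
    return [key for key, doc in store.items() if predicate(doc)]

print(find(lambda d: d.get("author") == "alice"))  # ['doc-1']
```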

There is also a group of distributed systems that provide access to data spread over machines as if it were stored on a local file system. They are called distributed file systems. They encapsulate functionalities characteristic of file systems, like a hierarchy of directories and access permissions for users [dfs] [hdf11].

    2.1.3 Data Processing

To process large-scale data in a distributed manner, a special programming model is needed. MapReduce is a paradigm that allows for massive scalability across a huge number of machines. What makes this model so convenient is that processing is basically split into two jobs. The first job is called map. Its role is to take an original set of data and convert it into a set of tuples. The next job, called reduce, takes the set of tuples from the first job and reduces it into a smaller number of tuples. That division of work allows high parallelism inside the system, as both map and reduce jobs can be distributed over multiple nodes [wir] [ibm]. An example of how a word counting algorithm works with MapReduce is presented below.

    1. Original set of data:

    We are who we are

    2. Set of tuples after a map job:

    (We, 1) (are, 1) (who, 1) (we, 1) (are, 1)

    3. Outcome of a reduce job:

    (we, 2) (are, 2) (who, 1)
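The two jobs of the example can be sketched in a few lines. This is a simplified, single-process illustration (the words are lower-cased already in the map step, which is why the reduce output above is all lowercase):

```python
from collections import defaultdict

def map_job(text):
    # map: convert the original set of data into (word, 1) tuples
    return [(word.lower(), 1) for word in text.split()]

def reduce_job(tuples):
    # reduce: merge the tuples into a smaller set of (word, count) tuples
    counts = defaultdict(int)
    for word, n in tuples:
        counts[word] += n
    return dict(counts)

print(reduce_job(map_job("We are who we are")))
# {'we': 2, 'are': 2, 'who': 1}
```

In a distributed engine, many map tasks would each produce such tuple lists in parallel, and the reduce tasks would merge partitions of them.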

    2.1.3.1 Approaches for Processing


There are two basic approaches to data processing. One of them is batch processing. This is the process of analyzing a huge amount of data at once, without any manual intervention. It is meant to work on historic data [bat]. An example of a distributed batch processing engine is Apache Hadoop, based on the MapReduce processing model. The data of the big files that need to be processed can be divided between map jobs and then sent further to reduce jobs [had]. A typical Hadoop job takes hours and is run on dozens of machines. There can be one job run per input directory [Kle10].

Another approach deals with potentially infinite streams of data coming live into the system during its execution. In the standard scenario, the goal of the stream processing engine is to identify the meaningful events within those streams and to employ techniques on them such as detection of complex patterns of many events, event correlation and abstraction, event hierarchies, and relationships between events [esp].

The scenario needed for the system is specific. Here, both streams of data and the data of big files need to be processed in the same manner. In other words, historic data processed by MapReduce jobs may also be enriched by live streams of data. Of course, there should be no limit on the number of jobs defined on the data. Event stream processing engines with a MapReduce interface exist; examples are StreamMine3G and the Hadoop Online Prototype [hop].
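The idea of applying MapReduce-style logic to a stream can be sketched as an operator that maps each incoming event to tuples and folds them into running reduce state, so a live stream and a historic batch are processed in the same manner. This is an illustrative sketch, not the StreamMine3G API:

```python
from collections import defaultdict

class CountingOperator:
    """Sketch of a stream operator with a MapReduce-like interface."""

    def __init__(self):
        self.counts = defaultdict(int)   # running "reduce" state

    def on_event(self, event):
        # "map" each event to (word, 1) tuples, then reduce incrementally
        for word in event.split():
            self.counts[word.lower()] += 1

op = CountingOperator()
for event in ["We are", "who we are"]:   # stands in for a live stream
    op.on_event(event)
print(dict(op.counts))   # {'we': 2, 'are': 2, 'who': 1}
```

The same operator could be fed from a historic file, event by event, which is the unification the system scenario requires.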

    2.2 Background of Micro-cloud Environment Design

A Micro-cloud environment means a group of Micro-clouds connected to and cooperating with each other. Each of those Micro-clouds contains a group of nodes that can store data and host operators during query execution. The nodes are physically grouped into racks, and every Micro-cloud can consist of a few racks.

The design of the Micro-cloud environment is based on paradigms taken from the data center and cloud computing concepts. The similarities as well as the differences between the concepts are presented in this Chapter.

    2.2.1 Micro-clouds and Data Centers Concept

Micro-clouds differ from the standard data center approach by offering better data distribution as well as green computing opportunities. They were designed with an awareness of the disadvantages of data centers:

- High percentage of energy waste
- Poor distribution

Data centers use a lot of energy. In 2010 it was already 1.3% of the energy consumption in the world (and about 2% in the USA) [Koo11]. The biggest problem is how much of this energy is actually wasted. According to some studies, only 6-12% of the electricity powering servers in data centers performs computations [Gla12]. Much of the rest is used on cooling the devices and their surroundings, so that they do not get overheated.

Because of their sizes, data centers are typically poorly distributed. Consequently, it is often the case that they are geographically far from original data sources and potential clients.

The Micro-cloud is a new proposal created with an awareness of the drawbacks of the data center paradigm. The proposal assumes that Micro-clouds would be small data centers, containing a much smaller number of nodes and racks. They could potentially be placed in households. Placing small data centers inside houses could be pro-environmental in a few ways:

- Heat produced by processing nodes, instead of being cooled away, could be used for heating the households. That is a double advantage considering electricity usage, as less of it would be spent on cooling the machines as well as on heating the house
- System nodes would be placed in already existing houses, so there would be no need to build new huge data center buildings


The Micro-cloud approach also comes with big advantages concerning the Data as a Service paradigm. Quality of Service could potentially be increased, as the system would be better spread geographically (data would be closer to clients). It would also assure a better bandwidth distribution, as data coming from external sources and reaching external destinations (clients) would normally be divided between a lot of Micro-clouds.

The Micro-cloud approach entails new challenges because of some of its characteristics. Since the number of Micro-clouds in a potential environment would be big, it is going to be much more inhomogeneous than the standard approach. Micro-clouds will have nodes with various hardware, which means various values of parameters such as computation power. Some Micro-clouds are going to be connected with low-bandwidth links; therefore, an awareness of data transfers would need to be injected into the algorithms. Some of them would also be placed in locations that are out of full control. Physical safety would therefore need to be substituted with security on the software level.

2.2.2 Micro-clouds and Cloud Computing Concept

Basically, the concept of cloud computing means performing distributed computing over a network, on many connected machines at the same time [ccw]. It gives a solution to a fundamental problem of IT companies, which is how to increase capacity and extend capabilities without investing in new infrastructure. Moreover, cloud computing enables such extensions on the fly [EK].

There are a few models of how cloud computing services are offered to clients by providers [ccw]. According to the infrastructure as a service (IaaS) model, clients get access to virtual machines and other resources, like storage or network. The platform as a service (PaaS) model is already on an operating system abstraction level: clients get access to a platform consisting of execution environments, a web server, and a database. Software as a service (SaaS) provides access to applications and databases.

All of the classical cloud computing models share one property concerning data processing: the processing needs to be preceded by a data transfer to the cloud. The data as a service model does not stick to any of those models, as it provides shared data accessible for processing by cloud customers.

There are characteristics that a Micro-cloud environment providing data as a service would need to share with other cloud computing providers. One of them are pricing profiles. Amazon EC2 is an infrastructure as a service cloud provider that has very complex pricing profiles. Costs are defined for various virtual machine instances as well as for transfers inside the cloud and with external sources. There are also special pricing profiles for so-called spot instances. These are basically computing capacities that can be exchanged between clients [ec2]. Their profiles change over time and have characteristic peaks, meaning that the computing price suddenly grows for a short time. That happens when demand starts to exceed supply.

Such profiles would also characterize data processing queries inside a Micro-cloud environment. Their structure could be influenced by at least two factors: the current heating demands of the household where the Micro-cloud is located, and the current exploitation of the Micro-cloud. The prices could be defined for processing on nodes as well as for transfers over WAN.
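A hypothetical pricing profile combining the two factors might look as follows. The function shape and the weights are purely illustrative assumptions, not taken from the thesis:

```python
def node_price(base_price, utilization, heating_demand):
    """Illustrative per-unit processing price of a Micro-cloud node.

    High current exploitation raises the price sharply near saturation
    (demand exceeding supply, as with spot-instance peaks), while a high
    heating demand of the household lowers it (the waste heat is wanted
    anyway). Both inputs are assumed to be in [0, 1].
    """
    demand_factor = 1.0 + 2.0 * max(0.0, utilization - 0.8)  # peak near saturation
    heating_discount = 1.0 - 0.3 * heating_demand            # heat is useful
    return base_price * demand_factor * heating_discount
```

For example, a node at 90% utilization in a household with full heating demand would be priced at about 0.84 of its base price under these made-up weights.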

    3 Algorithms Description

A Micro-cloud environment needs a novel approach for finding solutions to operator placement. There are a few reasons for that, as well as a few factors that should be considered while looking for a solution. They are explained in this Chapter, followed by the general formulation of the price-aware operator placement problem for Micro-cloud environments. The algorithms explained in the thesis aim at finding optimal solutions for a specific topology of how operators send data streams between each other. Therefore, before the algorithms are mathematically described, the different topologies and their possible impact on a placement strategy are explained.


    3.1 General Factors to Be Considered

In normal cases, every portion of data inside the system is replicated a few times. Consequently, for any portion of data there is always a choice of where to take it from. Generally, depending on the Micro-cloud environment characteristics, there are a few factors that could influence the way of choosing the replicas:

- Lowering the price of the execution
- Lowering the time of the execution
- Reducing transfers through public networks
- Choosing sources that are closer to a client

The first three factors depend on each other; therefore, it is hard to consider them as independent facets of the algorithms. Instead, the algorithms could be aware of one of them, which with high probability makes them indirectly aware of the other factors.

Because of the characteristics of the Micro-cloud environment and the StreamMine3G architecture, feedback may occur. For example, it may sound favorable from a price-awareness point of view to move operators processing data from an expensive Micro-cloud to another one, even if the sources have to stay in the expensive one. However, this could raise the time of the execution (because of the bandwidth limits) and raise the transfers through wide area networks. And since it will take a longer time to transfer data to the processing operators, the access operators will also work slower, so their execution will cost more money.

The fourth, geographic factor is omitted in the algorithms' implementation. It is assumed that data is going to be located close to the client by default; therefore, in most situations this factor would not be important.

3.2 General Formulation of a Price-aware Operator Placement Problem (OPP) for the Micro-cloud Environment

As the price is said to be the most important coefficient when making a decision on a placement, the operator placement problem considered in the thesis focuses on its minimization. The formulation of the OPP presented in this chapter is on an abstract level. Additional technical constraints that should be taken into account are presented after the abstract description.

The objective of the operator placement problem is to determine which hosts should be chosen to execute the operators accessing the data of every key, as well as those to execute the operators working on that data, and which connection paths should be chosen between them, in order to minimize the total cost of processing with a given algorithm. It is assumed that access operators can be placed only on hosts from which they can retrieve data locally.

Let a set $K$ be the set of all of the keys $k$ whose data needs to be accessed and retrieved to solve the query, and $H$ be the set of all of the hosts $h$ that can be sources of at least one of the keys during the execution time of the query.

Let a source $s$ be a part of a set $S$ and be defined as a pair of a key (that represents the data of this source) and a host from which it can be retrieved. Every key and every host are sets of sources. Hence:

$$\forall s \in S \; \big( \exists k \in K \, (s \in k) \;\wedge\; \exists h \in H \, (s \in h) \big).$$

$\|s\|$ would generally mean the amount of data to be retrieved by the source. Every algorithm for processing the data has a specific topology defined that is needed to properly process the data. If it is assumed that there are $m$ host-key pairs $s$, $n$ hosts able to run a processing operator $w_{\alpha}$ on the destination level $\alpha$, and so on, a solution topology could generally be presented as in Figure 1.

Let it be assumed that a destination $d$ of any level represents a system node that is able to process data, during the execution time of the query, with the type of algorithm demanded for its level. $\|d\|$ would generally mean the amount of data to be processed by the destination.

  • 7/28/2019 Large-scale Elastic Data Processing in Micro-cloud Environments based on StreamMine3G

    8/54

    8 3 ALGORITHMS DESCRIPTION

Figure 1: General formulation for a topology of sources and a few levels of destinations to process data in a demanded way

For every source, and for every destination of every level except the last one, there exists a communication path between it and every node of the next level. Each of those paths has a cost $c$. For the paths to nodes of the level $\alpha$, $c$ equals the sum of the costs of retrieving a unit of data on a source, transferring it, and processing it on a destination. For every next level, it equals the sum of the costs of transferring a unit of data and processing it on a destination of that level.

For every communication path, the amount of data to be transferred through it, $x$, needs to be defined. By defining all of the $x$ values, a placement for all source operators and all processing operators would be defined, and consequently the OPP would be solved. Assigning $\|k_i\|$ to the size of the key $k_i$, there are the equations:

$$\forall k \in K \left( \sum_{s \in k} \|s\| = \|k\| \right),$$

$$\sum_{j=1}^{n} x_{ij} = \|s_i\|, \quad 1 \le i \le m.$$

Those equations guarantee that all of the data of the keys is going to be transported to the destinations.

On the level $\alpha$ and every other level, for a node $d_j$ there is the equation:

$$\sum_{i=1}^{m} x_{ij} = \|d_j\| = \omega_{\alpha} \sum_{k=1}^{p} x_{jk}, \quad 1 \le j \le n,$$

where $\omega_{\alpha}$ is a data size reduction coefficient for an algorithm $w_{\alpha}$ used on the destination level $\alpha$ and determines how many times data is typically reduced by that algorithm.

The price minimization function for the topology would look as follows:

$$\min z = \sum_{i=1}^{m} \sum_{j=1}^{n} c_{ij} x_{ij} + \sum_{j=1}^{n} \sum_{k=1}^{p} c_{jk} x_{jk} + \dots$$

Clearly, no negative commodities are to be transported on the paths:

$$x_{ij} \ge 0, \; x_{jk} \ge 0, \; \dots, \quad 1 \le i \le m, \; 1 \le j \le n, \; 1 \le k \le p, \; \dots$$
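For a single destination level, the formulation can be checked on a tiny made-up instance by exhaustive search over integer allocations. This brute force is only a stand-in for the LP/simplex approach of Section 3.6; a per-destination capacity is added here as an extra assumption, to make the choice non-trivial:

```python
from itertools import product

# Hypothetical single-level OPP instance: m = 2 sources, n = 2 destinations.
cost = [[2, 3],   # c[i][j]: price of retrieving, transferring and
        [4, 1]]   # processing one unit from source i on destination j
supply = [2, 1]   # ||s_i||: units each source must ship
cap = [1, 2]      # assumed per-destination capacity (extra constraint)

best, best_x = None, None
# Enumerate every integer allocation x[i][j] (feasible for tiny sizes only).
for flat in product(range(max(supply) + 1), repeat=4):
    x = [list(flat[0:2]), list(flat[2:4])]
    if any(sum(x[i]) != supply[i] for i in range(2)):
        continue  # all data of every source must be transported
    if any(x[0][j] + x[1][j] > cap[j] for j in range(2)):
        continue  # destination capacity respected
    z = sum(cost[i][j] * x[i][j] for i in range(2) for j in range(2))
    if best is None or z < best:
        best, best_x = z, x

print(best, best_x)   # → 6 [[1, 1], [0, 1]]
```

Here source 0 must split its two units between both destinations because destination 0 can take only one unit; the minimal total price is 6.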


    Figure 2: StreamMine operators in the system

    3.2.1 Additional Constraint Regarding Operator Placement Problem

There are three main constraints on solutions of the operator placement problem that determine the feasibility of the found solutions. The reasons why they should be taken into account are listed below:

1. Time (processing speed limits) - directing data streams from many sources to one destination can cause the destination to process the data slower than it comes in and therefore increase the total execution price

2. Rules of partitioning data between processing operator partitions - the definitions of operators and data partitioners in the processing engine determine how the data is split between the partitions of operators, as well as whether there is some fixed number of partitions defined for an operator. Only an awareness of this data can lead to finding optimal placements

3. Non-linear splitting of transfer between the sources of a key - there might be data sources that allow only discrete splitting of source data between nodes (e.g. mongoDB and real-time data sources belong to them). Algorithms unaware of that fact will normally only be able to find solutions that approximate the optimal solution

    3.3 Possible Topologies for Data Processing

Topology describes how the operators are connected to each other in order to solve a query within a data processing system. There might be a need to use different topologies of operators depending on the operation that was chosen to process the data. Indeed, there could also be the possibility of executing the same operation using various topologies; it would depend on what programming model is used.

A prototypic framework was prepared to solve a word count algorithm based on the MapReduce paradigm. Hence, three operators are needed in a topology. The first is an accessop, a universal operator accessing the data of historic and live sources. The next one in the topology is a mapper, and finally a workerop, playing the role of reducer. For every source, one accessop needs to be deployed. The architecture of connections between the operators is shown in Figure 2.

The mapper and workerop operators should be partitioned into slices to increase the efficiency of an execution. Moreover, a special logic needs to be employed in the way data is partitioned between them. The first important thing is that there is always one mapper slice per accessop instance. Data is partitioned in such a way that only the one slice that is co-located receives all of the data from the access operator. The partitioner of the data that is headed for the workerop slices guarantees that the data will be split as equally as possible between them. That achieves the situation in which the data flow between slices looks as presented in Figure 3.


Figure 3: Distribution of StreamMine3G operator slices and the logic of data flow between them

Topology-awareness in this context means that an algorithm is aware of the fact that every host chosen as a location of an accessop slice and a mapper slice is going to send data portions to every host chosen as a location of a workerop slice.

    3.4 Greedy Approach for an Operator Placement Problem

The first algorithm implemented is a greedy approach. Its characteristics are:

- For every key, exactly one source will always be chosen
- workerop partitions will always be placed on the same nodes where accessops are chosen to be placed

The strategy of the algorithm is to try to put operators in a low number of Micro-clouds but on various hosts (to avoid overly long reading from disk). That makes the algorithm somewhat time-aware and hence indirectly price-aware. Since this is a greedy approach, the decision on choosing the hosts is made only once for every key - there is no feedback. There are two ways of choosing hosts for keys that do not have replicas in already chosen Micro-clouds. One is price-aware - the currently cheapest host is chosen. The other - price-oblivious - chooses a random one.
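The greedy strategy can be sketched as follows. This is an illustrative reconstruction, not the thesis implementation: the data structures (`replicas`, `host_price`) and the function name are hypothetical stand-ins for the system's persistent model.

```python
import random

def greedy_placement(keys, replicas, host_price, price_aware=True):
    """Greedy source selection sketch: prefer replicas in Micro-clouds
    already chosen for earlier keys, but on distinct hosts.

    keys       - list of key identifiers to retrieve
    replicas   - dict: key -> list of (microcloud, host) pairs holding a replica
    host_price - dict: host -> current execution price (price-aware variant)
    Returns dict: key -> (microcloud, host) chosen as the single source.
    """
    chosen_clouds = set()   # Micro-clouds already used by earlier keys
    used_hosts = set()      # spread keys over distinct hosts (avoid slow disk reads)
    placement = {}
    for key in keys:
        candidates = replicas[key]
        # 1) prefer a replica inside an already chosen Micro-cloud,
        #    on a host not used yet
        local = [c for c in candidates
                 if c[0] in chosen_clouds and c[1] not in used_hosts]
        if local:
            pick = local[0]
        elif price_aware:
            # 2) otherwise open a new Micro-cloud at the cheapest host
            pick = min(candidates, key=lambda c: host_price[c[1]])
        else:
            # price-oblivious variant: pick a random replica
            pick = random.choice(candidates)
        placement[key] = pick
        chosen_clouds.add(pick[0])
        used_hosts.add(pick[1])
    return placement
```

Note how a key with a replica in an already chosen Micro-cloud is kept there even if a cheaper host exists elsewhere - this is what makes the approach greedy and only indirectly price-aware.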

3.5 All in One Micro-cloud Approach for Solving an Operator Placement Problem

This approach is based on the assumptions that:

When data is smartly distributed over Micro-clouds, it will often be the case that all the data queried can be retrieved from sources placed in one Micro-cloud

It will often be viable (at least from a time-consumption point of view) to retrieve all the data from sources placed in one Micro-cloud

Since these assumptions sound plausible, an effort was made to implement this solution. Though it is theoretically price-unaware, the time savings caused by the absence of WAN transfer can lead to cost savings as well.

The algorithm searches through the hosts to see whether there is a combination of them in which all are placed in the same Micro-cloud and together contain all of the keys.


\exists m \in M \ \forall k \in K \ \exists s \in k \ \exists h : s \in h \wedge h \in m,

where m is a Micro-cloud (a set of hosts), M is the set of all Micro-clouds, k is a key (the set of sources that contain its data), K is the set of all keys that are to be processed, s is a source, and h is a host (the set of sources placed on it).

Each such combination found will be checked.
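The existence condition above can be sketched directly in code. This is an illustrative check under assumed data structures (host sets per Micro-cloud, replica-host sets per key); the names are hypothetical, not the thesis API.

```python
def all_in_one_microcloud(microclouds, key_replicas):
    """Return the name of a Micro-cloud that holds at least one replica
    of every requested key, or None if no such Micro-cloud exists.

    microclouds  - dict: microcloud name -> set of its hosts
    key_replicas - dict: key -> set of hosts storing a replica of the key
    """
    for name, hosts in microclouds.items():
        # every key must have at least one replica host inside this cloud
        if all(hosts & replicas for replicas in key_replicas.values()):
            return name
    return None
```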

3.6 Approach Based on the Simplex Algorithm for Operator Placement Problem Solving

This approach is based on the fact that price minimization is the main goal of the algorithms. Therefore, an effort is made to reduce the problem to the transportation problem. It is proven that the simplex algorithm can find the optimal solution of the transportation problem [Chu].

To understand the similarities of OPP and TP, the transportation problem will first be presented. Then, the ways of simplifying OPP will be described. Finally, the process of reduction will be shown.

    3.6.1 Transportation Problem [Chu]

The transportation model determines a minimum-cost plan for transporting a commodity from a number of sources to a number of destinations. It is provable that an optimal feasible solution can always be found with the simplex algorithm. Let there be m sources that produce the commodity for n destinations. At the i-th source (i = 1, 2, ..., m) there are s_i units of the commodity available. The demand at the j-th destination (j = 1, 2, ..., n) is denoted by d_j. The cost of transporting one unit of the commodity from the i-th source to the j-th destination is c_ij. Let x_ij (1 ≤ i ≤ m, 1 ≤ j ≤ n) be the number of units of the commodity transported from the i-th source to the j-th destination. The problem is to determine the values of x_ij that minimize the transportation cost of all commodities from sources to destinations.

The commodities transported from the i-th source have to be equal to the amount of commodity available at the i-th source:

\sum_{j=1}^{n} x_{ij} = s_i, \quad 1 \le i \le m,

and the commodities transported to the j-th destination have to be equal to the j-th destination's demand:

\sum_{i=1}^{m} x_{ij} = d_j, \quad 1 \le j \le n.

Naturally, the total demand must be equal to the total supply:


\sum_{i=1}^{m} s_i = \sum_{j=1}^{n} d_j.

The minimization function of the overall transportation cost looks as follows:

\min z = \sum_{i=1}^{m} \sum_{j=1}^{n} c_{ij} x_{ij}.

Clearly, no negative commodities are to be transported on the paths:

x_{ij} \ge 0, \quad 1 \le i \le m, \ 1 \le j \le n.
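The model above can be made concrete on a toy instance. The following sketch exhaustively enumerates feasible integer allocations of a tiny transportation problem to find the minimum-cost plan; this brute force is for illustration only (real instances are solved with the simplex algorithm, as the thesis does), and all names are hypothetical.

```python
from itertools import product

def transportation_cost(x, c):
    # z = sum over i, j of c_ij * x_ij
    return sum(c[i][j] * x[i][j] for i in range(len(c)) for j in range(len(c[0])))

def brute_force_tp(supply, demand, c):
    """Solve a tiny integer transportation instance exhaustively.
    supply[i] = s_i, demand[j] = d_j, c[i][j] = unit cost on path (i, j).
    Returns (minimum cost, allocation matrix as tuple of row tuples)."""
    m, n = len(supply), len(demand)
    # all non-negative integer rows with the required row sum s_i
    rows = [[r for r in product(range(s + 1), repeat=n) if sum(r) == s]
            for s in supply]
    best, best_x = None, None
    for x in product(*rows):
        # keep only matrices whose column sums meet the demands d_j
        if all(sum(x[i][j] for i in range(m)) == demand[j] for j in range(n)):
            z = transportation_cost(x, c)
            if best is None or z < best:
                best, best_x = z, x
    return best, best_x
```

For supply (2, 1), demand (1, 2) and costs [[1, 3], [2, 1]], the optimal plan ships one unit and one unit from source 1 and one unit from source 2 to destination 2, at total cost 5.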

3.6.2 Connections between the Operator Placement Problem, the Transportation Problem and Linear Programming

The operator placement problem is considered to have a lot in common with the transportation problem. It is even more similar to it than to the warehouse location problem [SPB77], because the choice of communication paths influences the price at the sources, so they cannot be treated separately. The main difference is that specific goods are available at chosen sources (data identified by keys) and each of those goods has to be fully shipped (its transportation ought to be split between sources). This difference can easily be defined as a constraint in the simplex algorithm.

It is not the original version of OPP that is presented as a transportation problem but a simplified version of it. Only one level of destinations is considered. Destinations are not the hosts that can process the data but groups of hosts with similar characteristics - namely the common Micro-cloud. This transformation is made in order to reduce the complexity of the linear programming problem that is going to be solved by simplex.

In the operator placement problem, as in the transportation problem, there is no special unit of transported data. Even if the data of sources is stored in bigger chunks, it is loosely splittable between connections when sending to destinations. Therefore the basis of the problem is considered to be linear programming, not integer programming.

Additionally, OPP has to deal with three additional constraints. An effort is made to present the time constraint and the rules of partitioning data between processing operator partitions as additional constraints for the simplex algorithm. The constraint of non-linear splitting of transfer between sources of a key is supposed to be taken care of by later conversion algorithms.

3.6.3 Reducing an Operator Placement Problem to a Transportation Problem

During this process, concepts of the operator placement problem are represented as concepts of a transportation problem. Values known from the database and from previous calculations, as well as approximations, are used to represent the variables of the transportation problem.

    Let the minimization function z be the total price of the execution.

Sources s_i in the transportation problem will be all potential sources of data in OPP, i.e. host-key pairs. As in step one we are looking for a solution on a Micro-cloud level, destinations d_j will be all Micro-clouds within the system.


A transportation cost c_ij will be the sum of the prices of data retrieval at s_i, data processing at d_j and data transfer from s_i to d_j. The route commodity x_ij will be the number of megabytes to be sent on the path between s_i and d_j.

    3.6.3.1 Counting the cost

Let us say that P_ij is the sum of the prices of execution on s_i (P_si) and on d_j (P_dj), and of the transfer between them, Pt_ij:

P_{ij} = P_{s_i} + P_{d_j} + Pt_{ij}.

The cost c_ij of the transfer on every path is proportional to P_ij, but since c_ij must be a price per megabyte of data, the equation for the cost on a path between s_i and d_j looks as follows:

c_{ij} = \frac{P_{ij}}{q(s_i)},

where q(s_i) is a function giving the size of the data identified by the key belonging to the key-host pair s_i.

The equation for the price P_si goes as follows:

P_{s_i} = T_{s_i} \cdot \frac{\int_{t_0}^{t_0+T_{exec}} pcp_{M_{s_i}}(x)\,dx}{T_{exec}},

where t_0 is the point in time when the query processing begins, T_exec is the approximated execution time of the processing, T_si is the duration of the key data retrieval for this source, M_si is the Micro-cloud where s_i is placed, and pcp_M is the processing cost profile of that Micro-cloud.

The time of the execution on s_i can be presented as shown below:

T_{s_i} = \max\left( \frac{q(s_i)}{V_{st(s_i)} \cdot f_{vm(s_i)}}, \frac{q(s_i)}{V_{t_{i1}}}, \dots, \frac{q(s_i)}{V_{t_{im}}} \right),

where st(s_i) is the source type of this source, V_st() is the speed with which data is retrieved for this source type on a standard virtual machine, V_t are the bandwidths on the output connections, vm(s_i) is the VM type for this source, and f_vm() is the acceleration factor between this and a standard VM type.

The equation for the price Pt_ij:

M_{s_i} = M_{d_j} \Rightarrow Pt_{ij} = 0,

M_{s_i} \ne M_{d_j} \Rightarrow Pt_{ij} = q(s_i) \cdot \frac{\int_{t_0}^{t_0+T_{exec}} ocp_{M_{s_i}}(x)\,dx + \int_{t_0}^{t_0+T_{exec}} icp_{M_{d_j}}(x)\,dx}{T_{exec}}.


When a source and a destination are in the same Micro-cloud, this cost will be equal to zero. When they are in different ones, we need to look at ocp_M(), the output cost profile of the source Micro-cloud, and icp_M(), the input cost profile of the destination Micro-cloud, and multiply the average price during the expected system execution time by the number of megabytes that would be transferred on this path.

    The equation for the price Pdj is as follows:

P_{d_j} = T_{d_j} \cdot \frac{\int_{t_0}^{t_0+T_{exec}} pcp_{M_{d_j}}(x)\,dx}{T_{exec}}.

The time of the execution on d_j for the connection with source s_i:

T_{d_j} = \max\left( T_{s_i}, \frac{q(s_i)}{Vw_{t(d_j)} \cdot f_{vm(d_j)}} \right).

When counting the time of the execution, the execution time at the source should be considered as well, because it was previously correlated with the bandwidths of the transportation path; if they are lower than the processing speed at the destination, the execution time on d_j will depend on them.
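The cost formulas above can be combined into one per-path computation. The sketch below assumes piecewise-constant price profiles sampled per hour, integer time points, and T_dj = T_si for brevity; all function and parameter names are hypothetical illustrations, not the thesis implementation.

```python
def avg_price(profile, t0, t_exec):
    """Average of a piecewise-constant price profile over [t0, t0 + t_exec],
    i.e. (1 / T_exec) * integral of the profile over the execution window.
    profile: list of prices, one per hour; t0, t_exec: integer hours."""
    window = profile[t0:t0 + t_exec]
    return sum(window) / t_exec

def path_cost_per_mb(q_si, t_retrieval, t_exec, t0,
                     pcp_src, pcp_dst, ocp_src, icp_dst, same_cloud):
    """c_ij = (P_si + P_dj + Pt_ij) / q(s_i) for one source/destination pair,
    following the formulas above (a sketch under simplifying assumptions)."""
    p_si = t_retrieval * avg_price(pcp_src, t0, t_exec)
    # assumption for brevity: T_dj equals T_si on this path
    p_dj = t_retrieval * avg_price(pcp_dst, t0, t_exec)
    if same_cloud:
        pt_ij = 0.0   # no WAN transfer inside one Micro-cloud
    else:
        pt_ij = q_si * (avg_price(ocp_src, t0, t_exec) +
                        avg_price(icp_dst, t0, t_exec))
    return (p_si + p_dj + pt_ij) / q_si
```

The example makes the structure of the reduction visible: the transfer term dominates as soon as source and destination Micro-clouds differ, which is exactly what pushes the LP solution toward co-location.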

    3.6.4 Constraints

To let the algorithm find the optimal solution for a problem that is as similar as possible to the specific operator placement problem, constraints on the commodities have to be defined as well.

3.6.4.1 Constraint K: Sources of a key have to produce data of the size of the data identified by that key

This constraint is defined in the general formulation of the OPP. If K is the set of host-key pairs with the same key k, the sum of the data they retrieve from the source has to be equal to the size of the data of that key:

K = \{s_0, s_1, \dots, s_n\} \Rightarrow \sum_{i=0}^{n} \sum_{j} x_{s_i d_j} = q(k),

where x_{s_i d_j} is the amount of the key's data retrieved from s_i and sent to destination d_j.

    3.6.4.2 Constraint D: Time (processing speed) limits inside of Micro-cloud

This is one of the three basic technical constraints in the OPP for Micro-clouds. It defines limits on how much data can come into a destination Micro-cloud. Let Q_0 be the total size of all the keys from the client's request, vmcnt_M the number of VMs within the cloud that are going to be free at the time, and W the approximate number of workerops to be deployed:

d_j \le Q_0 \cdot \frac{vmcnt_{M_{d_j}}}{W}.

Let us say that vmcnt_{M_{d_j}} = W/2. That would mean that only half of the demanded workerops can be deployed in this Micro-cloud. In consequence, we let only half of the whole data flow into this Micro-cloud.

    3.6.4.3 Constraint T: Rules of partitioning data between operator partitions

Constraint T is the one that injects topology-awareness into the problem. It is needed due to the logic behind the tested processing algorithm, word count. In practice, the data flow heading from the source to the worker nodes is spread equally between all of the destinations. If K is the set of all keys to be retrieved, D is the set of all destinations, σ(k) is the size of a key, and x_kd is the transfer size of key k's data to destination d:

\forall k_1 \in K \ \forall k_2 \in K \ \forall d \in D : \frac{x_{k_1 d}}{\sigma(k_1)} = \frac{x_{k_2 d}}{\sigma(k_2)}.
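Constraint T says that every destination receives the same fraction of every key's data. A small sketch can both generate such a split and verify the property; data structures and names here are hypothetical illustrations.

```python
def proportional_split(key_sizes, destination_shares):
    """Split every key's data over destinations in identical proportions,
    as constraint T requires.  destination_shares must sum to 1."""
    assert abs(sum(destination_shares.values()) - 1.0) < 1e-9
    return {k: {d: size * share for d, share in destination_shares.items()}
            for k, size in key_sizes.items()}

def satisfies_constraint_t(x, key_sizes):
    """Check that x[k][d] / size(k) is identical for all keys at each
    destination, i.e. the equality in the constraint above holds."""
    keys = list(key_sizes)
    dests = x[keys[0]].keys()
    for d in dests:
        ratios = {round(x[k][d] / key_sizes[k], 9) for k in keys}
        if len(ratios) != 1:
            return False
    return True
```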

    3.6.4.4 Constraint L: Limit of the retrieval size for every host

If H is the set of host-key pairs for the same host, the sum of the sizes of the data identified by all the keys of those host-key pairs cannot be greater than the parameter of the maximum size of data to be retrieved from one host, Q_max:

H = \{s_0, s_1, \dots, s_n\} \Rightarrow \sum_{i=0}^{n} q(s_i) \le Q_{max}.

    3.7 Approach for Choosing Hosts for Processing in Destination Micro-clouds

The all in one Micro-cloud approach as well as the approach based on the simplex algorithm end up choosing sources and destinations with the granularity of a Micro-cloud. Also, the information given as the output of an algorithm might not be precise enough to be translated correctly into an input of the processing engine. Therefore an algorithm for normalizing a solution and choosing hosts for processing needed to be specified. The subsequent actions of this algorithm are the following:

1. Normalization of unsplittable key sources - previous algorithms might decide that some data retrieval should be split between a few hosts. However, some of the data sources might have unsplittable keys. Therefore, an additional algorithm needs to be run to normalize the solution - by choosing one host for data retrieval in those cases.

2. Counting the number of workers needed in the system - the source host placement should be analyzed to find how many hosts would be needed in the system to process data without delays.

3. Counting the number of workers needed per Micro-cloud - based on the previous information and the outcomes of the placement algorithms, the number of hosts to run processing in every Micro-cloud can be found.

4. Normalization of the solution by conforming sources, keys and transfers to how they will look during execution - the whole solution is normalized according to the topology properties defined in the processing engine for the type of processing specified in the query by the client.

5. Choosing processing hosts in every Micro-cloud - at this point of the algorithm the destination hosts that are going to run the processing can be chosen.
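Steps 2 and 3 above can be sketched as a proportional distribution of an approximate total worker count over the Micro-clouds. This is an illustrative reconstruction with assumed inputs (the data inflow per Micro-cloud produced by the placement algorithm), not the thesis algorithm itself.

```python
import math

def workers_per_microcloud(inflow_mb, total_workers):
    """Distribute an approximate total number of workerop slices over
    Micro-clouds proportionally to the data routed into each of them.

    inflow_mb     - dict: microcloud -> megabytes flowing into it
    total_workers - approximate worker count needed system-wide (step 2)
    Returns dict: microcloud -> number of worker hosts (step 3)."""
    total = sum(inflow_mb.values())
    # round up so that every Micro-cloud receiving data gets at least one worker
    return {mc: math.ceil(total_workers * mb / total)
            for mc, mb in inflow_mb.items() if mb > 0}
```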

    4 System Design

To analyze the placement problem deeply, as well as to test the placement algorithms in an environment as similar as possible to a true Micro-cloud environment, a whole processing framework extended by scheduling components containing the placement algorithms needed to be designed and implemented from scratch. The fundamental technologies needed to be chosen: MongoDB was chosen as the technology for data storage and StreamMine3G was chosen to work as the processing engine in the system (the reasons for those choices are explained later in this Chapter). Also, a lot of effort was put into creating a persistent model of the system, so that it would represent the real system reliably - that was essential for the evaluation of the operator placement algorithms. The system was implemented so that the logic was divided into loosely coupled components that can be easily reused and extended in future work on Micro-cloud environments. The key elements of the system design are presented in this Chapter together with explanations of the choices.


    4.1 Persistent Model of the System

During implementation, a persistent model of the environment was designed. It is compatible with the principal assumptions of the Micro-cloud environment. A short overview of the solutions used in the model is presented in this Chapter.

    4.1.1 Model of the Physical System

In the model, the physical entities need to be represented. In the Micro-cloud environment, there would be three levels of abstraction regarding the physical structure of the system.

The highest level of abstraction would be a view of all of the Micro-clouds within the system. They would have attributes like a name, a host address and a geographical location. For every Micro-cloud, there would also be a network profile defined. It would specify the transfer speeds between the nodes inside the Micro-cloud, as well as the output and input bandwidths.

Every Micro-cloud would consist of one or more racks. A view of all of the racks in the system would be the next abstraction level. The main purpose of racks would be to group the physical machines that work as hosts in the system and are therefore able to retrieve and process data. A view of all of the hosts would be the lowest abstraction level. Every host would have its URL (the datum indispensable for establishing connections with services running on it) and attributes defining its physical characteristics. They would determine the disk read speed and a computation factor (meaning the computation speed of this host compared to some standard instance). Worth mentioning, a host entity represents a VM as well as its host - no differentiation is made between those two in the model.

    4.1.2 Pricing Profiles

Every Micro-cloud in the environment would have its pricing profile defined. It would be defined in such a way that for every period of time (it could be an hour or even one day) the price (per GB) of the input and the output traffic, as well as the price of usage (execution) on the hosts within the Micro-cloud, would be determined.

    4.1.3 Data Sources

Information on every external data source that the system is supposed to use to retrieve data has to have its representation in the persistent model. The main attribute of a data source is its name - this is the identifier used to define queries on it. Every data source would have its main instance listening on some host on some port. Moreover, for every data source there would be a fixed port defined on which it would listen on all of the hosts it is deployed on. Obviously, a set of technical characteristics would need to be defined for every data source: the name of the collection (database etc.) that stores the data, as well as information about the type (historic / real-time), the specific technology, the version and the expected data transfer.

Data sources that do not have an internal system of data-to-host mappings need to have the hosts running their services explicitly defined. In the created system, this concerns real-time data sources.

    4.1.4 Profiles of Worker Algorithms

In the prepared model, it is assumed that a reliable processing speed can be determined for every algorithm that could be used in data processing, for every number of slices of the worker operator - excluding bandwidth limits on connections.

4.1.5 Queries Waiting for Execution and Information on the System State

After the optimal placement is found, every query to be executed would be stored together with the time at which the execution should start. The persistent model would also store information about the system state, with data about hosts that are currently retrieving / processing data as well as information about the hosts that are going to be used for processing in the future.

    4.2 Data Sources

    4.2.1 mongoDB - a Technology for Historic Data Sources

mongoDB was chosen as the technology for storing historic data within the system. The decision was preceded by an analysis of two scalable, high-performance storage systems - mongoDB and the Hadoop Distributed File System (HDFS). In the abstract, they are based on quite different paradigms: mongoDB is a NoSQL document-oriented database, while HDFS is a distributed file system. However, mongoDB comes with a lightweight extension, GridFS, that provides the abstraction layer of a distributed file system over the database. Therefore, both technologies can be used in a very similar manner; however, a number of differences between them makes mongoDB fit the specific architecture of Micro-cloud environments better.

Below, the characteristics of mongoDB (specifically GridFS) are presented, together with a comparison with the characteristics of HDFS. Moreover, their benefits and drawbacks concerning the specific architecture are stated.

    4.2.1.1 File Storage

To provide an abstraction layer of file storage in mongoDB, the GridFS extension is used. This is a specification for storing and retrieving files of a size over 16MB. The logic behind it (and what actually makes that abstraction layer very simple) is that a file is simply stored as a group of chunks of a fixed size. GridFS stores files in two collections: one is a collection of chunks; the second one keeps metadata about the files (and is replicated separately from the chunks collection) [mdba]. It looks different in HDFS. Although the files are divided internally into blocks, externally they are seen as one big stream of data. A chosen number of bytes can be read at a chosen offset [hdfa]. When reading from mongoDB, a whole document/chunk of data needs to be read into memory.

    4.2.1.2 Replication

In mongoDB, replica sets are defined manually. That means every replica set has a given definition containing the nodes that will store its data. Consequently, if one file is placed on the same node as another file, they are placed together on all of the nodes of the replica set [mdbc]. Every replica set has one primary and many secondary members. The primary replica is used for writing. The choice of the source for reading depends on the read preference - it can be based either on member type (only primary, secondaries preferred etc.) or on geographical location (nearest). Since the role of the algorithm implemented during the thesis is to choose the replica hosted in the best place and to place an operator there, a strategy that allows reading from any replica was chosen [mdbb].

The approach used in mongoDB is clearly different from how HDFS works. There are no fixed replica sets specified. Instead, the client sets a replication factor and assigns data-storing nodes to racks. The replication algorithm in the Hadoop File System is rack-aware. There is a rule for storing the first three replicas: the first is stored on some node, the second is stored on a node in another rack, and the third one is stored in the rack of the first replica. All the other ones are stored randomly [hdfb].

    4.2.1.3 Sharding

Since data replication in mongoDB always takes place between strictly defined nodes, data partitioning is needed to allow many replica sets while still keeping it as one system. Therefore shards are introduced [mdbd]. Sharding partitions a collection to store portions of data in different replica sets. It takes care of an even distribution of data over machines (shard balancing). Shard keys need to be defined by picking fields of the documents stored in the database.

Because of the way horizontal scalability is built into HDFS, there is no place for anything like sharding. The system automatically scales out when new nodes are added.

4.2.1.4 MongoDB, HDFS and the specifics of the Micro-cloud environment

During a run of the placement algorithm, the nodes that should host operator slices are found. It is the system's assumption that source operators are always co-located with sources. Therefore it was important to look for a solution that lets the file system's client, run by the operator, access the local data directly, without connecting to the main process (through the network). It was proved that such a solution works for mongoDB, as every mongoDB daemon process (mongod) gives the same access to the data as mongoDB's shard routing service process (mongos). A small difference appears when a client is about to read a GridFS file (file names are not replicated together with data); therefore operators are given a file's id instead of a file's name as an input. It was not checked whether the same solution is possible with HDFS.

The main problem with HDFS appears when considering the architecture of a Micro-cloud environment. Since there are many data centers over which data should be replicated, more than one replication level is needed. One could consider bringing rack-awareness to the level of Micro-clouds and skipping awareness on the rack level. But Micro-clouds need special policies connected with geographical distribution in regard to the clients. MongoDB, enabling manual replica set creation together with sharding using sharding tags (specific ranges of a shard key can be associated with a specific subset of shards [mdbe]), is much more convenient and hence much easier to fit into the specifics of a Micro-cloud environment.

    4.2.2 Live Data Sources

Live (real-time) data sources are the second group of sources that can be used to retrieve data. By definition, the data retrieved by them is sent directly to be processed and is not identified by any keys. Examples of such data sources could be Web crawlers or instances receiving real-time data from Smart Grids.

Within the system, an exemplary live data source was implemented. It is a process with two child threads. One of them is the connection server. It listens on a port and accepts connections. New connections are added to a list of sockets that is shared between the threads. The second thread is the responder. In every iteration, it takes all of the sockets on the list of connections and sends a new chunk of data to each of them.

The implementation of that source is not prepared for high loads - it was made for test purposes only. Nevertheless, the current architecture of the system assumes that there would in fact always be only one client connecting to the socket, moreover on the same machine.
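The described live source can be sketched in a few lines. This is a hypothetical single-loop reduction (the thesis prototype uses two threads and a shared socket list); it serves the single co-located client the architecture assumes.

```python
import socket
import threading
import time

def serve_chunks(port_holder, chunks, interval=0.01):
    """Minimal live-source sketch: listen on an ephemeral localhost port,
    accept one client and push a fixed list of data chunks to it."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))                # port 0: let the OS pick one
    srv.listen(1)
    port_holder.append(srv.getsockname()[1])  # publish the chosen port
    conn, _ = srv.accept()
    for chunk in chunks:
        conn.sendall(chunk)                   # one "event" per iteration
        time.sleep(interval)
    conn.close()
    srv.close()
```

A client simply connects to the published port and reads until the server closes the connection, receiving the concatenated chunk stream.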

    4.3 StreamMine3G Platform - a Technology for Event Processing [sm3]

StreamMine3G was chosen as the processing engine for event streams within the system. It has an open logic for running either continuous or batch processing. It can also work according to the MapReduce model. Therefore, it is a good choice for a system in which the processing engine has to deal with high loads of data coming from both historic and live data sources.

StreamMine3G is an event processing engine designed for high scalability, elasticity and fault tolerance. It can be run as a cluster of nodes, each of which has to run StreamMine3G as well as ZooKeeper. ZooKeeper is a centralized service that takes care of maintaining the current configuration of the cluster, moreover providing naming and distributed synchronization [zk].

There are two types of nodes within StreamMine3G - master and worker. There should be one master node hosting the manager. Its role is to conduct the jobs over the worker nodes by deploying operators on them and taking care of removals when the job is done.


Operators within the system can receive data either from external data sources or from other operators. Based on that, they are called either access operators (in short: accessop) or worker operators (workerop).

All of the data that goes through the StreamMine3G cluster should be considered as streams - unbounded sequences of tuples/events. Following the potential use cases of the system, these could be streams of World Wide Web pages.

StreamMine3G allows arbitrary topologies to be defined. Every operator can have several upstream operators and several downstream operators. Upstream operators send events to the operator; downstream operators receive events from it. Topologies are defined by the manager.

Every operator can consist of a number of slices. Slice deployment can be seen as a physical mapping between operators and cluster nodes. However, a slice is not only a deployment of an operator on a node but also a partition of the operator. That means it can be defined which data heading for the operator reaches exactly that partition of it. The components taking care of forwarding data portions to the proper slices are called partitioners.

    4.3.1 Accessop implementation

The role of accessop is to provide universal access to all of the external event sources that the system is supposed to use.

    4.3.1.1 MongoDB data source adapter

The MongoDB data source adapter ensures access to the processes of the mongoDB system in the special way provided by the GridFS abstraction level, modified to let the process communicate directly with the local node, skipping the connection with the mongos process.

An accessop reading data from a mongoDB data source always has a fixed list of files (or parts of files) to be read defined at its input.

Changes in the GridFS implementation There is a GridFS implementation that comes together with the mongoDB library. Since it targets the scenario in which a replica connects to the mongos process, it did not fully fit the needs of applications running on every StreamMine3G node. The original GridFS requires a file name to be given at the input. But the system is built in such a way that the collection with file metadata is not accessible on every node (actually only on the nodes of one shard). Therefore, the code needed to be changed so that files are found not by file name but only by file id (which is stored as metadata with every chunk's data).

The other change enabled the listing and reading of files from any replica. To achieve that, the queries in the methods realizing those functionalities were extended by setting a query option so that the system would let them run on secondary replicas (slaves).

    4.3.1.2 Real-time data source adapter

The real-time data source adapter basically connects to the socket at a given host and port address. What differs from the mongoDB data source adapter is that, instead of a definition of which data is to be read, the work period is limited.

    4.3.2 Mapper, Workerop and Partitioner implementation

The implementation of the operators running the word count algorithm inside the system is based on the code available on the StreamMine3G site and works according to the MapReduce model. It is only extended by a few mechanisms. The mapper operator, the one with slices always co-located with accessop instances, takes the data of the incoming event and splits it into single words. Each of the words is emitted as a separate event.

workerop plays the role of the reducer in the MapReduce paradigm. This is a stateful operator that keeps a map of the words that have already arrived with events, together with counters for each of them.


During event processing, the counter of the appropriate word is incremented. When the work finishes, every slice stores the final version of its state. Since every slice was receiving a different range of words, the final state of each slice contains the final outcomes for the words in that range.

To guarantee an even distribution of events between all of the workerop slices, a custom partitioner was implemented. It analyzes both the incoming partition key and the incoming event. The mapper can define that it wants an event to be broadcast (this is used to notify about the end of the stream). When the partition key has another value, the hash of the word stored in the event buffer is computed. Then the slice number that should receive the event can be computed with a simple formula.

    sliceNumber = floor( (double)hash / double(0xffffffff) * double(slicesCount) )
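    A literal C++ rendering of this formula might look as follows. The hash function itself is not specified in the text; note also that the formula as written maps hash = 0xffffffff to slicesCount, one past the last slice, so this sketch clamps the result to stay in range:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Map a 32-bit hash of a word onto a slice number in [0, slicesCount).
int sliceNumber(uint32_t hash, int slicesCount) {
    int s = static_cast<int>(std::floor(
        static_cast<double>(hash) / static_cast<double>(0xffffffffu)
        * static_cast<double>(slicesCount)));
    // hash == 0xffffffff would otherwise yield slicesCount (out of range).
    return std::min(s, slicesCount - 1);
}
```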

    4.3.3 Implementation of the manager

    The StreamMine3G manager is the component of the system that receives information about newly scheduled tasks from the Scheduler component and is responsible for the correct deployment of all the operators needed to process the queries, as well as for cleaning up once the processing finishes. Basically, it calls methods of the so-called cloud controller, which implicitly takes care of spreading the deployment data between the nodes with the help of ZooKeeper.

    In the current implementation, the database is used as a communication channel between the Scheduler and the StreamMine3G manager. Every second, the manager queries the database for the tasks whose execution needs to be started during that second.
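    The selection the manager performs each second can be sketched as follows; Task and dueTasks() are illustrative names, and the real implementation issues a database query inside a once-per-second loop rather than filtering an in-memory list:

```cpp
#include <cassert>
#include <string>
#include <vector>

// A scheduled task with the second at which its execution must start.
struct Task {
    std::string id;
    long startSecond;
};

// Return the ids of the tasks whose execution starts during the given second,
// i.e. what the manager's once-per-second database query would fetch.
std::vector<std::string> dueTasks(const std::vector<Task>& scheduled,
                                  long nowSecond) {
    std::vector<std::string> due;
    for (const auto& t : scheduled)
        if (t.startSecond == nowSecond) due.push_back(t.id);
    return due;
}
```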

    To guarantee that the manager works deterministically, the next step of the deployment process is called only after the asynchronous notifications about a successful deployment of the previous step have arrived. An example of controlling the deployment of a task with two operators, each of which has one slice, is presented in Figure 4. It shows that during the deployment every action is executed on every operator / slice (depending on the deployment state) before any other action is executed on any of the operators / slices. This is possible thanks to an (atomic) counter that counts up to the moment when the appropriate number of responses from the cloud controller has been reached. Information about the properties of deployed objects is accessible through maps (string → operator, string → slice). Furthermore, the relationships between objects (slices, their operators, their tasks) are stored by the manager.
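    The counting mechanism can be sketched as a small barrier class; the names are illustrative, not StreamMine3G's API:

```cpp
#include <atomic>
#include <cassert>

// Barrier for one deployment step: the next step may be triggered only after
// the cloud controller has acknowledged the current step for every
// operator / slice it was applied to.
class DeploymentStepBarrier {
public:
    explicit DeploymentStepBarrier(int expectedAcks) : expected_(expectedAcks) {}

    // Called from the asynchronous notification handler; returns true exactly
    // once, when the last expected acknowledgement arrives.
    bool onAck() { return ++acks_ == expected_; }

private:
    const int expected_;
    std::atomic<int> acks_{0};
};
```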

    During the removal of slices and operators a different logic is used. Every slice sends a notification to the manager each time one of its sources reaches the end of its stream. These notifications are counted on the manager's side; when their number reaches the number of sources of some slice, the removal of that slice is initiated. In general, slices can be removed at various points in time (depending on when they finish their operations); there is no need to delay this operation for other slices. The removal of an operator is initiated when the number of slices of that operator that are still working reaches zero. Every time a slice of an operator is removed, the information about the busy state of its host is updated, so that the placement algorithms running simultaneously have an up-to-date view of the system state.

    4.4 Design and Implementation of the Tasks Scheduler

    The Tasks Scheduler is the component of the system that is responsible for finding placements for the queries defined by clients and scheduling them for execution by the processing engine. Precisely, it has the following roles:

    • Provide an interface to the client that allows defining queries on both historical and live data sources within the system
    • Run the functionality of the operator placement solution search
    • Give the client information about the costs and time of processing his query
    • Translate found solutions into the standard topology definition that is the input for the StreamMine3G manager


    Figure 4: Simple presentation of the message exchange between three components during the deployment of a task with two operators, each of which has one slice


    Figure 5: An interface of the Scheduler component

    The interface of the Scheduler is presented in Figure 5. It consists of four methods.

    The first two methods deal with input data definition. The methods useLiveDataSource() and useHistoricalDataSource() let the client define what data should be processed during the query execution. Every source in the system has a unique name stored in the database. Choosing a live data source as one to be used by the query means that all the data produced by that source during the query execution will be part of the input for the system. When a historical data source is chosen, a list of keys to be retrieved from this source needs to be provided. The data type of that source's keys depends on the data source's technical type; for mongoDB these are unique strings (file ids), not file names. It is assumed that the mapping from file names to unique strings is done by an external component responsible for translating a business query into a technical query.

    The two other methods let the client define the details of the query, as well as choose the found operator placement solution that best matches the requirements. A ClientQuery object is expected as the input of the runPlacement() method. It contains information about the way the data should be processed (the worker algorithm type) and the time when the query processing is supposed to be started (it can be defined as null, which means as soon as possible). Two further fields let the client define the preferred execution time and price for the query, but finding such a solution is out of the scope of the Scheduler component. Instead, the runPlacement() method returns a list of possible placement solutions found during the algorithm's execution and lets the client choose the most adequate one. To confirm one of the solutions, the client calls the confirmSolution() method, passing the number identifying the chosen solution as a parameter.
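    The four methods could be sketched as the following abstract interface. The signatures and the two helper structs are assumptions; only the method names and the roles described above come from the text:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative input/output structures (field names are assumptions).
struct ClientQuery {
    std::string workerAlgorithmType;  // how the data should be processed
    long startTime;                   // 0 stands in for "null" = as soon as possible
    double preferredTime;             // preferred execution time (advisory only)
    double preferredPrice;            // preferred price (advisory only)
};

struct PlacementSolution {
    int number;   // identifier passed back to confirmSolution()
    double price;
    double time;
};

// Sketch of the Scheduler interface shown in Figure 5.
class Scheduler {
public:
    virtual ~Scheduler() = default;
    virtual void useLiveDataSource(const std::string& sourceName) = 0;
    virtual void useHistoricalDataSource(
        const std::string& sourceName,
        const std::vector<std::string>& keys) = 0;
    virtual std::vector<PlacementSolution> runPlacement(
        const ClientQuery& query) = 0;
    virtual void confirmSolution(int solutionNumber) = 0;
};
```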

    4.4.1 General Architectural Approach

    An effort was made to make the system easily configurable from the outside; therefore, properties files were introduced. As a result, the points of the implementation that encapsulate algorithms which can be classified as strategies have been left open to a simple exchange based on the properties file content. Right now, this applies mainly to the simplex-based placement algorithm.

    4.4.2 Component Model of Scheduler

    Figure 6 presents the components of the system. Beside the processing engine, StreamMine3G, there is the Scheduler component as well as the SystemState component, which provides common methods for system state access to both other components. The Scheduler consists of a few subcomponents, each of which is described below.


    Figure 6: Components inside of the system together with their dependencies


    4.4.2.1 SchedulerInput Component

    SchedulerInput consists of a class that implements the system interface and is responsible for controlling the communication between the other internal components and external components (Scheduler clients). It keeps the basic data that is exchanged with the client and that needs to be shared between method calls, such as the mapping of keys to hosts for every defined data source, the client's query, and the list of solution graphs once they are created.

    4.4.2.2 Mapper Component

    The Mapper component implements the functionality of building host-to-key maps for every data source. It is called by SchedulerInput every time useLiveDataSource() or useHistoricalDataSource() is called with correct parameters.

    In the system, mapping retrieval is possible for two types of data sources: live and historical (based on MongoDB technology). The classes implementing the mapping share a common method inherited from an abstract parent class. Its task is to construct a map from every given key to every system node that enables that key's retrieval.

    For live data sources the map creation is trivial. There are no specific keys defining the data of those sources, so basically every node running the data source is placed in the map. The information about which nodes run each live data source is stored in the Micro-cloud persistent model.

    For mongoDB data sources, building the map requires connecting to the mongos process. The process consists of two parts (see Figure 7).

    The first part builds a map of pairs: mongo key → set of hosts. Such a data structure could potentially be kept inside the system and refreshed only from time to time; right now it is built up every time the mongoDB source keys mapper is called. In the first step of this part, a map of shard names to the hosts holding replicas of those shards is created. In fact, every shard definition is stored in the mongos configdb as a string containing the shard's name and a list of hosts that belong to its replication set, so this operation is basically string parsing. Then, in the next step, all keys in the system are iterated. Every key identifies a group of chunks of one file (sometimes a file is split to keep an equal distribution between shards). For every key, the file id, the first chunk and the name of the shard storing it are retrieved. That allows translating the previous map into the map expected as the outcome of this part.

    The second part finds the keys given as the system input in the map of mongo keys. The definitions of these two kinds of keys differ slightly: keys given as input represent whole files, while mongo keys represent groups of chunks of one file stored in the same shard. Consequently, the hosts-to-keys map that is the output of the method may contain more keys than were given in the input.
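    The two parts can be sketched as follows. The data structures are illustrative; in the real implementation the first part is driven by parsing the shard definitions retrieved from the mongos configdb:

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// A mongo key names a group of chunks of one file stored on one shard.
struct MongoKeyInfo {
    std::string fileId;
    std::string shard;
};

// mongo key -> hosts of the replication set that stores it.
using MongoKeyMap = std::map<std::string, std::set<std::string>>;

// Part 1: combine "shard -> hosts" (parsed from configdb) with
// "mongo key -> shard" into "mongo key -> hosts".
MongoKeyMap buildMongoKeyMap(
        const std::map<std::string, std::set<std::string>>& shardToHosts,
        const std::map<std::string, MongoKeyInfo>& mongoKeys) {
    MongoKeyMap out;
    for (const auto& [mongoKey, info] : mongoKeys)
        out[mongoKey] = shardToHosts.at(info.shard);
    return out;
}

// Part 2: resolve input keys (whole files) to mongo keys (chunk groups);
// one input key can expand to several mongo keys when the file spans shards.
MongoKeyMap resolveInputKeys(
        const std::vector<std::string>& inputFileIds,
        const std::map<std::string, MongoKeyInfo>& mongoKeys,
        const MongoKeyMap& mongoKeyMap) {
    MongoKeyMap out;
    for (const auto& fileId : inputFileIds)
        for (const auto& [mongoKey, info] : mongoKeys)
            if (info.fileId == fileId)
                out[mongoKey] = mongoKeyMap.at(mongoKey);
    return out;
}
```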

    4.4.2.3 Placer component

    The role of this component is to find solutions for the placement of all the keys given as input to the system, whose distribution is determined by the Mapper component. The functionality of the component is initiated by SchedulerInput during the runPlacement() method; it is supposed to be called once, after all the needed data sources have been defined. As input, it receives:

    • A set of structures representing every data source defined by the client, each of which consists of:
      – a data source definition
      – a structure with a bidirectional mapping between keys and the nodes hosting their data
    • The client's query


    Figure 7: An example of combining sharding data from mongoDB with the keys requested by the client, processed in two parts

    Placer is built in such a way that the preplacement, placement and postplacement algorithms can be exchanged independently of each other. During every call, no matter which algorithms are used, the following order of calls is kept:

    runAlgorithm()
        call ::prePlacement()
        call ::doRunAlgorithm()
        call ::postPlacement()
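    This fixed call order is the classic template method pattern; a sketch, where the class name and access levels are assumptions and only the three phase names come from the trace above:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Template-method sketch of Placer's fixed call order: concrete placement
// strategies override the three phases, while runAlgorithm() stays fixed.
class PlacementAlgorithm {
public:
    virtual ~PlacementAlgorithm() = default;

    void runAlgorithm() {  // the fixed order described in the text
        prePlacement();
        doRunAlgorithm();
        postPlacement();
    }

protected:
    virtual void prePlacement() = 0;
    virtual void doRunAlgorithm() = 0;
    virtual void postPlacement() = 0;
};
```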

    The roles of those subsequent procedures are described below.

    Preplacement A few steps are taken before any of the placement algorithms. Their role is basically to analyze and correct the data received from other components before the placement algorithms start working on it.

    In the first step, the time of the execution is approximated. This value is needed by the further steps of preplacement, but may also be used by the placement algorithms (the simplex-based algorithm uses it). No big effort was made to make the approximation algorithm very accurate; it is assumed that a very general approximation of the possible execution period length is good enough at this point. The algorithm respects the condition that only data about Micro-cloud profiles for the next 48 hours is reliable; therefore, client queries whose start time lies more than 48 hours ahead are rejected at this point.

    Two scenarios are considered separately: either there is some historical source or there is none. If there is, only the time of historical data retrieval is taken into account. The time approximation is then based on two very simple (arguably too simple) assumptions: the total processing speed of the data equals its retrieval speed, and on an average host, data of the size of two average-size keys is going to be retrieved. These assumptions can be revised after enough algorithm runs to make the approximation more reliable.

    When there are no historical keys, the time defined for real-time data retrieval is taken into account. However, the approximate time period lengths are compared with the maximum time for reliable analysis (ending when the pricing profiles become unreliable, so 48 hours from now). The lower of the two values becomes the execution period length used in the further algorithms.

    Since only the time period computed in the previous step is taken into account right now, real-time keys need to be changed as if they were to retrieve data only within this period of time. The part of their work that falls after that period is left out of account, under the assumption that it can be treated as another scheduling problem.

    During the execution period whose length was just approximated, some of the nodes said to host the needed keys can be busy (executing other tasks); they should therefore be out of the interest of the placement algorithms. Consequently, a procedure is run that checks each of those hosts for being on the busy list during the approximated execution period. If that is the case, the host is removed from the mapping.

    Placement The role of this procedure is to run the placement algorithms. As shown in Sections 3.4, 3.5 and 3.6, there are three main approaches to finding placement solutions. During placement, several of them can be called subsequently to build up a list of possible solutions that can then be checked for feasibility and evaluated by the client.

    Postplacement After a run of any of the placement algorithms, two key steps have to be taken:

    1. Run the computations to approximate the price and time of the execution
    2. Run a final analysis of the feasibility of a given solution

    The algorithm for approximating a solution's price and time is described precisely in Section 5.2. After it is executed, the solution can be checked against the feasibility rules.

    For a solution to be feasible, all of the hosts that are to be used for retrieval or processing need to be free during the period of time determined for each of them during the solution analysis (the price and time approximation procedure). If some host does not meet that expectation, the whole solution is marked as infeasible and is no longer treated as a solution of the placement problem. The lists of hosts that caused the feasibility check to fail are stored for potential future reruns of the placement algorithms.
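    The feasibility rule can be sketched as follows. Interval and the busy-list shape are assumptions; the returned list corresponds to the stored list of hosts that caused the failure:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Half-open time interval [start, end) during which a host is used or busy.
struct Interval {
    long start;
    long end;
};

inline bool overlaps(const Interval& a, const Interval& b) {
    return a.start < b.end && b.start < a.end;
}

// A solution is feasible when every host it uses is free for its assigned
// interval. Returns the offending hosts (empty result => feasible), which is
// exactly the list the text says is stored for future algorithm reruns.
std::vector<std::string> infeasibleHosts(
        const std::map<std::string, Interval>& solutionUsage,
        const std::map<std::string, std::vector<Interval>>& busyList) {
    std::vector<std::string> failed;
    for (const auto& [host, used] : solutionUsage) {
        auto it = busyList.find(host);
        if (it == busyList.end()) continue;  // host has no busy periods
        for (const auto& busy : it->second)
            if (overlaps(used, busy)) { failed.push_back(host); break; }
    }
    return failed;
}
```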

    4.4.2.4 PriceTimeApproximator Component

    PriceTimeApproximator is called during the postplacement routine inside the Placer component. Its role is to analyze solution graphs to find partial times and prices, as well as comprehensive times and prices, using the algorithm described in Section 5.2. As input, it receives a graph, the expected start time of the execution and the worker algorithm type. It works directly on the graph and sets all of the computed values inside its structures.

    4.4.2.5 StreamMineTaskPrepare Component

    StreamMineTaskPrepare is responsible for the translation of a graph into StreamMine3G manager input. It is called by SchedulerInput during the storeExecution() method run, naturally only if some solutions were already prepared by Placer. As parameters it receives the structures indispensable for preparing the manager's input: the solution graph chosen by the client and the client's query.

    An algorithm is implemented whose role is to analyze the structures that encapsulate a found solution and translate them into the StreamMine3G manager's input.

    The mapping has to be done in a specific way for every worker operator algorithm. For the word count algorithm that the created system implements, three types of operators have to be placed within the mapping. Below, the way of describing the properties for each of the operators is presented.

    Name For every task a unique id is generated, and the operators of this task have their names built on it (to guarantee uniqueness). The source operator's name is built according to the schema: taskuid operatortype uniquenumber (as there are many accessops, each of them gets its own unique number). The other operators (mapper and workerop) get names of the form: taskuid operatortype workeralgorithmname.

    Wire with In the definition of every operator there is a place for the names of the operators downstream of it. This list is defined for every accessop and contains the name of the mapper of this task; the list defined for mapper contains the workerop's name.

    Library path A properties file used during the translation contains the library paths for every type of operator; those paths are used here.

    Partitioner path As with library paths.

    Parameters Currently, parameters are set only for the accessops. They are filled in as follows:

    • Host - name of a source host (in practice, a URL)
    • Port - port on which the service listens on the host (specified in the data source definition)
    • Source name - collection name (specified in the data source definition)
    • Partition key - id of the mapper's slice placed on the same node
    • Data source implementation type, read preference type - set based on the data source type taken from its definition
    • Time limit - set for real-time sources, as defined in the client's query
    • General keys - set for mongoDB data sources; a list of mongoDB file keys, each defined by a key string, the first chunk number and the last chunk number to be retrieved on this node

    Hosts This is the list of the nodes that are going to host any of the slices of the operator. For workerop, this is the list of names of the destination hosts in the graph; for mapper, the list of names of the source hosts in the graph; for each accessop, the name of one of the source hosts.

    Key range size This parameter is set because of the way partitioning between the accessops and the mapper works. mapper is set to have a key range size equal to the number of its slices (so equal to the number of sources).

    End-of-stream signals to shut down slice This value equals the number of original sources that deliver data to the operator slice. For mapper and workerop slices this is the number of accessops; for accessops it equals 1.

    4.4.2.6 SystemState Component

    This component was created to keep the methods common to the StreamMine3G manager and the Scheduler. Basically, they provide the Scheduler with access to the current system state and a possibility of adding expected future changes to it, as well as giving the manager a possibility of adding actual changes to the system state. Right now, this amounts to keeping the state, and the expected changes, of host busy times up to date.

    4.4.3 Data Flow within the System

    The exemplary flow of the data about one key included in the client's query is presented in Figure 8; in this example it is a mongoDB key. At the input of the system, it is defined as a string identifying a file. It is sent in this form to the Mapper component, from which it returns as a few groups of chunks that are parts of the file, each of which has a set of hosts holding replicas of this data assigned. The Placer component takes this data as input and converts it into pairs: group of chunks → one host. It might be the case that one of the previous groups of chunks is split into a few by Placer. Those pairs are sent to StreamMineTaskPrepare.


    Figure 8: Presentation of how the data description of an exemplary mongoDB key changes during the execution of the Scheduler


    4.5 Implementation of the Placement Algorithms

    4.5.1 Implementation of All in One Micro-cloud Approach

    Micro-clouds that contain all the needed data are found easily, in two steps:

    • The keys-to-hosts map that is the input of the Placer component is translated into a keys-to-Micro-clouds map
    • The Micro-clouds that turn out to have all of the keys defined in the map are the Micro-clouds of the algorithm's interest

    What is left after that is to create a solution using only nodes inside the Micro-cloud, for every Micro-cloud found. This is done in the following steps:

    • A set of the nodes inside the Micro-cloud that host any of the keys' data is created
    • For every host that is an element of that set, all of the keys it stores are taken
    • The host → key pairs from the previous steps are used as the definitions of sources within the system
    • The destination is defined generally as the Micro-cloud; choosing concrete nodes as hosts of the worker operators is a part common with the other algorithms and is described in another section

    The steps for choosing sources for keys described above are a rather simplistic solution and might be extended to a more adequate one. Since it is assumed that the data of one key is replicated no more than once within one Micro-cloud, the solution of always picking the first host of every key can be considered good enough.
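    The two-step search can be sketched as follows; the host-to-Micro-cloud lookup and the map shapes are illustrative assumptions:

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// key -> set of hosts holding a replica of that key's data.
using KeysToHosts = std::map<std::string, std::set<std::string>>;

// Return the Micro-clouds that contain all of the needed keys.
std::vector<std::string> microcloudsWithAllKeys(
        const KeysToHosts& keysToHosts,
        const std::map<std::string, std::string>& hostMicrocloud) {
    // Step 1: translate the keys-to-hosts map into a keys-to-Micro-clouds map.
    std::map<std::string, std::set<std::string>> keysToClouds;
    std::set<std::string> allClouds;
    for (const auto& [key, hosts] : keysToHosts)
        for (const auto& h : hosts) {
            keysToClouds[key].insert(hostMicrocloud.at(h));
            allClouds.insert(hostMicrocloud.at(h));
        }
    // Step 2: keep only the Micro-clouds that hold every key.
    std::vector<std::string> result;
    for (const auto& cloud : allClouds) {
        bool hasAll = true;
        for (const auto& [key, clouds] : keysToClouds)
            if (!clouds.count(cloud)) { hasAll = false; break; }
        if (hasAll) result.push_back(cloud);
    }
    return result;
}
```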

    4.5.2 Implementation of an Approach Based on Simplex Algorithm

    The general task of the implementation of this approach is to prepare structures so that they can be used to form an input for the external simplex-solving component, and to store the data that comes as its output. Therefore, for every potential connection of the solution's graph, a special object is created that implements the whole complexity of the considered problem as if it were the transportation problem. It thus consists of the getter/setter methods:

    double getC();                      // cost coefficient of this connection
    double getTransfer();               // transfer assigned by the simplex outcome
    void setTransfer(double transfer);  // set the transfer from the simplex outcome

    They encapsulate the abstraction of computing the cost for the problem, and they let the transfer be set according to the simplex outcome and read later for further processing. There is also a group of methods that provide access to all of the objects defining this connection: the source s, its key k and host h, as well as dest