
Data Mining for the Internet of Things with Fog Nodes

Ivan Kholod, Ilya Petuhov, and Maria Efimova

Saint Petersburg Electrotechnical University “LETI”, Saint Petersburg

Abstract. The paper describes an approach that applies an actor model to execute Data Mining algorithms for analyzing data in IoT systems with a distributed architecture (with Fog Computing). The approach moves the computational load closer to the data, thus increasing the performance of the analysis and decreasing network traffic. The paper shows the execution of the 1R algorithm in an IoT system with a distributed architecture and compares the distributed and centralized architectures.

Keywords: Internet of Things · Fog Computing · Data Mining · Distributed data mining · Actor model

1 Introduction

The volume of stored information obtained from different devices (sensors, cameras, mobile phones and others) is growing rapidly. These devices, connected by the Internet, are called the Internet of Things (IoT). Cisco analysts consider the period of 2008–2009 to be the birth of the Internet of Things because during this period the number of devices connected to the Internet exceeded the population of the Earth [1], thus turning the ‘Internet of People’ into the ‘Internet of Things’. According to Gartner, Inc. (a technology research and advisory corporation), there will be nearly 26 billion devices in the Internet of Things by 2020 [2]. Therefore the amount of information coming from those devices will increase over time.

Today this kind of information is referred to as Big data. It is characterized by large volumes of data, a variety of types and rapid generation. Such data is collected from sensors in IoT systems. Data analysis is an important task in such systems.

Scalable data processing systems are used to perform analysis (including intellectual analysis). Examples of such systems are Apache Hadoop and Apache Spark. They are used to process huge amounts of data like those in the systems of Google, Yandex and popular social networks. However, they require centralized storage of the processed data and do not allow the computational load to be relocated closer to the data sources, which would reduce traffic and therefore increase the speed of the analysis.

Lately, IoT systems with fog nodes have become more popular. They are an alternative to IoT systems with a centralized architecture. Such systems use fog nodes to preprocess data. This paper describes an approach that distributes the analysis between nodes and moves it closer to the data.

© Springer International Publishing AG 2016
O. Galinina et al. (Eds.): NEW2AN/ruSMART 2016, LNCS 9870, pp. 25–36, 2016.
DOI: 10.1007/978-3-319-46301-8_3

The paper is organized as follows. Section 2 reviews the approaches to creating data mining systems for the IoT. Section 3 describes a general approach that maps a decomposed algorithm onto the blocks of the actor model and onto an IoT system with a distributed architecture. Section 4 discusses the experiments and compares the approach with a centralized solution. Section 5 concludes the paper.

2 Related Work

Most of the data mining systems for the IoT have a multilayer architecture (Fig. 1) of four levels [3, 4]:

1. The devices layer is the bottom layer. It can be viewed as a hardware or physical layer which performs data collection.

2. The data gathering layer is responsible for connecting the devices layer and the application layer, enabling data transfer between them. It also performs cross-platform communication, if required.

3. The data processing layer is responsible for critical functions such as device and information management and also takes care of such issues as data filtering, data aggregation, semantic analysis, access control and information discovery.

4. The layer of data analysis services provides services or applications that integrate or analyze the data received from the other two layers.

The last level provides services to execute different analytical tasks. The majority of existing IoT systems have a centralized architecture. The data there is collected in a single storage and is processed by the analytical services, which are also executed on a single computing cluster. There are two approaches to building a centralized analytical service:

[Figure 1. Panel labels: Devices layer, Data gather layer, Data processing layer, Data analysis services; (a) IoT system, (b) IoT system with an external analytic cloud]

Fig. 1. Data mining for IoT systems with centralized architecture: (a) using an internal data mining system, (b) using an external data mining cloud


• integration within existing cloud storages and analytic services [5];
• implementation of new services based on existing scalable analysis systems [4].

Cloud analysis services provided by established companies can be used to implement the first approach.

Azure Machine Learning (Azure ML) [6] is a SaaS cloud-based predictive analytics service from Microsoft Inc. It was launched in February 2015. Azure ML provides paid services, which allow users to execute the full cycle of Data Mining: data collection, preprocessing, feature definition, choice and application of algorithms, model evaluation and publication. The service is intended for experienced users with knowledge of machine learning algorithms.

Azure ML can import data from local files, online sources and other cloud projects (experiments). The reader module loads data from external sources, the Internet or other file storages.

In April 2015 Amazon launched its Amazon Machine Learning service, which allows users to train predictive models in the cloud [7]. This service provides all stages required for data analysis: data preparation, construction of a machine learning model, its tuning, and eventually the prediction. The user can build and fine-tune predictive models using large amounts of data.

It allows users to analyze data stored in other Amazon services (Amazon Simple Storage Service, Amazon Redshift, or Amazon Relational Database Service). To scale computations, the service uses Apache Hadoop.

Google made its Cloud Machine Learning platform [8] available to developers in March 2016. It is a managed platform that enables users to build machine learning models. The platform provides pretrained models and helps to generate customized models. It allows users to apply the neural network based machine learning methods that are used by other Google services, including Photos (image search), the Google app (voice search), Translate, and Inbox (Smart Reply).

All of these services are provided to client applications through a REST API. Users can only analyze data stored in Google storage and cannot add new machine learning algorithms.

Scalable data analysis systems can be used to implement the second approach.

Apache Spark Machine Learning Library (MLlib) [9] is a scalable machine learning library for the Apache Spark platform. It consists of common learning algorithms: classification, regression, clustering, collaborative filtering and others. Spark has its own implementation of MapReduce, which keeps data in memory (whereas Apache Hadoop uses disk storage). This increases the performance of the algorithms.

Apache Mahout [10] is also a data mining library based on the MapReduce paradigm. It can be executed on Apache Hadoop or Spark based platforms. It contains only a few data mining algorithms for distributed execution: collaborative filtering, classification, clustering and dimensionality reduction. Users can extend the library by adding new data mining algorithms. The core libraries are highly optimized and also show good performance for non-distributed execution.


The disadvantage of IoT systems with a centralized approach is that all the data must be sent from the sources to the place where it will be analyzed. This increases the network traffic and the overall analysis time. It becomes a significant restriction when analyzing big data in real time.

Fog Computing, which has recently become popular, is an alternative to Cloud Computing [11]. Fog Computing enables a new breed of applications and services, and there is a fruitful interplay between the Cloud and the Fog, particularly when it comes to data management and analytics. IoT systems that use Fog Computing have fog nodes at the intermediate levels of the system (Fig. 2), where the data analysis is performed without the data being sent to the centralized storage.

Such architectures are popular because they do not have the drawbacks described earlier. However, neither existing cloud analytical services nor systems that perform scalable data analysis can be used in such systems. The suggested approach and its implementation on the actor model solve this problem.

3 The Essence of the Approach

3.1 Presentation of Data Mining Algorithm as a Set of Functional Blocks

According to [12, 13], a data mining algorithm can be written as a sequence of functional blocks (based on the principle of functional programming). A data mining algorithm can be presented as a sequence of function calls:

$$dma = f_n(d, f_{n-1}(d, \ldots f_i(d, \ldots f_1(d, m) \ldots) \ldots)), \qquad (1)$$

[Figure 2. Labels: Devices layer, Data gather layer, Data analysis services, IoT system, Fog nodes]

Fig. 2. IoT system with fog nodes


where $f_i$ is a function that analyses the input data set $d$ of type $D$ and changes the mining model $m$ of type $M$. This function is called a functional block. It is of the type:

$$FB :: D \to M \to M,$$

where

– $D$ is the input data set that is analyzed by the functional block,
– $M$ is the mining model that is built by the functional block.

Note that not all of the functional blocks of a data mining algorithm need to use the data:

$$f^c_i(d, m) = f^c_i(nil, m)$$

Such blocks are called calculation functional blocks. Accordingly, the blocks that use the data:

$$f^f_b(d, m) \neq f^f_b(nil, m)$$

are called processing functional blocks. Thus, if the algorithm is represented as a set of functional blocks:

$$A = \{f_1, f_2, \ldots, f_i, \ldots, f_n\},$$

it is possible to divide the set into two subsets depending on the functional block’s type:

$$A = A^c \cup A^f = \{f^c_1, f^c_2, \ldots, f^c_i, \ldots, f^c_v\} \cup \{f^f_1, f^f_2, \ldots, f^f_b, \ldots, f^f_w\}.$$

A data mining algorithm is also a functional block, since according to (1) it can be presented as a composition of functional blocks:

$$dma = f_n \circ f_{n-1} \circ \ldots \circ f_i \circ \ldots \circ f_1.$$
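The type $FB :: D \to M \to M$ and the composition in (1) can be illustrated with a minimal Python sketch. The representations of the data set and the model, and the names `compose`, `count_rows`, `mark_done` and `is_calculation_block`, are assumptions made for this illustration; they are not taken from the DXelopes library.

```python
from functools import reduce
from typing import Any, Callable, Dict, List

Data = List[Dict[str, Any]]   # assumed representation of the data set d
Model = Dict[str, Any]        # assumed representation of the mining model m
FunctionalBlock = Callable[[Data, Model], Model]


def compose(*blocks: FunctionalBlock) -> FunctionalBlock:
    """Build f_n . ... . f_1 from blocks listed in execution order f_1, ..., f_n."""
    def composed(d: Data, m: Model) -> Model:
        # Every block receives the same data set and the model built so far.
        return reduce(lambda model, block: block(d, model), blocks, m)
    return composed


def count_rows(d: Data, m: Model) -> Model:
    """Processing block (hypothetical): the result depends on the data set."""
    return {**m, "rows": len(d)}


def mark_done(d: Data, m: Model) -> Model:
    """Calculation block (hypothetical): only the model is used, d is ignored."""
    return {**m, "done": True}


def is_calculation_block(block: FunctionalBlock, d: Data, m: Model) -> bool:
    """Check f(d, m) == f(nil, m) on one sample; nil is modelled as an empty list."""
    return block(d, m) == block([], m)
```

Here `compose(count_rows, mark_done)` is itself a functional block, which mirrors the statement that a whole algorithm is a composition of blocks. The check in `is_calculation_block` is only a spot test on one input, not a proof that a block never reads the data.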

The different flowchart structures (decisions, loops and others) can also be presented as functional blocks [14]. For example, we rewrite the 1R algorithm [15] as a set of functional blocks:

• the conditional function, which checks whether the current attribute is a target attribute. If so, it calls a composition of two functional blocks:
– addingOneRule: adding a new rule to the mining model,
– incrementOneRule: incrementing the count of vectors validated for this rule,
isCurrAttrTarget = if cf(d, m) then addingOneRule ∘ incrementOneRule, where
– cf: the function that calculates the conditional expression;

• the loop function, which calls the functional block isCurrAttrTarget for all the attributes of the current vector:
attrsCycle = loop'(d, finitA(d, m), cfA, fpreA, isCurrAttrTarget), where
– finitA: the function that initializes an attribute counter with the first index of the list of attributes,
– cfA: the conditional function that checks whether all the attributes have been processed,
– fpreA: the preprocessing function that changes the attribute counter by assigning the index of the next attribute;

• the vectors cycle function, which calls the functional block attrsCycle for all vectors:
vectorsCycle = loop'(d, finitW(d, m), cfW, fpreW, attrsCycle), where
– finitW: the function that initializes a vector counter with the first index of the list of vectors,
– cfW: the conditional function that checks whether all the vectors have been processed,
– fpreW: the preprocessing function that changes the vector counter by assigning the index of the next vector;

• the loop function, which for all values of the target attribute calls the functional block selectBetterScoreRule (selection of the rule with minimal error for the current value of the target attribute):
targetsValuesCycle = loop'(d, finitT(d, m), cfT, fpreT, selectBetterScoreRule), where
– finitT: the function that initializes a class counter with the first index of the list of classes,
– cfT: the conditional function that checks whether all the values of the target attribute (classes) have been processed,
– fpreT: the preprocessing function that changes the class counter by assigning the index of the next class;

• the cycle function, which calls the functional block targetsValuesCycle for all rules:
rulesCycle = loop'(d, finitR(d, m), cfR, fpreR, targetsValuesCycle), where
– finitR: the function that initializes a rules counter with the first index of the list of rules,
– cfR: the conditional function that checks whether all the rules have been processed,
– fpreR: the preprocessing function that changes the rules counter by assigning the index of the next rule.

Thus, the 1R algorithm can be presented as a composition of two functional blocks:

$$1R = rulesCycle \circ vectorsCycle \qquad (2)$$

The functional block rulesCycle does not require the presence of the data set and is a calculation functional block. The vectorsCycle block processes data and is a processing functional block.
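The split in (2) can be made concrete with a simplified Python sketch of 1R. It keeps only the two top-level blocks and folds the inner loops (attrsCycle, targetsValuesCycle) into ordinary Python loops, so it is a sketch under stated assumptions rather than the decomposition used in the paper's library. The model layout (a dictionary with `target`, `counts`, `attribute`, `rules`, `errors`) is hypothetical.

```python
from collections import Counter, defaultdict


def vectors_cycle(d, m):
    """Processing block: scans every vector and counts (attribute, value, class)."""
    counts = Counter(m.get("counts", {}))
    target = m["target"]
    for vector in d:                          # loop over vectors
        for attr, value in vector.items():    # loop over attributes
            if attr != target:
                counts[(attr, value, vector[target])] += 1
    return {**m, "counts": counts}


def rules_cycle(d, m):
    """Calculation block: works on the counts in the model and never reads d."""
    best = defaultdict(dict)    # attribute -> value -> (majority class, count)
    totals = Counter()          # attribute -> number of counted vectors
    for (attr, value, cls), n in m["counts"].items():
        totals[attr] += n
        if value not in best[attr] or n > best[attr][value][1]:
            best[attr][value] = (cls, n)
    # Error of an attribute = vectors not covered by its majority-class rules.
    errors = {
        attr: totals[attr] - sum(n for _, n in by_value.values())
        for attr, by_value in best.items()
    }
    chosen = min(sorted(errors), key=errors.get)
    rules = {value: cls for value, (cls, _) in best[chosen].items()}
    return {**m, "attribute": chosen, "rules": rules, "errors": errors[chosen]}


def one_r(d, m):
    """1R = rulesCycle . vectorsCycle; rules_cycle is deliberately given no data."""
    return rules_cycle(None, vectors_cycle(d, m))
```

Passing `None` as the data set to `rules_cycle` makes the point of the classification explicit: only `vectors_cycle` has to run where the data is stored.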

3.2 Conversion of a Data Mining Algorithm into Parallel Form

According to the Church-Rosser theorem [12], the reduction (execution) of functional expressions (algorithms) can be done concurrently. For this, expression (1) has to be transformed into a representation in which the functional blocks are passed as arguments. For this purpose a function parallel, which takes care of data parallelization in the algorithms, has been added [16].


Using the parallel function, different parallelized forms of one data mining algorithm can be created. For example, the 1R algorithm can be converted into the following parallel forms:

• with parallel processing of the data sets by the vectors:

$$vectorsCycleParall = parallel(d, m, vectorsCycle)$$
$$1RVectorsCycleParallel = rulesCycle \circ vectorsCycleParall \qquad (3)$$

• with parallel processing of the data sets by the attributes:

$$attrCycleParall = parallel(d, m, attrsCycle)$$
$$vectorsCycle = loop'(d, f_{initW}(d, m), cf_W, f_{preW}, attrCycleParall)$$
$$1RVectorsCycleParallel = rulesCycle \circ vectorsCycle \qquad (4)$$
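A data-parallel `parallel` function in the spirit of (3) could look like the following Python sketch. The slicing scheme, the `merge` argument and the two helper blocks are assumptions for illustration; the source does not specify how the partial models are combined. Threads are used only to show the structure, not to demonstrate a speed-up.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_classes(d, m):
    """Processing block (hypothetical): counts class labels in its share of d."""
    return {"counts": m["counts"] + Counter(row["cls"] for row in d)}


def merge_counts(models):
    """Assumed merge step: sums the counters built by the parallel branches."""
    total = Counter()
    for model in models:
        total += model["counts"]
    return {"counts": total}


def parallel(d, m, block, n_parts=2, merge=merge_counts):
    """Run `block` on n_parts slices of d concurrently and merge the models."""
    parts = [d[i::n_parts] for i in range(n_parts)]
    empty = {"counts": Counter()}   # each branch starts from an empty model
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partial = list(pool.map(lambda part: block(part, empty), parts))
    # The incoming model m is merged once, together with the branch results.
    return merge([m] + partial)
```

For a block whose model is additive, such as these counts, the parallel form returns the same model as the sequential call, which is the property that lets vectorsCycle be replaced by vectorsCycleParall.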

3.3 Mapping a Data Mining Algorithm onto an IoT System with Fog Nodes

The IoT system can be represented as a union of two sets of nodes:

$$S = C \cup F = \{n^c_0, n^c_1, \ldots, n^c_p, \ldots, n^c_u\} \cup \{n^f_0, n^f_1, \ldots, n^f_q, \ldots, n^f_z\},$$

where

– $n^c_p$ is a computing node of the system that does not store data and is used to perform the analysis services (located at the data analysis services level in Fig. 2),
– $n^f_q$ is a node of the system that stores data and is used for preprocessing (fog nodes in Fig. 2).

To execute analysis algorithms in such systems, the actor model [17] has been proposed [18]. The execution environment based on the actor model can be represented as a set of actors:

$$E = \{r, a_0, a_1, a_2, \ldots, a_j, \ldots, a_g\},$$

where

– $r$ is the router, which distributes messages among actors,
– $a_0$ is the actor that carries out the main algorithm sequence,
– $a_1$–$a_g$ are the actors that carry out the parallel function of the algorithm.

Actors can execute functional blocks and therefore run a distributed execution of the data mining algorithm [18]. The described approach was implemented as the data mining library DXelopes [19]. The library has adapters for integration into actor environments [18].
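The roles of the router and the actors $a_1$–$a_g$ can be sketched with Python threads and queues. This is an illustrative toy, not the actor environment or the DXelopes adapters used by the authors: the classes `Actor` and `Router`, the round-robin policy and the function `run_on_actors` are all assumptions, and the calling code plays the part of the main actor $a_0$.

```python
import queue
import threading


class Actor(threading.Thread):
    """An actor owns a mailbox and applies its functional block to each message."""

    def __init__(self, name, block, results):
        super().__init__(name=name, daemon=True)
        self.mailbox = queue.Queue()
        self.block = block
        self.results = results

    def run(self):
        while True:
            message = self.mailbox.get()
            if message is None:          # poison pill: stop the actor
                break
            d, m = message
            self.results.put((self.name, self.block(d, m)))


class Router:
    """Distributes messages among actors in round-robin order (assumed policy)."""

    def __init__(self, actors):
        self.actors = actors
        self._next_index = 0

    def send(self, message):
        actor = self.actors[self._next_index % len(self.actors)]
        actor.mailbox.put(message)
        self._next_index += 1


def run_on_actors(block, parts, m, n_actors=2):
    """Send one (part, model) message per data part and collect the models."""
    results = queue.Queue()
    actors = [Actor(f"a{j + 1}", block, results) for j in range(n_actors)]
    for actor in actors:
        actor.start()
    router = Router(actors)
    for part in parts:
        router.send((part, m))
    models = [results.get()[1] for _ in parts]
    for actor in actors:
        actor.mailbox.put(None)
    for actor in actors:
        actor.join()
    return models
```

The sketch has no error handling: if a block raises an exception, the collecting loop waits forever. The order of the returned models depends on thread scheduling, so callers should merge them with an order-independent operation.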

The actor environment was used to create the prototype of the distributed IoT system with fog nodes. Mapping actors to the nodes of the system divides the set of actors into two subsets, computing and processing:

$$E \to S = E \to (C \cup F) = (E^c \to C) \cup (E^f \to F) = \{(a^c_j, n^c_p) \mid a^c_j \in E,\ n^c_p \in C\} \cup \{(a^f_r, n^f_q) \mid a^f_r \in E,\ n^f_q \in F\}$$

The types of the functional blocks in the Data Mining algorithm can be reconsidered when mapping them to actors. The blocks that interact with data should be located at the processing actors and the blocks that do not interact with the data should be located at the computing actors:

$$A \to E \to S = A \to E \to (C \cup F) = (A^c \to E^c \to C) \cup (A^f \to E^f \to F) = \{(f^c_i, a^c_j, n^c_p) \mid f^c_i \in A,\ a^c_j \in E,\ n^c_p \in C\} \cup \{(f^f_b, a^f_r, n^f_q) \mid f^f_b \in A,\ a^f_r \in E,\ n^f_q \in F\} \qquad (5)$$
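The mapping in (5) can be read as a placement rule that produces (block, actor, node) triples. The Python sketch below is one possible reading: it assumes that a processing block is replicated on every fog node and that calculation blocks are spread over the computing nodes in round-robin order. The function `place_blocks`, the actor naming and both policies are hypothetical.

```python
from itertools import cycle


def place_blocks(blocks, computing_nodes, fog_nodes):
    """Return (block, actor, node) triples that follow the split in (5).

    `blocks` maps a block name to True if the block uses the data
    (processing block) and False otherwise (calculation block).
    """
    placement = []
    computing = cycle(computing_nodes)
    actor_id = 0
    for name, uses_data in blocks.items():
        # Processing blocks go to every node that stores data;
        # calculation blocks go to the next computing node.
        targets = fog_nodes if uses_data else [next(computing)]
        for node in targets:
            placement.append((name, f"a{actor_id}", node))
            actor_id += 1
    return placement
```

With the two top-level blocks of 1R, rulesCycle ends up on a computing node and one copy of vectorsCycle is created per fog node.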

Information on the fog nodes can be distributed horizontally or vertically. If the distribution is horizontal, the data recorded by the sensors at each node has the same metadata but is related to different objects. For example, there can be sensors for pressure, temperature, humidity, etc. that measure the parameters of similar objects but are located in different regions.

If the distribution is vertical, the data recorded at each node is related to one or several parameters. Thus the data stored at each node has different metadata but is usually related to a single object. In this case, the synchronization can be achieved through comparison of timestamps.
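The two distribution types can be shown on a tiny data set. In this Python sketch the row layout, the attribute names and the three helper functions are assumptions for illustration; the join on `timestamp` stands in for the timestamp comparison mentioned above.

```python
def split_horizontally(rows, n_nodes):
    """Each node gets whole rows: same attributes, different objects."""
    return [rows[i::n_nodes] for i in range(n_nodes)]


def split_vertically(rows, attribute_groups, key="timestamp"):
    """Each node gets some attributes of every row, plus the key for re-joining."""
    return [
        [{k: row[k] for k in (key, *group)} for row in rows]
        for group in attribute_groups
    ]


def join_on_key(parts, key="timestamp"):
    """Synchronize vertically distributed parts by matching key values."""
    joined = {}
    for part in parts:
        for row in part:
            joined.setdefault(row[key], {}).update(row)
    return [joined[k] for k in sorted(joined)]
```

A horizontal split suits parallelization by vectors, since every node holds complete vectors, while a vertical split suits parallelization by attributes, since every node holds complete columns.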

For both cases, the suggested approach makes it easy to transform sequential algorithms into forms with parallel processing of attributes or data vectors. The functional blocks of the algorithm that process data can be moved to the fog nodes that store the information, according to (5).

4 Experiments

Experiments for centralized and distributed IoT systems have been carried out. Apache Spark MLlib was used for the centralized approach. It was deployed on high-performance servers that support hardware virtualization and make cloud computing possible. The following components of the computing cluster infrastructure were used for the experiments:

• two servers with the following characteristics:
– CPU: Intel Xeon 2.9 GHz (2 CPUs with 6 cores each, 2 threads per core, 24 threads per server in total), RAM: 128 GB,
– CPU: Power7 3.3 GHz (2 CPUs with 4 cores each, 4 threads per core, 32 threads per server in total), RAM: 128 GB,
• two Storage System Storwize v700 units with 13.6 TiB.

For the distributed approach, actors with the functional blocks of the 1R algorithm were distributed between the nodes of the system. As an example, a specific parallel form of the 1R algorithm was used for each of the data distribution types (Fig. 3). The 1R algorithm was parallelized by vectors (3) for a system with a horizontal distribution (Fig. 3a) and by attributes (4) for a system with a vertical distribution (Fig. 3b).


The calculation functional blocks of the algorithm were run on the same cluster as Spark MLlib. Fog nodes were created as virtual machines on a high-performance server with the following characteristics:

• Huawei FusionServer RH2288 V3 Rack Server with 2 Intel® Xeon® E5-2600 v3/v4 processors and 128 GB of RAM.

We performed experiments for systems with 2, 4 and 8 fog nodes. Each of these nodes stored an equal part of the data set. The data sets from Azure ML were used for the experiments. The parameters of the data sets are presented in Table 1.

The experimental results are provided in Table 2. Data loading time and analysis time were measured separately for the centralized systems. Therefore the total analysis time in such systems is the sum of the loading time and the analysis time. The results of the experiments show that the total analysis time in such systems is higher than in distributed systems. The reasons for this are:

[Figure 3. Labels: Devices layer, Data gather layer, Data mining services, Fog nodes, Cloud; (a) rulesCycle, vectorsCycle (two instances); (b) rulesCycle, vectorsCycle, parallel, attrCycleParall (two instances)]

Fig. 3. Using the actors for execution of the 1R algorithm in an IoT system with fog nodes: (a) horizontal data distribution, (b) vertical data distribution

Table 1. Experimental datasets

| Input data set | Number of rows | Number of attributes | Size of data (Kb) |
|---|---|---|---|
| Iris Two Class Data (ITCD) | 100 | 4 | 2 |
| Telescope data (TD) | 19 020 | 10 | 1 499 |
| Breast Cancer Info (BCI) | 102 294 | 5 | 4 832 |
| Movie Ratings (MR) | 227 472 | 4 | 6 055 |
| Flight on-time performance (Raw) (FOTP) | 504 397 | 5 | 39 555 |
| Flight Delays Data (FDD) | 2 719 418 | 5 | 136 380 |


• the calculations are moved closer to the data, so no time is spent on transferring the data between the nodes,

• the structure of the algorithm is optimized according to the distribution (horizontal or vertical).

Additionally, the network traffic between the end devices and the cloud has been decreased.
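The comparison can be checked directly from the values reported in Table 2. The short Python sketch below recomputes the total time of the centralized system as loading time plus local analysis time and subtracts the distributed times; only the variable and function names are introduced here, the numbers are those of Table 2 (in seconds).

```python
datasets = ["ITCD", "TD", "MR", "FOTP", "FDD"]

# Values copied from Table 2 (seconds).
loading = [1, 1, 2, 5, 11]
local_analysis = [4, 7, 14, 19, 72]
horizontal = [4, 7, 14, 20, 74]
vertical = [5, 5, 15, 22, 77]

# Total time of the centralized system: data loading plus local analysis.
centralized_total = [load + run for load, run in zip(loading, local_analysis)]


def saving(distributed):
    """Seconds saved by a distributed run compared with the centralized total."""
    return [c - t for c, t in zip(centralized_total, distributed)]


savings = {
    "horizontal": dict(zip(datasets, saving(horizontal))),
    "vertical": dict(zip(datasets, saving(vertical))),
}
```

The recomputed totals match the "Data loading and analysis" row of Table 2. On these numbers the distributed runs are never slower than the centralized total; the smallest data set (ITCD) with vertical distribution shows no difference, and the largest saving is 9 s for FDD with horizontal distribution.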

5 Conclusion

An IoT system can have a centralized or a distributed architecture. Most of the existing data mining solutions for IoT can work only within a centralized architecture. However, the distributed architecture (Fog Computing) has become more popular since it decreases network traffic in a network with a large number of endpoint devices.

The paper describes an approach to building analytical services in IoT systems with a distributed architecture. The approach uses distributed data analysis based on an actor model. The decomposition of the algorithm into functional blocks and their mapping to actors make it possible to distribute the calculations between the nodes of an IoT system, with the following advantages:

• moving the computing blocks that process the data to the nodes that store the information, thus increasing data processing speed and decreasing network traffic,

• optimizing the structure of the algorithm depending on the data distribution (horizontal or vertical) and thus increasing data processing speed (parallelization).

The performed experiments demonstrated the efficiency of the suggested approach. The execution time of an algorithm that is distributed between the nodes of an IoT system with regard to the data storage places and the distribution type is less than in IoT systems with a centralized architecture (both the ones that use cloud analytical services and the ones based on scalable data analysis platforms). In the future we plan to propose automated methods for estimating and distributing the functional blocks and the actors in IoT systems with a distributed architecture.

Table 2. Experimental results (s)

| IoT system | Action | ITCD | TD | MR | FOTP | FDD |
|---|---|---|---|---|---|---|
| - | Data set loading time | 1 | 1 | 2 | 5 | 11 |
| Centralized with Spark MLlib | Local data analysis | 4 | 7 | 14 | 19 | 72 |
| Centralized with Spark MLlib | Data loading and analysis | 5 | 8 | 16 | 24 | 83 |
| Distributed | With horizontal distribution | 4 | 7 | 14 | 20 | 74 |
| Distributed | With vertical distribution | 5 | 5 | 15 | 22 | 77 |

Acknowledgments. The work has been performed at the Saint Petersburg Electrotechnical University “LETI” within the scope of the contract of the Board of Education of Russia and science of the Russian Federation, contract № 02.G25.31.0058 from 12.02.2013. The paper has been prepared within the scope of the state project “Organization of scientific research” of the main part of the state plan of the Board of Education of Russia and the project part of the state plan of the Board of Education of Russia (task 2.136.2014/K), and was supported by a grant of RFBR (project 16-07-00625).

References

1. Evans, D.: The internet of things. How the next evolution of the internet is changing everything. Cisco White Paper, Cisco Systems (2011)

2. Gartner Says the Internet of Things Installed Base Will Grow to 26 Billion Units By 2020. Gartner (2014)

3. Tsai, C.-W., Lai, C.-F., Vasilakos, A.V.: Future internet of things: open issues and challenges. Wireless Netw. 20(8), 2201–2217 (2014)

4. Chen, F., Deng, P., Wan, J., Zhang, D., Vasilakos, A.V., Rong, X.: Data mining for the internet of things: literature review and challenges. Int. J. Distrib. Sens. Netw. 2015, 14 (2015). Article ID 431047. http://dx.doi.org/10.1155/2015/431047

5. Gubbi, J., Buyya, R., Marusic, S., Palaniswami, M.: Internet of things (IoT): a vision, architectural elements, and future directions. Future Gener. Comput. Syst. 29, 1645–1660 (2013)

6. Gronlund, C.J.: Introduction to machine learning on Microsoft Azure, 18 April 2016. https://azure.microsoft.com/en-gb/documentation/articles/machine-learning-what-is-machine-learning/

7. Barr, J.: Amazon Machine Learning – Make Data-Driven Decisions at Scale. Amazon Machine Learning, 18 April 2016. https://aws.amazon.com/ru/blogs/aws/amazon-machine-learning-make-data-driven-decisions-at-scale/

8. Google Cloud Machine Learning at Scale, 18 April 2016. https://cloud.google.com/products/machine-learning/

9. Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman, S., Liu, D., Freeman, J., Tsai, D., Amde, M., Owen, S., Xin, D.: MLlib: Machine learning in Apache Spark (2015). arXiv preprint arXiv:1505.06807

10. Ingersoll, G.: Introducing Apache Mahout. Scalable, commercial-friendly machine learning for building intelligent applications. IBM (2009)

11. Bonomi, F., Milito, R., Zhu, J., Addepalli, S.: Fog computing and its role in the internet of things. In: Proceedings of MCC, 17 August 2012, Helsinki, Finland, pp. 13–16 (2012)

12. Church, A., Barkley Rosser, J.: Some properties of conversion. Trans. AMS 39, 472–482 (1936)

13. Kholod, I., Petukhov, I.: Creation of data mining algorithms as functional expression for parallel and distributed execution. In: Malyshkin, V. (ed.) PaCT 2015. LNCS, vol. 9251, pp. 62–67. Springer, Heidelberg (2015)

14. Kholod, I., Kupriyanov, M., Shorov, A.: Decomposition of data mining algorithms into unified functional blocks. Math. Probl. Eng. 2016, 11 (2016). Article ID 8197349

15. Holte, R.C.: Very simple classification rules perform well on most commonly used datasets. Mach. Learn. 11, 63–90 (1993)

16. Kholod, I., Kuprianov, I., Petukhov, A.: Parallel and distributed data mining in cloud. In: Perner, P. (ed.) ICDM 2016. LNCS, vol. 9728, pp. 349–362. Springer, Heidelberg (2016). doi:10.1007/978-3-319-41561-1

17. Hewitt, C., Bishop, P., Steiger, R.: A universal modular actor formalism for artificial intelligence. In: IJCAI, pp. 235–245 (1973)



18. Kholod, I., Petuhov, I., Kapustin, N.: Creation of data mining cloud service on the actor model. In: Balandin, S., Andreev, S., Koucheryavy, Y. (eds.) NEW2AN/ruSMART 2015. LNCS, vol. 9247, pp. 585–598. Springer, Heidelberg (2015)

19. Kholod, I.: Framework for multi threads execution of data mining algorithms. In: Proceeding of 2015 IEEE North West Russia Section Young Researchers in Electrical and Electronic Engineering Conference (2015 ElConRusW), pp. 74–80. IEEE Xplore (2015)
