Welcome Message from the Chairs

The Steering Committee and the Programme Committee have the pleasure of welcoming attendees to the fourth edition of the International Conference on Innovative Computing Technology (INTECH 2014), which is dedicated to addressing the challenges in computing technologies for the new generation.
Continuing with the experience provided by the last three editions, INTECH 2014 emphasizes the need for producing newer aspects of computing technologies with a distinct focus on innovation. The conference embraces the latest innovative computing technology and allows attendees to discuss their experience of these newer aspects. INTECH features a full-fledged paper submission, review, and publication process that adheres to the high standards defined by IEEE. The INTECH 2014 conference explores advances in computing technologies that address new themes and innovative applications. It brings together researchers from various specialities in computer and information sciences who address both theoretical and applied aspects of computing technology and its applications. We hope that the discussions and exchange of ideas will contribute to advancements in the technology in the near future.
The conference received 169 submissions, of which 41 were accepted, an acceptance rate of approximately 24%. The accepted papers are authored by researchers from many countries and cover many significant areas of computing technology. Each paper was evaluated by at least three reviewers.
Finally, we hope that the conference fulfils your expectations and that the proceedings document the best research in the studied areas. We express our thanks to the IEEE UK & RI, the Society for Information Organization, UK, the DLINE database, the authors, and the organizers of the conference.
General Chair
Ezendu Ariwa, Bedfordshire University, UK

Program Chairs
Simon Fong, University of Macau, Macau
Ching-Hsien Hsu, Chung Hua University, Taiwan
Aziz El Janati El Idrissi, Mohammed V Agdal University, Morocco

Program Co-chairs
Yuxin Mao, Zhejiang Gongshang University, China
Yousef Ibrahim, Misurata University, Libya
Copyright and Reprint Permission: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy beyond the limit of U.S. copyright law for private use of patrons those articles in this volume that carry a code at the bottom of the first page, provided the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA
Contents
WWW Applications and Technologies
Computational Intelligence, Soft Computing and Optimization algorithms
Data and Network Mining
XML and other Extensible Languages
Network and Information Security
Cloud Computing
Applied Information Systems
Mobile Network and Systems
Data Stream Processing, Mobile/Sensor Networks and Signal Processing
WWW Applications and Technologies
Finite State Machine based Flow Analysis for WebRTC Applications
Sergej Alekseev, Christian von Harscher, Marco Schindler

Enrichment of Learner Profile through the Semantic Analysis of the Learner’s Query on the Web
Samia Ait Adda, Balla Amar

Healthcare-Event Driven Semantic Knowledge Extraction with Hybrid Data Repository
Hong Qing Yu, Xia Zhao, Xin Zhen, Feng Dong, Enjie Liu, Gordon Clapworthy
Computational Intelligence, Soft Computing and Optimization algorithms
ANFIS Based a Two-Phase Interleaved Boost Converter for Photovoltaic System
Donny Radianto, Masahito Shoyama

Scheduling Algorithms for Video Multiplexers in Surveillance Systems
Kuan Jen Lin, Tsai Kun Hou

Cost Optimization based on Brain Storming for Grid Scheduling
Maria Arsuaga-Rios, Miguel A. Vega-Rodriguez

A Study on the Development of Diagnosis Algorithm and Application Program for Early Diagnosis of Cervical Cancer using Cervix Cell
Han Yeong Oh, Seong Hyun Kim, Dong Wook Kim

Easy to Calib: Auto-Calibration of Camera from Sequential Images Based on VP and EKF
Yu Song, Fei Wang, Haiwei Yang, Sheng Gao

Image Imputation Based on Clustering Similarity Comparison
Sathit Prasomphan
Data and Network Mining
Volume Based Anomaly Detection using LRD Analysis of Decomposed Network Traffic
Khan Zeb, Basil AsSadhan, Jalal Al-Muhtadi, Saleh Alshebeili, Abdulmuneem Bashaiwth

Design and Implementation of Data Warehouse with Data Model using Survey-based Services Data
Boon Keong Seah, Nor Ezam Selan

Evaluating Textual Approximation to Classify Moving Object Trajectories
Huy Xuan Do, Hung-Hsuan Huang, Kyoji Kawagoe
XML and other Extensible Languages
An XSLT Transformation Method for Distributed XML
Hiroki Mizumoto, Nobutaka Suzuki

A Scalable XML Indexing Method Using MapReduce
Wen-Chiao Hsu, Hsiao-Chen Shih, I-En Liao
Network and Information Security
A Methodology of Assessing Security Risk of Cloud Computing in User Perspective for Security-Service-Level Agreements
Sang-Ho Na, Eui-Nam Huh

Improving the Performance of Network Traffic Prediction for Academic Organization by Using Association Rule Mining
Dulyawit Prangchumpol

Risks in Smart Environments and Adaptive Access Controls
Mariagrazia Fugini, Mahsa Teimourikia

MAR(S)2: Methodology to Articulate the Requirements for Security In SCADA
Tanaya Gopal, Madhuri Subbaraju, Rashmi Vivek Joshi, Somdip Dey
Cloud Computing
A Rule-based Data Grouping Method for Personal Log Analysis System in Big Data Computing
Yong-Hyun Kim, Eui-Nam Huh

Optimization for Reasonable Service Price in Broker based Cloud Service Environment
Young-Rok Shin, Eui-Nam Huh
A Rule-based Data Grouping Method for Personalized Log Analysis System in Big Data Computing

Yong-Hyun Kim and Eui-Nam Huh
Dept. of Computer Engineering, Kyung Hee University, Yongin, South Korea
Abstract—Nowadays, providing personalized service to customers is one of the main issues in big data services. To provide personalized service, analyzing various logs and cooperation between data analysts and developers are critical. However, overhead can occur when log data is analyzed, owing to the general characteristics of big data systems known as the 4Vs (Volume, Velocity, Variety, and Value). It is also generally hard for data analysts and developers to work together, because they use different interfaces. We therefore propose a personalized log analysis system that includes a rule-based data grouping method, in order to improve the performance of personalized log analysis and enable more flexible cooperation between data analysts and developers. The evaluation shows that the proposed system performs well for cooperation and grouping together with the R software tool.

Index Terms—Personalized log analysis, Big Data, NoSQL, R, log analysis
I. INTRODUCTION
Recently, the big data trend has been moving toward providing personalized services to customers. In other words, the main issue in big data is how customized information can be provided to customers using internal structured or semi-structured data, such as logs, and external non-structured data, such as SNS data [1]. To provide a personalized advertisement service, for example, an advertising agency might collect and analyze logs that can be interpreted as a customer's habitual behavior, such as email opening times and advertisement link click times. Efficient processing and analysis of various log data in a big data system are therefore the most important aspects of a personalized big data service.
However, many problems impede efficiency, such as differences between information schemas and analysis systems that are not optimized for analysts. Regarding the various schemas of log data, most IT companies use a relational database such as MySQL, which makes a big data system difficult to manage and analyze [2]. A relational database has tables with fixed fields (also called a schema). If log data with an unknown schema, such as non-structured data, is collected, the relational database has to create a new table or force the log data roughly into an existing schema. This causes overhead and can lead to inaccurate analysis. In personalized log analysis, log data must be considered from several aspects and business purposes, as distinct from general statistical analysis [3,4]. Companies thus need a database that can handle customer log data with various schemas. In this paper, we use MongoDB, a Not Only SQL (NoSQL) database, to manage the various schemas of log data, because NoSQL supports non-structured data and can execute queries with high performance [5].
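The schema mismatch described above can be sketched in a few lines. This is not the authors' code: the field names and helper functions are invented for illustration, and plain Python lists stand in for a relational table and a MongoDB collection.

```python
# A fixed relational schema rejects logs of an unknown shape, while a
# schema-free document store (e.g. a MongoDB collection) accepts them as-is.

FIXED_SCHEMA = {"time", "email", "result"}  # columns of a relational table

def insert_relational(table, row):
    """A relational table only accepts rows that match its fixed schema."""
    if set(row) != FIXED_SCHEMA:
        raise ValueError("unknown schema: would need a new table or forced mapping")
    table.append(row)

def insert_document(collection, doc):
    """A document store accepts documents of any shape in one collection."""
    collection.append(doc)

table, collection = [], []
send_log = {"time": "09:12", "email": "a@b.com", "result": "ok"}
web_log = {"time": "09:30", "email": "a@b.com", "link": "ad42", "read_secs": 14}

insert_relational(table, send_log)      # fits the fixed schema
insert_document(collection, send_log)   # both shapes fit one collection
insert_document(collection, web_log)

try:
    insert_relational(table, web_log)   # unknown fields: schema conflict
except ValueError as err:
    print(err)
```

With a real MongoDB deployment, the two `insert_document` calls would correspond to `db.logs.insert_one(...)` on a single collection, with no prior schema declaration.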
Moreover, analysis accuracy is relatively low owing to the lack of the professional analysis tools commonly used by analysts, such as SAS and R. Data management systems such as MySQL and Hadoop are generally used for statistical log analysis in existing systems, but a data management system alone is not enough for personalized analysis. Trying diverse analyses is important for determining a suitable analysis method in personalized analysis. Therefore, if a big data analysis system provides an R interface in cloud computing, a data analyst can perform various analyses based on previous experience and analyze log data with advanced analysis methods rather than Hadoop-based analysis alone. Future big data analysis should consider the above issues and prepare both NoSQL and professional analysis tools to handle the various schemas of log data.
In this study, we propose a MongoDB- and R-based Personalized Log Analysis System (PLAS) and a rule-based data grouping method that improve the performance of personalized log analysis by solving the above problems. The PLAS with the data grouping method can handle various schemas of log data with low overhead and supports diverse analyses through an R interface for personalized service. We apply the proposed system to a real email-based online marketing system and perform an experiment using practical log data from the online marketing company Bizisolution in Korea.
The remainder of this paper is organized as follows: Section II describes related work on NoSQL and data pre-processing. Section III proposes the PLAS and the rule-based data grouping method. A numerical experiment for performance evaluation is presented in Section IV. Finally, Section V summarizes this study and outlines future work.

978-1-4799-4233-6/14/$31.00 ©2014 IEEE
II. RELATED RESEARCH
A. NoSQL
NoSQL handles various data types, including structured, semi-structured, and non-structured data. One of its key advantages is high scalability. NoSQL databases can be scaled according to their classification: key-value store, column store, document store, and graph store [6]. A key-value store manages data as key-value pairs. A column store manages fields and values as key-value data. A document store manages any kind of data as a document. A graph store manages paired data including the relations among the data. To construct an optimized analysis system, developers need to select one of these NoSQL types according to the log data type and analysis method.
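The four store types above can be illustrated by representing one and the same log event in each model. The event, identifiers, and field names below are invented for illustration; plain Python structures stand in for the actual stores.

```python
# One log event represented in the four NoSQL models named above.

event = {"user": "a@b.com", "action": "click", "link": "ad42"}

# Key-value store: one opaque value per key.
kv = {"log:1001": str(event)}

# Column store: fields and values kept as key-value pairs under a row key.
column = {"1001": {"user": "a@b.com", "action": "click", "link": "ad42"}}

# Document store: the whole event stored as one document (MongoDB-style).
document = {"_id": 1001, **event}

# Graph store: nodes plus an explicit relation between them.
graph = {
    "nodes": ["a@b.com", "ad42"],
    "edges": [("a@b.com", "clicked", "ad42")],
}
```

The document form is the one the PLAS relies on, since a whole heterogeneous log record fits in a single self-describing unit.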
B. MongoDB and R
MongoDB is a well-known document-based NoSQL database, and some researchers have developed big data analysis systems using MongoDB in cloud computing [7,8]. MongoDB is characterized by collections and schema-freedom. A collection is a logical unit in a MongoDB database; in the proposed rule-based data grouping method, log data is grouped into collections. Developers can store BSON (Binary JSON, JavaScript Object Notation) log documents with various schemas. MongoDB offers high input/output performance, although its map/reduce performance is low.

R is a statistical computing language derived from the S language. About 5,600 packages are available in the CRAN package repository, and many analysts analyze data with various statistical R packages. Newly developed R algorithms and functions can easily be packaged using the provided functions. R is well suited to analysis in real-time systems because it always works in memory; on the other hand, R has difficulty processing large log data at once within limited memory space.

Owing to these pros and cons, we use MongoDB and R complementarily in the PLAS. Unlike R, MongoDB is inadequate for analysis, but it can manage various schemas of log data efficiently. In particular, MongoDB has high compatibility with R through packages such as 'rmongodb'. Therefore, we propose the PLAS based on MongoDB and R.
C. Data Pre-processing
Data pre-processing is considered one of the most important steps in data mining. It is classified into data refinement, data integration, data reduction, and data transformation. Data refinement handles missing values and noise. Data integration merges duplicated data. Data reduction reduces data size, for example by combining data into a data cube. Data transformation normalizes data, as general ETL (Extraction/Transformation/Loading) tools do. Data pre-processing should be performed before the analysis step to prevent errors and improve the performance and accuracy of data analysis [9].

Therefore, log data should be pre-processed to decrease analysis errors in the PLAS. However, if the PLAS pre-processes large log data at once, overhead can occur, because the pre-processing must read the log data from the database. We therefore propose a rule-based data grouping method that decreases this overhead by minimizing the query path distance.
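The pre-processing steps listed above can be sketched on toy log records. This is a minimal illustration, not the paper's implementation: the field names and default values are assumptions.

```python
# Three of the pre-processing steps applied to toy email-log records:
# refinement (missing values), integration (duplicates), reduction (cube).

logs = [
    {"email": "a@b.com", "read_secs": 14},
    {"email": "a@b.com", "read_secs": None},   # missing value
    {"email": "a@b.com", "read_secs": 14},     # exact duplicate
]

def refine(records, default=0):
    """Data refinement: fill missing values with a default."""
    return [{**r, "read_secs": r["read_secs"] if r["read_secs"] is not None else default}
            for r in records]

def integrate(records):
    """Data integration: drop exact duplicates while keeping order."""
    seen, out = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

def reduce_(records):
    """Data reduction: aggregate per customer (a tiny 'data cube')."""
    cube = {}
    for r in records:
        cube[r["email"]] = cube.get(r["email"], 0) + r["read_secs"]
    return cube

clean = integrate(refine(logs))
print(reduce_(clean))  # aggregated reading time per customer
```

Each function touches every record, which is why running these steps over the whole database at once is the overhead the rule-based grouping method is designed to avoid.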
III. PERSONALIZED LOG ANALYSIS SYSTEM (PLAS)
The proposed PLAS consists of a Dashboard, a Data Grouper, a Pre-processor, a Distributed Process Manager, and a Big Answer Verifier, which operate in a serial procedure from the dashboard to the big answer verifier. The flow diagram of the PLAS is shown in Fig. 1. In the dashboard, the analyst can configure and control the analysis method, the grouping rules, and MongoDB queries. Collected log data is then stored and analyzed by the serial processing. First, the data grouper stores collected log data according to the grouping rules. The pre-processor performs data mining and workflow optimization. Next, the distributed process manager allocates jobs to the analysis modules of each node using the optimized workflow. Finally, the big answer verifier decides whether the analysis result is a big answer.
A. System Architecture
1) Dashboard
The dashboard of the PLAS performs rule management, workflow management, and system monitoring. The data analyst manages rules for data grouping and log data analysis. A data grouping rule includes the attributes that serve as grouping criteria and the handling methods for missing values. An analysis rule includes analysis information such as the analysis method and the maximum node count. The data analyst also creates workflows in the dashboard. A workflow depends on the log data to be analyzed, the log type, the analysis purpose, the system environment, and so on. Workflow metadata, such as the total processing time, the analyzed log count, and the allocated node count, is stored to optimize the workflow in the pre-processing step.
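The two rule kinds and the workflow metadata described above might look like the following. The paper does not define a concrete rule format, so every field name here is our assumption.

```python
# Hypothetical shapes for the dashboard's grouping rule, analysis rule,
# and workflow metadata (field names invented for illustration).

grouping_rule = {
    "group_by": ["email"],               # attributes used as grouping criteria
    "missing_values": {"read_secs": 0},  # how to handle missing values per field
}

analysis_rule = {
    "method": "kmeans",                  # analysis method to apply
    "max_nodes": 4,                      # maximum node count for the job
}

workflow_metadata = {
    "total_processing_time": None,       # filled in after each run
    "analyzed_log_count": None,
    "allocated_node_count": None,
}
```

Keeping the rules as plain data like this is what lets the analyst edit them in the dashboard while the developers' grouping and analysis code consumes them unchanged.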
2) Data Grouper
Fig. 1. Personalized log analysis System Flow Diagram
The data grouper groups collected log data for personalized log analysis and stores the log data in MongoDB. Collected log data is handled as shown in Fig. 2. The data grouper handles various schemas of log data in a serial process: log parsing, rule matching, rule creation, group finding, and log data insertion. When unknown log data is collected, the data grouper creates temporary rules and groups, called collections in MongoDB, and stores the log data using the rule-based data grouping method.

Therefore, the data grouper stores collected log data according to the data grouping rules and supports high-performance log analysis. The rule-based data grouping method is described in more detail in Section III.C.
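The serial process just listed can be sketched end to end. This is a simplification under stated assumptions: the raw log format (`type|key=value|...`), the rule shape, and a dict of lists standing in for MongoDB collections are all invented for illustration.

```python
# Sketch of the data grouper's serial process: parse, match a rule, create
# a temporary rule for unknown log types, find the group, and insert.

rules = {"send": {"group_by": "email"}}   # known grouping rules, keyed by log type
collections = {}                          # stands in for MongoDB collections

def parse(raw):
    """Log parsing: 'type|k=v|k=v' into a dict (assumed raw format)."""
    log_type, *fields = raw.split("|")
    doc = dict(f.split("=", 1) for f in fields)
    doc["type"] = log_type
    return doc

def group_and_insert(raw):
    doc = parse(raw)                                  # log parsing
    rule = rules.get(doc["type"])                     # rule matching
    if rule is None:                                  # rule creation for unknown logs
        rule = rules[doc["type"]] = {"group_by": "email", "temporary": True}
    group = doc.get(rule["group_by"], "ungrouped")    # group finding
    collections.setdefault(group, []).append(doc)     # log data insertion

group_and_insert("send|email=a@b.com|result=ok")
group_and_insert("web|email=a@b.com|link=ad42")       # unknown type: temp rule created
```

After these two calls, both logs for the same customer land in one group, which is exactly the property the later query-path argument relies on.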
3) Pre-processor
The pre-processor performs data mining and workflow optimization. The data mining here is a general data pre-processing step; for example, the pre-processor checks the log data to handle missing values and noise. The workflow optimization adjusts the workflow according to the log type, the amount of log data, and the number of available analysis modules. The workflow includes the distributed analysis method, log data summarization, and a proper analysis method. Using workflow optimization, the PLAS can maintain optimized analysis performance.
4) Distributed Process Manager
The distributed process manager (DPM) manages the status of each node, such as its database list and R connections. The DPM performs database searching and scheduling: according to the optimized workflow, it searches the databases in which the logs are stored and schedules workflows for load balancing.

In this paper, we construct the DPM with MongoDB and R. MongoDB is used only for managing log data; distributed analysis is performed with R. Therefore, MongoDB and R are installed on each node and work complementarily for distributed analysis. To decrease the overhead of database searching, the DPM stores the database and collection list of each MongoDB instance. R processes can communicate in a server-client relation using the 'RServe' and 'RSclient' packages. Hence, we can construct a distributed analysis environment using MongoDB and R.
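The DPM's bookkeeping can be sketched as follows. The node names, status fields, and the round-robin policy are our assumptions; the paper specifies load balancing but not the exact algorithm.

```python
# Sketch of DPM bookkeeping: per-node status (cached collection list plus
# R connection state) and a simple round-robin schedule for load balance.
import itertools

nodes = {
    "node1": {"collections": {"a@b.com", "b@c.com"}, "r_connected": True},
    "node2": {"collections": {"c@d.com"}, "r_connected": True},
}

def find_node(collection):
    """Database searching via the cached list: which node holds the collection?"""
    for name, status in nodes.items():
        if collection in status["collections"]:
            return name
    return None

def schedule(jobs):
    """Round-robin scheduling of analysis jobs over R-connected nodes."""
    ready = [n for n, s in nodes.items() if s["r_connected"]]
    return {job: node for job, node in zip(jobs, itertools.cycle(ready))}

print(find_node("c@d.com"))
print(schedule(["job1", "job2", "job3"]))
```

Caching the collection list in `nodes` is what lets `find_node` answer without querying every MongoDB instance, which is the search overhead the text says the DPM avoids.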
5) Big Answer Verifier
A big answer is a significant analysis result in big data analysis. The big answer verifier decides whether an analysis result is significant in the PLAS. It can thus support the data analyst's decisions through convergence or cluster analysis of the personalized log analysis results. The big answer verifier also creates metadata about the analysis process, including the total processing time, the analysis environment, the analysis request information, and so on. The PLAS uses this metadata as analysis history to optimize the system and improve the accuracy of big answer decisions.

By using the proposed PLAS, IT companies can construct an optimized analysis environment and provide an analyst-friendly system. Moreover, developers and analysts can cooperate through the PLAS.
B. Performance Influencing Factors of the PLAS
The performance of the PLAS can be considered in terms of the total processing time, which is determined by four elements: the pre-processing time, the distributed process managing time, the total analyzing time, and the big answer verifying time. Hence, the performance influencing factors are given in Eq. (1), with each notation explained in Table I.

Ttotal = Tp + Td + Ta + Tv    (1)

TABLE I. NOTATIONS AND EXPLANATIONS OF EQUATION 1

Notation   Explanation
Ttotal     Total processing time
Tp         Pre-processing time
Td         Distributed process managing time
Ta         Total analyzing time
Tv         Big answer verifying time
The distributed process managing time depends on the existing big data system, and the big answer verifier is not affected by differences in query path distance. Thus, we suppose that these two elements have the same value in both cases, and Eq. (1) reduces to:

Ttotal = Tp + Ta    (2)

As Eq. (2) shows, the pre-processing time (Tp) and the total analyzing time (Ta) are the important performance influencing factors, and each is related to the query path. In other words, the data grouper is the most important step, because the query path is decided by the storing method of the data grouper. That is why we propose the rule-based data grouping method.
Fig. 2. Flow Diagram of Data Grouper
C. Rule-based Data Grouping Method
The proposed rule-based data grouping method decreases the overhead of data pre-processing. Heavier overhead can occur in personalized log analysis than in general analysis because of query path distance. Hence, in personalized log analysis, log data should be stored using the rule-based data grouping method to provide the shortest query path. We introduce the rule-based data grouping method in detail through an example.
For example, the email logs used in this paper have two types: send logs and web logs. A send log is created by sending an email to a customer and contains information such as the sending time, the customer's email address, and the sending result. A web log is created when a customer opens the email or clicks an advertisement in it, and contains information such as which advertisement links were clicked (or not clicked), the clicking time, and the reading time of the email.
In most cases of email log analysis, analysts analyze the whole set of email logs for the total email sending success rate, the failure rate, error codes, and so on. Personalized log analysis, however, is totally different from general email log analysis. In personalized log analysis, analysts have known business goals; they are interested in each customer's information rather than in overall sending success rates or error codes. They analyze each customer's email logs with questions such as 'At which time does the customer read email?' or 'What are the customer's interests among the advertisements?'. Therefore, the grouping rule in this example can be the customer's email address and the clicked advertisement link information, since the goal is to provide personalized service to each customer. A service can then deliver emails with personalized advertisements using this grouping rule and the PLAS.
As this example shows, the analyst considers various information in the collected log data when defining the rules of the rule-based data grouping method. The proposed PLAS and the rule-based data grouping method contribute as follows.
D. Summary of Contributions
Personalized log analysis focuses on each customer. However, a general database structure requires querying the whole database to find a single customer's information, so overhead can occur in query processing. To solve this problem, the rule-based data grouping method groups and stores collected log data by a rule defined by the data analyst.

The data analyst defines the attributes for the grouping rule, considering the type of collected data and the business purpose of the personalized log analysis; the collected log data is then grouped by the rule and stored in the database. In the PLAS, we select MongoDB, which can store log data independently of schema. If some logs are related to each other, they are stored in the same MongoDB collection.

Therefore, the proposed PLAS with the rule-based data grouping method decreases pre-processing overhead by providing the shortest query path and enhances the performance of personalized log analysis, since the data analyst performs personalized log analysis on only the collections related to the analysis purpose.
IV. NUMERICAL ANALYSIS
The numerical analysis of the rule-based data grouping method is shown in Figs. 3 to 6. To experiment with practical log data, we used 20,000 to 320,000 email log records from the online marketing company Bizisolution in Korea. We compared the performance of the general storing method and the attribute grouping method on a single-node MongoDB. The attribute grouping method stores log data using the rule-based data grouping method. In the general storing method, log data is classified by log type, such as send log and web log; in the attribute grouping method, log data is classified by attributes such as the customer's email address, age, and gender. We also performed general statistical analyses for each customer, such as the email sending success rate and the clicked link types.
A. Total Processing Time
Fig. 3 shows the total processing time by the number of log records for each data storing method. The total processing time includes the storing, pre-processing, and analyzing time, as in Eq. (2). The attribute grouping method shows higher performance in every case. With more than 80,000 log records, the total processing time increases in both cases, because the log data is evaluated on a single-node MongoDB; if we evaluated the methods with distributed processing, the total processing time would be shortened. Nevertheless, the attribute grouping case still shows higher performance beyond 80,000 log records. We investigate the reason for this performance difference in the functional analysis.
B. Functional Analysis
To find the reason for the performance difference between the data storing methods, we performed a functional analysis of the processing time of the pre-processing and analysis modules. The processing time of each function is shown in Figs. 4 to 6.
Fig. 3. Total Processing time of data storing methods
Fig. 4 shows the storing time of each storing method, and Figs. 5 and 6 show the processing time of each step for 20,000 and 320,000 log records, respectively. The steps are storing the send log, storing the web log, and analysis including pre-processing. In Fig. 4, there is some overhead in the storing process because, to simulate the data grouping method, the large log data is stored on the single-node MongoDB at once; this overhead can be decreased in a real-time processing environment. The attribute grouping method requires more time to store log data, owing to the log reconstruction for attribute grouping. However, the attribute grouping case shows higher performance over the whole process, as shown in Figs. 5 and 6: performance was enhanced by about 57-88% using the rule-based data grouping method. As this result shows, if log data is stored using the attribute grouping method, personalized log analysis performance can be enhanced. In other words, the rule-based data grouping method yields a shorter query path distance than the general storing method.
For example, consider the email logs, which consist of send logs and web logs, and a query that finds one customer's information. The rule-based data grouping method needs only a one-step query path: 'database - user collection'. The general storing method, on the other hand, needs more than four steps: 'database - send collection - user data' and 'database - web collection - user data'. Hence, the rule-based data grouping method improves analysis performance by producing the shortest query path.
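The two query paths just contrasted can be sketched side by side. Dicts of lists stand in for MongoDB collections, and the record contents are invented for illustration; the point is only the shape of the lookup, not real query cost.

```python
# General storing method: collections keyed by log type, so finding one
# customer means filtering every type-based collection.
general = {
    "send": [{"email": "a@b.com", "result": "ok"},
             {"email": "b@c.com", "result": "ok"}],
    "web":  [{"email": "a@b.com", "link": "ad42"}],
}

# Rule-based grouping: one collection per customer.
grouped = {
    "a@b.com": [{"result": "ok"}, {"link": "ad42"}],
    "b@c.com": [{"result": "ok"}],
}

def query_general(email):
    """database - send collection - user data, then web collection - user data."""
    return ([d for d in general["send"] if d["email"] == email] +
            [d for d in general["web"] if d["email"] == email])

def query_grouped(email):
    """database - user collection: a single direct lookup."""
    return grouped[email]

assert len(query_general("a@b.com")) == len(query_grouped("a@b.com")) == 2
```

Both paths return the same two records for the customer, but `query_general` scans every document in both collections while `query_grouped` is one key lookup, which is the query-path shortening the experiment measures.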
V. CONCLUSION
In this paper, we proposed the Personalized Log Analysis System and the rule-based data grouping method to improve personalized analysis performance. In particular, the rule-based data grouping method decreases query overhead by providing the shortest query path distance in personalized log analysis.

Although the PLAS has storing overhead, overall system performance is improved by the data grouper, which shortens the query path distance and thus enhances analysis performance. Moreover, the storing overhead can be decreased in a real-time collecting and analyzing environment, unlike our experimental environment. In the future, we will proceed with step-by-step studies to improve the PLAS with real-time data grouping, workflow optimization, and distributed analysis in MongoDB and R.
ACKNOWLEDGMENT
This research was supported by the MSIP (Ministry of Science, ICT & Future Planning), Korea, under the ITRC (Information Technology Research Center) support program (NIPA-2014(H0301-14-1020)) supervised by the NIPA (National IT Industry Promotion Agency). Corresponding author: Eui-Nam Huh.
REFERENCES
[1] J. Manyika et al., "Big data: The next frontier for innovation, competition, and productivity," McKinsey Global Institute, June 2011.
[2] S. Kandel et al., "Enterprise data analysis and visualization: An interview study," IEEE TVCG, 2012.
[3] H. Chen, R. H. L. Chiang, and V. C. Storey, "Business Intelligence and Analytics: From Big Data to Big Impact," MIS Quarterly 36.4, 2012.
[4] J. R. Alam et al., "A Review on the Role of Big Data in Business," IJCSMC, 2014.
[5] P. Pääkkönen and D. Pakkala, "Report on Scalability of database technologies for entertainment services," nextMedia, December 2011.
[6] B. G. Tudorica and C. Bucur, "A comparison between several NoSQL databases with comments and notes," RoEduNet International Conference, IEEE, 2011.
[7] M. Kim et al., "Design and Implementation of MongoDB-based Unstructured Log Processing System over Cloud Computing Environment," Journal of Internet Computing and Services (JICS) 14.6, 2013, pp. 71-84.
[8] E. Dede et al., "Performance evaluation of a MongoDB and Hadoop platform for scientific data analysis," Proceedings of the 4th ACM Workshop on Scientific Cloud Computing, 2013.
[9] Y. J. Lee et al., "A Study on Data Pre-filtering Methods for Fault Diagnosis," Transactions of the Society of CAD/CAM Engineers 17.2, 2012, pp. 97-110.

Fig. 4. Total storing time of data storing methods
Fig. 5. Processing time of 20,000 logs
Fig. 6. Processing time of 320,000 logs