Welcome Message from the Chairs

The Steering Committee and the Programme Committee have the pleasure of welcoming attendees to the fourth edition of the International Conference on Innovative Computing Technology (INTECH 2014), which is dedicated to addressing the challenges in computing technologies for the new generation.
Continuing with the experience provided by the last three editions, INTECH 2014 emphasizes the need for producing newer aspects of computing technologies with a distinct focus on innovation. The conference embraces the latest innovative computing technology and allows attendees to discuss their experience of these newer aspects. INTECH features a full-fledged paper submission, review, and publication process that adheres to the high standards defined by IEEE. The INTECH 2014 conference explores advances in computing technologies that address new themes and innovative applications. It brings together researchers from various specialities in computer and information sciences who address both theoretical and applied aspects of computing technology and its applications. We hope that the discussions and exchange of ideas will contribute to advancements in the technology in the near future.
The conference received 169 submissions, of which 41 were accepted, an acceptance rate of approximately 24%. The accepted papers are authored by researchers from many countries and cover many significant areas of computing technology. Each paper was evaluated by at least three reviewers.
Finally, we hope that the conference fulfils your expectations and that the proceedings document the best research in the studied areas. We express our thanks to the IEEE UK & RI, the Society for Information Organization, UK, the DLINE database, the authors, and the organizers of the conference.
General Chair
Ezendu Ariwa, Bedfordshire University, UK

Program Chairs
Simon Fong, University of Macau, Macau
Ching-Hsien Hsu, Chung Hua University, Taiwan
Aziz El Janati El Idrissi, Mohammed V Agdal University, Morocco

Program Co-chairs
Yuxin Mao, Zhejiang Gongshang University, China
Yousef Ibrahim, Misurata University, Libya
Copyright and Reprint Permission: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy beyond the limit of U.S. copyright law for private use of patrons those articles in this volume that carry a code at the bottom of the first page, provided the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA
Contents
WWW Applications and Technologies
Computational Intelligence, Soft Computing and Optimization algorithms
Data and Network Mining
XML and other Extensible Languages
Network and Information Security
Cloud Computing
Applied Information Systems
Mobile Network and Systems
Data Stream Processing, Mobile/Sensor Networks and Signal Processing
WWW Applications and Technologies
Finite State Machine based Flow Analysis for WebRTC Applications
Sergej Alekseev, Christian von Harscher, Marco Schindler

Enrichment of Learner Profile through the Semantic Analysis of the Learner’s Query on the Web
Samia Ait Adda, Balla Amar

Healthcare-Event Driven Semantic Knowledge Extraction with Hybrid Data Repository
Hong Qing Yu, Xia Zhao, Xin Zhen, Feng Dong, Enjie Liu, Gordon Clapworthy
Computational Intelligence, Soft Computing and Optimization algorithms
ANFIS Based a Two-Phase Interleaved Boost Converter for Photovoltaic System
Donny Radianto, Masahito Shoyama

Scheduling Algorithms for Video Multiplexers in Surveillance Systems
Kuan Jen Lin, Tsai Kun Hou

Cost Optimization based on Brain Storming for Grid Scheduling
Maria Arsuaga-Rios, Miguel A. Vega-Rodriguez

A Study on the Development of Diagnosis Algorithm and Application Program for Early Diagnosis of Cervical Cancer using Cervix Cell
Han Yeong Oh, Seong Hyun Kim, Dong Wook Kim

Easy to Calib: Auto-Calibration of Camera from Sequential Images Based on VP and EKF
Yu Song, Fei Wang, Haiwei Yang, Sheng Gao

Image Imputation Based on Clustering Similarity Comparison
Sathit Prasomphan
Data and Network Mining
Volume Based Anomaly Detection using LRD Analysis of Decomposed Network Traffic
Khan Zeb, Basil AsSadhan, Jalal Al-Muhtadi, Saleh Alshebeili, Abdulmuneem Bashaiwth

Design and Implementation of Data Warehouse with Data Model using Survey-based Services Data
Boon Keong Seah, Nor Ezam Selan

Evaluating Textual Approximation to Classify Moving Object Trajectories
Huy Xuan Do, Hung-Hsuan Huang, Kyoji Kawagoe
XML and other Extensible Languages
An XSLT Transformation Method for Distributed XML
Hiroki Mizumoto, Nobutaka Suzuki

A Scalable XML Indexing Method Using MapReduce
Wen-Chiao Hsu, Hsiao-Chen Shih, I-En Liao
Network and Information Security
A Methodology of Assessing Security Risk of Cloud Computing in User Perspective for Security-Service-Level Agreements
Sang-Ho Na, Eui-Nam Huh

Improving the Performance of Network Traffic Prediction for Academic Organization by Using Association Rule Mining
Dulyawit Prangchumpol

Risks in Smart Environments and Adaptive Access Controls
Mariagrazia Fugini, Mahsa Teimourikia

MAR(S)2: Methodology to Articulate the Requirements for Security In SCADA
Tanaya Gopal, Madhuri Subbaraju, Rashmi Vivek Joshi, Somdip Dey
Cloud Computing
A Rule-based Data Grouping Method for Personal Log Analysis System in Big Data Computing
Yong-Hyun Kim, Eui-Nam Huh

Optimization for Reasonable Service Price in Broker based Cloud Service Environment
Young-Rok Shin, Eui-Nam Huh
A Rule-based Data Grouping Method for Personalized Log Analysis System in Big Data Computing

Yong-Hyun Kim and Eui-Nam Huh
Dept. of Computer Engineering, Kyung Hee University, Yongin, South Korea
Abstract—Nowadays, providing personalized service to customers is one of the main issues in big data services. To provide personalized service, analyzing various logs and cooperation between data analysts and developers are critical. However, overhead can occur when log data is analyzed, owing to the general characteristics of big data systems known as the 4Vs (Volume, Velocity, Variety, and Value). It is also generally hard for data analysts and developers to work together, because they use different interfaces. We therefore propose a personalized log analysis system that includes a rule-based data grouping method, in order to improve the performance of personalized log analysis and enable more flexible cooperation between data analysts and developers. The evaluation shows that the proposed system performs well for cooperation and grouping together with the R software tool.

Index Terms—Personalized log analysis, Big Data, NoSQL, R, log analysis
I. INTRODUCTION
Recently, the big data trend has been moving toward providing personalized services to customers. In other words, the main issue in big data is how customized information can be provided to customers using internal structured or semi-structured data, such as logs, and external non-structured data, such as SNS data [1]. To provide a personalized advertisement service, for example, an advertising agency might collect and analyze logs that can be interpreted as a customer's habitual behavior, such as email opening times and advertisement link click times. Efficient processing and analysis of various log data in a big data system are therefore the most important aspects of a personalized big data service.
However, many problems impede efficiency, such as differences between information schemas and analysis systems that are not optimized for analysts. Regarding the various schemas of log data, most IT companies use a relational database such as MySQL, which makes a big data system difficult to manage and analyze [2]. A relational database has tables with fixed fields (also called a schema). If log data with an unknown schema, such as non-structured data, is collected, the relational database has to create a new table or force the log data roughly into an existing schema. This causes overhead and can lead to inaccurate analysis. In personalized log analysis, log data must be considered from several aspects and business purposes, as distinct from general statistical analysis [3,4]. Companies thus need a database that can handle customer log data with various schemas. In this paper, we use MongoDB, a Not Only SQL (NoSQL) database, to manage the various schemas of log data, because NoSQL supports non-structured data and can execute queries with high performance [5].
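The schema mismatch described above can be sketched in a few lines. This is not the authors' code: the field names and helper functions are invented for illustration, and plain Python lists stand in for a relational table and a MongoDB collection.

```python
# A fixed relational schema rejects logs of an unknown shape, while a
# schema-free document store (e.g. a MongoDB collection) accepts them as-is.

FIXED_SCHEMA = {"time", "email", "result"}  # columns of a relational table

def insert_relational(table, row):
    """A relational table only accepts rows that match its fixed schema."""
    if set(row) != FIXED_SCHEMA:
        raise ValueError("unknown schema: would need a new table or forced mapping")
    table.append(row)

def insert_document(collection, doc):
    """A document store accepts documents of any shape in one collection."""
    collection.append(doc)

table, collection = [], []
send_log = {"time": "09:12", "email": "a@b.com", "result": "ok"}
web_log = {"time": "09:30", "email": "a@b.com", "link": "ad42", "read_secs": 14}

insert_relational(table, send_log)      # fits the fixed schema
insert_document(collection, send_log)   # both shapes fit one collection
insert_document(collection, web_log)

try:
    insert_relational(table, web_log)   # unknown fields: schema conflict
except ValueError as err:
    print(err)
```

With a real MongoDB deployment, the two `insert_document` calls would correspond to `db.logs.insert_one(...)` on a single collection, with no prior schema declaration.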
Moreover, analysis accuracy is relatively low owing to the lack of the professional analysis tools commonly used by analysts, such as SAS and R. Data management systems such as MySQL and Hadoop are generally used for statistical log analysis in existing systems, but a data management system alone is not enough for personalized analysis. Trying diverse analyses is important for determining a suitable analysis method in personalized analysis. Therefore, if a big data analysis system provides an R interface in cloud computing, a data analyst can perform various analyses based on previous experience and analyze log data with advanced analysis methods rather than Hadoop-based analysis alone. Future big data analysis should consider the above issues and prepare both NoSQL and professional analysis tools to handle the various schemas of log data.
In this study, we propose a MongoDB- and R-based Personalized Log Analysis System (PLAS) and a rule-based data grouping method that improve the performance of personalized log analysis by solving the above problems. The PLAS with the data grouping method can handle various schemas of log data with low overhead and supports diverse analyses through an R interface for personalized service. We apply the proposed system to a real email-based online marketing system and perform an experiment using practical log data from the online marketing company Bizisolution in Korea.
The remainder of this paper is organized as follows: Section II describes related work on NoSQL and data pre-processing. Section III proposes the PLAS and the rule-based data grouping method. A numerical experiment for performance evaluation is presented in Section IV. Finally, Section V summarizes this study and outlines future work.

978-1-4799-4233-6/14/$31.00 ©2014 IEEE
II. RELATED RESEARCH
A. NoSQL
NoSQL handles various data types, including structured, semi-structured, and non-structured data. One of its key advantages is high scalability. NoSQL databases can be scaled according to their classification: key-value store, column store, document store, and graph store [6]. A key-value store manages data as key-value pairs. A column store manages fields and values as key-value data. A document store manages any kind of data as a document. A graph store manages paired data including the relations among the data. To construct an optimized analysis system, developers need to select one of these NoSQL types according to the log data type and analysis method.
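The four store types above can be illustrated by representing one and the same log event in each model. The event, identifiers, and field names below are invented for illustration; plain Python structures stand in for the actual stores.

```python
# One log event represented in the four NoSQL models named above.

event = {"user": "a@b.com", "action": "click", "link": "ad42"}

# Key-value store: one opaque value per key.
kv = {"log:1001": str(event)}

# Column store: fields and values kept as key-value pairs under a row key.
column = {"1001": {"user": "a@b.com", "action": "click", "link": "ad42"}}

# Document store: the whole event stored as one document (MongoDB-style).
document = {"_id": 1001, **event}

# Graph store: nodes plus an explicit relation between them.
graph = {
    "nodes": ["a@b.com", "ad42"],
    "edges": [("a@b.com", "clicked", "ad42")],
}
```

The document form is the one the PLAS relies on, since a whole heterogeneous log record fits in a single self-describing unit.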
B. MongoDB and R
MongoDB is a well-known document-based NoSQL database, and some researchers have developed big data analysis systems using MongoDB in cloud computing [7,8]. MongoDB is characterized by collections and schema-freedom. A collection is a logical unit in a MongoDB database; in the proposed rule-based data grouping method, log data is grouped into collections. Developers can store BSON (Binary JSON, JavaScript Object Notation) log documents with various schemas. MongoDB offers high input/output performance, although its map/reduce performance is low.

R is a statistical computing language derived from the S language. About 5,600 packages are available in the CRAN package repository, and many analysts analyze data with various statistical R packages. Newly developed R algorithms and functions can easily be packaged using the provided functions. R is well suited to analysis in real-time systems because it always works in memory; on the other hand, R has difficulty processing large log data at once within limited memory space.

Owing to these pros and cons, we use MongoDB and R complementarily in the PLAS. Unlike R, MongoDB is inadequate for analysis, but it can manage various schemas of log data efficiently. In particular, MongoDB has high compatibility with R through packages such as 'rmongodb'. Therefore, we propose the PLAS based on MongoDB and R.
C. Data Pre-processing
Data pre-processing is considered one of the most important steps in data mining. It is classified into data refinement, data integration, data reduction, and data transformation. Data refinement handles missing values and noise. Data integration merges duplicated data. Data reduction reduces data size, for example by combining data into a data cube. Data transformation normalizes data, as general ETL (Extraction/Transformation/Loading) tools do. Data pre-processing should be performed before the analysis step to prevent errors and improve the performance and accuracy of data analysis [9].

Therefore, log data should be pre-processed to decrease analysis errors in the PLAS. However, if the PLAS pre-processes large log data at once, overhead can occur, because the pre-processing must read the log data from the database. We therefore propose a rule-based data grouping method that decreases this overhead by minimizing the query path distance.
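The pre-processing steps listed above can be sketched on toy log records. This is a minimal illustration, not the paper's implementation: the field names and default values are assumptions.

```python
# Three of the pre-processing steps applied to toy email-log records:
# refinement (missing values), integration (duplicates), reduction (cube).

logs = [
    {"email": "a@b.com", "read_secs": 14},
    {"email": "a@b.com", "read_secs": None},   # missing value
    {"email": "a@b.com", "read_secs": 14},     # exact duplicate
]

def refine(records, default=0):
    """Data refinement: fill missing values with a default."""
    return [{**r, "read_secs": r["read_secs"] if r["read_secs"] is not None else default}
            for r in records]

def integrate(records):
    """Data integration: drop exact duplicates while keeping order."""
    seen, out = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

def reduce_(records):
    """Data reduction: aggregate per customer (a tiny 'data cube')."""
    cube = {}
    for r in records:
        cube[r["email"]] = cube.get(r["email"], 0) + r["read_secs"]
    return cube

clean = integrate(refine(logs))
print(reduce_(clean))  # aggregated reading time per customer
```

Each function touches every record, which is why running these steps over the whole database at once is the overhead the rule-based grouping method is designed to avoid.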
III. PERSONALIZED LOG ANALYSIS SYSTEM (PLAS)
The proposed PLAS consists of a Dashboard, a Data Grouper, a Pre-processor, a Distributed Process Manager, and a Big Answer Verifier, which operate in a serial procedure from the dashboard to the big answer verifier. The flow diagram of the PLAS is shown in Fig. 1. In the dashboard, the analyst can configure and control the analysis method, the grouping rules, and MongoDB queries. Collected log data is then stored and analyzed by the serial processing. First, the data grouper stores collected log data according to the grouping rules. The pre-processor performs data mining and workflow optimization. Next, the distributed process manager allocates jobs to the analysis modules of each node using the optimized workflow. Finally, the big answer verifier decides whether the analysis result is a big answer.
A. System Architecture
1) Dashboard
The dashboard of the PLAS performs rule management, workflow management, and system monitoring. The data analyst manages rules for data grouping and log data analysis. A data grouping rule includes the attributes that serve as grouping criteria and the handling methods for missing values. An analysis rule includes analysis information such as the analysis method and the maximum node count. The data analyst also creates workflows in the dashboard. A workflow depends on the log data to be analyzed, the log type, the analysis purpose, the system environment, and so on. Workflow metadata, such as the total processing time, the analyzed log count, and the allocated node count, is stored to optimize the workflow in the pre-processing step.
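The two rule kinds and the workflow metadata described above might look like the following. The paper does not define a concrete rule format, so every field name here is our assumption.

```python
# Hypothetical shapes for the dashboard's grouping rule, analysis rule,
# and workflow metadata (field names invented for illustration).

grouping_rule = {
    "group_by": ["email"],               # attributes used as grouping criteria
    "missing_values": {"read_secs": 0},  # how to handle missing values per field
}

analysis_rule = {
    "method": "kmeans",                  # analysis method to apply
    "max_nodes": 4,                      # maximum node count for the job
}

workflow_metadata = {
    "total_processing_time": None,       # filled in after each run
    "analyzed_log_count": None,
    "allocated_node_count": None,
}
```

Keeping the rules as plain data like this is what lets the analyst edit them in the dashboard while the developers' grouping and analysis code consumes them unchanged.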
2) Data Grouper
Fig. 1. Personalized log analysis System Flow Diagram
The data grouper groups collected log data for personalized log analysis and stores the log data in MongoDB. Collected log data is handled as shown in Fig. 2. The data grouper handles various schemas of log data in a serial process: log parsing, rule matching, rule creation, group finding, and log data insertion. When unknown log data is collected, the data grouper creates temporary rules and groups, called collections in MongoDB, and stores the log data using the rule-based data grouping method.

Therefore, the data grouper stores collected log data according to the data grouping rules and supports high-performance log analysis. The rule-based data grouping method is described in more detail in Section III.C.
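The serial process just listed can be sketched end to end. This is a simplification under stated assumptions: the raw log format (`type|key=value|...`), the rule shape, and a dict of lists standing in for MongoDB collections are all invented for illustration.

```python
# Sketch of the data grouper's serial process: parse, match a rule, create
# a temporary rule for unknown log types, find the group, and insert.

rules = {"send": {"group_by": "email"}}   # known grouping rules, keyed by log type
collections = {}                          # stands in for MongoDB collections

def parse(raw):
    """Log parsing: 'type|k=v|k=v' into a dict (assumed raw format)."""
    log_type, *fields = raw.split("|")
    doc = dict(f.split("=", 1) for f in fields)
    doc["type"] = log_type
    return doc

def group_and_insert(raw):
    doc = parse(raw)                                  # log parsing
    rule = rules.get(doc["type"])                     # rule matching
    if rule is None:                                  # rule creation for unknown logs
        rule = rules[doc["type"]] = {"group_by": "email", "temporary": True}
    group = doc.get(rule["group_by"], "ungrouped")    # group finding
    collections.setdefault(group, []).append(doc)     # log data insertion

group_and_insert("send|email=a@b.com|result=ok")
group_and_insert("web|email=a@b.com|link=ad42")       # unknown type: temp rule created
```

After these two calls, both logs for the same customer land in one group, which is exactly the property the later query-path argument relies on.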
3) Pre-processor
The pre-processor performs data mining and workflow optimization. The data mining here is a general data pre-processing step; for example, the pre-processor checks the log data to handle missing values and noise. The workflow optimization adjusts the workflow according to the log type, the amount of log data, and the number of available analysis modules. The workflow includes the distributed analysis method, log data summarization, and a proper analysis method. Using workflow optimization, the PLAS can maintain optimized analysis performance.
4) Distributed Process Manager
The distributed process manager (DPM) manages the status of each node, such as its database list and R connections. The DPM performs database searching and scheduling: according to the optimized workflow, it searches the databases in which the logs are stored and schedules workflows for load balancing.

In this paper, we construct the DPM with MongoDB and R. MongoDB is used only for managing log data; distributed analysis is performed with R. Therefore, MongoDB and R are installed on each node and work complementarily for distributed analysis. To decrease the overhead of database searching, the DPM stores the database and collection list of each MongoDB instance. R processes can communicate in a server-client relation using the 'RServe' and 'RSclient' packages. Hence, we can construct a distributed analysis environment using MongoDB and R.
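The DPM's bookkeeping can be sketched as follows. The node names, status fields, and the round-robin policy are our assumptions; the paper specifies load balancing but not the exact algorithm.

```python
# Sketch of DPM bookkeeping: per-node status (cached collection list plus
# R connection state) and a simple round-robin schedule for load balance.
import itertools

nodes = {
    "node1": {"collections": {"a@b.com", "b@c.com"}, "r_connected": True},
    "node2": {"collections": {"c@d.com"}, "r_connected": True},
}

def find_node(collection):
    """Database searching via the cached list: which node holds the collection?"""
    for name, status in nodes.items():
        if collection in status["collections"]:
            return name
    return None

def schedule(jobs):
    """Round-robin scheduling of analysis jobs over R-connected nodes."""
    ready = [n for n, s in nodes.items() if s["r_connected"]]
    return {job: node for job, node in zip(jobs, itertools.cycle(ready))}

print(find_node("c@d.com"))
print(schedule(["job1", "job2", "job3"]))
```

Caching the collection list in `nodes` is what lets `find_node` answer without querying every MongoDB instance, which is the search overhead the text says the DPM avoids.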
5) Big Answer Verifier
A big answer is a significant analysis result in big data analysis. The big answer verifier decides whether an analysis result is significant in the PLAS. It can thus support the data analyst's decisions through convergence or cluster analysis of the personalized log analysis results. The big answer verifier also creates metadata about the analysis process, including the total processing time, the analysis environment, the analysis request information, and so on. The PLAS uses this metadata as analysis history to optimize the system and improve the accuracy of big answer decisions.

By using the proposed PLAS, IT companies can construct an optimized analysis environment and provide an analyst-friendly system. Moreover, developers and analysts can cooperate through the PLAS.
B. Performance Influencing Factors of the PLAS
The performance of the PLAS can be considered in terms of the total processing time, which is determined by four elements: the pre-processing time, the distributed process managing time, the total analyzing time, and the big answer verifying time. Hence, the performance influencing factors are given in Eq. (1), with each notation explained in Table I.

Ttotal = Tp + Td + Ta + Tv    (1)

TABLE I. NOTATIONS AND EXPLANATIONS OF EQUATION 1

Notation   Explanation
Ttotal     Total processing time
Tp         Pre-processing time
Td         Distributed process managing time
Ta         Total analyzing time
Tv         Big answer verifying time
The distributed process managing time depends on the existing big data system, and the big answer verifier is not affected by differences in query path distance. Thus, we suppose that these two elements have the same value in both cases, and Eq. (1) reduces to:

Ttotal = Tp + Ta    (2)

As Eq. (2) shows, the pre-processing time (Tp) and the total analyzing time (Ta) are the important performance influencing factors, and each is related to the query path. In other words, the data grouper is the most important step, because the query path is decided by the storing method of the data grouper. That is why we propose the rule-based data grouping method.
Fig. 2. Flow Diagram of Data Grouper
C. Rule-based Data Grouping Method
The proposed rule-based data grouping method decreases the overhead of data pre-processing. Heavier overhead can occur in personalized log analysis than in general analysis because of query path distance. Hence, in personalized log analysis, log data should be stored using the rule-based data grouping method to provide the shortest query path. We introduce the rule-based data grouping method in detail through an example.
For example, the email logs used in this paper have two types: send logs and web logs. A send log is created by sending an email to a customer and contains information such as the sending time, the customer's email address, and the sending result. A web log is created when a customer opens the email or clicks an advertisement in it, and contains information such as which advertisement links were clicked (or not clicked), the clicking time, and the reading time of the email.
In most cases of email log analysis, analysts analyze the whole set of email logs for the total email sending success rate, the failure rate, error codes, and so on. Personalized log analysis, however, is totally different from general email log analysis. In personalized log analysis, analysts have known business goals; they are interested in each customer's information rather than in overall sending success rates or error codes. They analyze each customer's email logs with questions such as 'At which time does the customer read email?' or 'What are the customer's interests among the advertisements?'. Therefore, the grouping rule in this example can be the customer's email address and the clicked advertisement link information, since the goal is to provide personalized service to each customer. A service can then deliver emails with personalized advertisements using this grouping rule and the PLAS.
As this example shows, the analyst considers various information in the collected log data when defining the rules of the rule-based data grouping method. The proposed PLAS and the rule-based data grouping method contribute as follows.
D. Summary of Contributions
Personalized log analysis focuses on each customer. However, a general database structure requires querying the whole database to find a single customer's information, so overhead can occur in query processing. To solve this problem, the rule-based data grouping method groups and stores collected log data by a rule defined by the data analyst.

The data analyst defines the attributes for the grouping rule, considering the type of collected data and the business purpose of the personalized log analysis; the collected log data is then grouped by the rule and stored in the database. In the PLAS, we select MongoDB, which can store log data independently of schema. If some logs are related to each other, they are stored in the same MongoDB collection.

Therefore, the proposed PLAS with the rule-based data grouping method decreases pre-processing overhead by providing the shortest query path and enhances the performance of personalized log analysis, since the data analyst performs personalized log analysis on only the collections related to the analysis purpose.
IV. NUMERICAL ANALYSIS
The numerical analysis of the rule-based data grouping method is shown in Figs. 3 to 6. To experiment with practical log data, we used 20,000 to 320,000 email log records from the online marketing company Bizisolution in Korea. We compared the performance of the general storing method and the attribute grouping method on a single-node MongoDB. The attribute grouping method stores log data using the rule-based data grouping method. In the general storing method, log data is classified by log type, such as send log and web log; in the attribute grouping method, log data is classified by attributes such as the customer's email address, age, and gender. We also performed general statistical analyses for each customer, such as the email sending success rate and the clicked link types.
A. Total Processing Time
Fig. 3 shows the total processing time by the number of log records for each data storing method. The total processing time includes the storing, pre-processing, and analyzing time, as in Eq. (2). The attribute grouping method shows higher performance in every case. With more than 80,000 log records, the total processing time increases in both cases, because the log data is evaluated on a single-node MongoDB; if we evaluated the methods with distributed processing, the total processing time would be shortened. Nevertheless, the attribute grouping case still shows higher performance beyond 80,000 log records. We investigate the reason for this performance difference in the functional analysis.
B. Functional Analysis
To find the reason for the performance difference between the data storing methods, we performed a functional analysis of the processing time of the pre-processing and analysis modules. The processing time of each function is shown in Figs. 4 to 6.
Fig. 3. Total Processing time of data storing methods
Fig. 4 shows the storing time of each storing method, and Figs. 5 and 6 show the processing time of each step for 20,000 and 320,000 log records, respectively. The steps are storing the send log, storing the web log, and analysis including pre-processing. In Fig. 4, there is some overhead in the storing process because, to simulate the data grouping method, the large log data is stored on the single-node MongoDB at once; this overhead can be decreased in a real-time processing environment. The attribute grouping method requires more time to store log data, owing to the log reconstruction for attribute grouping. However, the attribute grouping case shows higher performance over the whole process, as shown in Figs. 5 and 6: performance was enhanced by about 57-88% using the rule-based data grouping method. As this result shows, if log data is stored using the attribute grouping method, personalized log analysis performance can be enhanced. In other words, the rule-based data grouping method yields a shorter query path distance than the general storing method.
For example, consider the email logs, which consist of send logs and web logs, and a query that finds one customer's information. The rule-based data grouping method needs only a one-step query path: 'database - user collection'. The general storing method, on the other hand, needs more than four steps: 'database - send collection - user data' and 'database - web collection - user data'. Hence, the rule-based data grouping method improves analysis performance by producing the shortest query path.
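The two query paths just contrasted can be sketched side by side. Dicts of lists stand in for MongoDB collections, and the record contents are invented for illustration; the point is only the shape of the lookup, not real query cost.

```python
# General storing method: collections keyed by log type, so finding one
# customer means filtering every type-based collection.
general = {
    "send": [{"email": "a@b.com", "result": "ok"},
             {"email": "b@c.com", "result": "ok"}],
    "web":  [{"email": "a@b.com", "link": "ad42"}],
}

# Rule-based grouping: one collection per customer.
grouped = {
    "a@b.com": [{"result": "ok"}, {"link": "ad42"}],
    "b@c.com": [{"result": "ok"}],
}

def query_general(email):
    """database - send collection - user data, then web collection - user data."""
    return ([d for d in general["send"] if d["email"] == email] +
            [d for d in general["web"] if d["email"] == email])

def query_grouped(email):
    """database - user collection: a single direct lookup."""
    return grouped[email]

assert len(query_general("a@b.com")) == len(query_grouped("a@b.com")) == 2
```

Both paths return the same two records for the customer, but `query_general` scans every document in both collections while `query_grouped` is one key lookup, which is the query-path shortening the experiment measures.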
V. CONCLUSION
In this paper, we proposed the Personalized Log Analysis System and the rule-based data grouping method to improve personalized analysis performance. In particular, the rule-based data grouping method decreases query overhead by providing the shortest query path distance in personalized log analysis.

Although the PLAS has storing overhead, overall system performance is improved by the data grouper, which shortens the query path distance and thus enhances analysis performance. Moreover, the storing overhead can be decreased in a real-time collecting and analyzing environment, unlike our experimental environment. In the future, we will proceed with step-by-step studies to improve the PLAS with real-time data grouping, workflow optimization, and distributed analysis in MongoDB and R.
ACKNOWLEDGMENT
This research was supported by the MSIP (Ministry of Science, ICT & Future Planning), Korea, under the ITRC (Information Technology Research Center) support program (NIPA-2014(H0301-14-1020)) supervised by the NIPA (National IT Industry Promotion Agency). Corresponding author: Eui-Nam Huh.
REFERENCES
[1] J. Manyika et al., "Big data: The next frontier for innovation, competition, and productivity," McKinsey Global Institute, June 2011.
[2] S. Kandel et al., "Enterprise data analysis and visualization: An interview study," IEEE TVCG, 2012.
[3] H. Chen, R. H. L. Chiang, and V. C. Storey, "Business Intelligence and Analytics: From Big Data to Big Impact," MIS Quarterly 36.4, 2012.
[4] J. R. Alam et al., "A Review on the Role of Big Data in Business," IJCSMC, 2014.
[5] P. Pääkkönen and D. Pakkala, "Report on Scalability of database technologies for entertainment services," nextMedia, December 2011.
[6] B. G. Tudorica and C. Bucur, "A comparison between several NoSQL databases with comments and notes," RoEduNet International Conference, IEEE, 2011.
[7] M. Kim et al., "Design and Implementation of MongoDB-based Unstructured Log Processing System over Cloud Computing Environment," Journal of Internet Computing and Services (JICS) 14.6, 2013, pp. 71-84.
[8] E. Dede et al., "Performance evaluation of a MongoDB and Hadoop platform for scientific data analysis," Proceedings of the 4th ACM Workshop on Scientific Cloud Computing, 2013.
[9] Y. J. Lee et al., "A Study on Data Pre-filtering Methods for Fault Diagnosis," Transactions of the Society of CAD/CAM Engineers 17.2, 2012, pp. 97-110.

Fig. 4. Total storing time of data storing methods
Fig. 5. Processing time of 20,000 logs
Fig. 6. Processing time of 320,000 logs