© LeanBigData Consortium Page 1 of 26
Project Acronym: LeanBigData
Project Title: Ultra-Scalable and Ultra-Efficient Integrated and Visual Big Data Analytics
Project Number: 619606
Instrument: STREP
Call Identifier: ICT-2013-11
D9.4 Second Workshop Report
Work Package: WP9 – Exploitation, Industrial Awareness, Dissemination
Due Date: 31/01/2017
Submission Date: 31/01/2017
Start Date of Project: 01/02/2014
Duration of Project: 36 Months
Organisation Responsible for Deliverable: ICCS
Version: 1.0
Status: Final
Author(s): Fotis Aisopos (ICCS)
Contributing partners: CA, SyncLab, PT, UPM, LeanXcale, FORTH, Intel, INESC, Atos
Reviewer(s): Marta Patiño
Nature: R – Report | P – Prototype | D – Demonstrator | O – Other
Dissemination level: PU – Public | CO – Confidential, only for members of the consortium (including the Commission Services) | RE – Restricted to a group specified by the consortium (including the Commission Services)
Project co-funded by the European Commission within the Seventh Framework Programme (2007-2013)
Revision history

Version | Date | Modified by | Comments
0.1 | 06/12/2016 | Fotis Aisopos (ICCS/NTUA) | ToC
0.2 | 15/12/2016 | Fotis Aisopos (ICCS/NTUA) | First Draft Version
0.3 | 23/12/2016 | Vrettos Moulos (ICCS/NTUA) | Full Version
0.4 | 26/01/2017 | Marta Patiño (UPM) | Introduction Review
1.0 | 30/01/2017 | Fotis Aisopos (ICCS/NTUA) | Final Version
Executive Summary
LeanBigData targets the integration of polyglot persistence and the blending of the OLTP and OLAP worlds. Task 9.3 (“Public Workshops”) aimed at the organisation of two open workshops for the key players in this domain. This is the second deliverable of the task, reporting on the second public workshop (RTPBD 2016 [1]), held in Heraklion, Greece, in conjunction with DISCOTEC 2016 [2]. It was a joint workshop with another Big Data project, the CoherentPaaS EU project [3]. RTPBD presented and demonstrated both projects’ results to a wider public.
The consortium also decided to co-organise a third public workshop (POLYGLOT DATABASES), held in conjunction with the VLDB conference in New Delhi, India, which is also reported in this deliverable. The POLYGLOT DATABASES workshop [4] was allocated a session within BOSS’16 (Big Data Open Source Systems) [5]: both workshops had been accepted at VLDB [6] but, due to space constraints, accepted workshops had to be merged, so they were joined into a single workshop. VLDB is the most prestigious database-related scientific conference worldwide, so the joint workshop that LeanBigData was part of was a great opportunity to highlight the project results to an expert audience.
Table of Contents
Executive Summary ...................................................................................................... 3
Abbreviations and acronyms ....................................................................................... 6
1. Introduction ............................................................................................................ 7
1.1. Relation with other deliverables ........................................................................ 8
2. Second Project Workshop – RTPBD..................................................................... 9
2.1. Workshop Committee ........................................................................................ 9
2.2. Website, Important Dates and Program ............................................................ 9
2.3. Content and Results ........................................................................................ 11
3. POLYGLOT DATABASES Workshop .................................................................. 13
3.1. Workshop Committee ...................................................................................... 13
3.2. Website and Agenda ....................................................................................... 13
3.3. Content and Results ........................................................................................ 15
4. Summary ............................................................................................................... 16
5. References ............................................................................................................ 17
Annex I. RTPBD Workshop Paper Abstracts ......................................................... 18
Annex II. POLYGLOT DATABASES Workshop Paper Abstracts ............................. 25
Index of Figures
Figure 1: LeanBigData components ............................................................................................... 8
Figure 2: DISCOTEC poster .......................................................................................................... 9
Figure 3: RTPBD Workshop page in DISCOTEC website .......................................................... 10
Figure 4: Image from RTPBD Workshop .................................................................................... 12
Figure 5: BOSS website main page ............................................................................................... 13
Figure 6: POLYGLOT DATABASES Workshop webpage ......................................................... 14
Index of Tables
Table 1: RTPBD Program Committee ............................................................................................ 9
Table 2: RTPBD Workshop Agenda ............................................................................................ 11
Table 3: POLYGLOT DATABASES Program Committee ......................................................... 13
Table 4: POLYGLOT DATABASES Workshop Agenda ............................................................ 14
Table 5: RTPBD Workshop Paper Titles and Abstracts ............................................................... 24
Table 6: POLYGLOT DATABASES Workshop Paper Titles and Abstracts .............................. 26
Abbreviations and acronyms
BOSS Big Data Open Source Systems
CEP Complex Event Processing
CoherentPaaS Coherent and Rich PaaS with a Common Programming Model
DoW Description of Work
EU European Union
FP Framework Programme
OLAP On-line Analytical Processing
OLTP On-line Transaction Processing
PT Portugal Telecom
RTPBD integRation of polygloT Persistence and BlenDing
VLDB Very Large Data Bases
WP Work Package
1. Introduction
LeanBigData is an ultra-scalable and ultra-efficient big data platform integrating in one product the three main big data technologies: a novel transactional NoSQL key-value data store, a distributed complex event processing (CEP) system, and a distributed SQL database. The platform is designed to achieve scalability in a very efficient way avoiding the inefficiencies and delays introduced by current Extract-Transform-Load-based (ETL) approaches. Currently, one of the main issues in data management at enterprises and other organizations is the fact that databases are either operational (OLTP-OnLine Transactional Processing) or analytical (OLAP-OnLine Analytical Processing). This leads to a separation of the management of the operational data performed at operational databases, and the management of analytical queries performed at analytical databases or data warehouses. This separation results in having to copy the data periodically from the operational database into the data warehouse. This copy process is termed Extract-Transform-Load (ETL). ETLs are estimated to consume 75-80% of the budget for business analytics.
LeanBigData solves this issue by providing a single database, LeanXcale, that combines both capabilities, operational and analytical.
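The contrast between the ETL cycle described above and a single hybrid (OLTP + OLAP) database can be sketched in a few lines. The sketch below is purely illustrative: it uses Python's built-in sqlite3 as a stand-in for both the operational store and the warehouse, and is not LeanXcale code.

```python
import sqlite3

# Operational (OLTP) store: receives transactional writes.
op = sqlite3.connect(":memory:")
op.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
op.execute("INSERT INTO orders VALUES (1, 10.0), (2, 25.5)")

# Classic ETL cycle: periodically Extract rows from the operational
# store, Transform them, and Load them into a separate warehouse
# table before any analytics can see them.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE orders_fact (id INTEGER, amount REAL)")
rows = op.execute("SELECT id, amount FROM orders").fetchall()   # Extract
dw.executemany("INSERT INTO orders_fact VALUES (?, ?)", rows)   # Load

# Analytics then run on copied, potentially stale data:
total = dw.execute("SELECT SUM(amount) FROM orders_fact").fetchone()[0]

# The hybrid approach: the same analytical query runs directly on the
# operational store, with no copy step and no staleness window.
total_htap = op.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
assert total == total_htap == 35.5
```

With the hybrid approach, the periodic copy step, and the ETL budget it consumes, simply disappears.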
Another aspect in which LeanBigData innovates lies in the efficiency of the transactional processing and the storage engine. The transactional processing has been re-architected and re-implemented to be an order of magnitude more efficient than the initial version at the beginning of the project. A new storage engine, KiVi, has been architected and implemented from scratch. It is based on a new data structure to be efficient both for range queries and updates.
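KiVi's internal data structure is not detailed in this report. As a generic illustration of the requirement it must meet, a single structure serving both in-place updates and ordered range scans, consider the minimal sorted key-value map below. All names are hypothetical and the sketch says nothing about KiVi's actual design:

```python
import bisect

class SortedKV:
    """Toy sorted key-value map: keeping keys ordered makes range
    scans a simple slice, while binary search keeps point lookups
    and updates at O(log n) comparisons."""

    def __init__(self):
        self._keys, self._vals = [], []

    def put(self, key, value):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._vals[i] = value        # in-place update
        else:
            self._keys.insert(i, key)    # insert, preserving order
            self._vals.insert(i, value)

    def range(self, lo, hi):
        """All (key, value) pairs with lo <= key < hi, in key order."""
        i = bisect.bisect_left(self._keys, lo)
        j = bisect.bisect_left(self._keys, hi)
        return list(zip(self._keys[i:j], self._vals[i:j]))

kv = SortedKV()
for k in (5, 1, 3, 9):
    kv.put(k, str(k))
assert kv.range(2, 9) == [(3, "3"), (5, "5")]
```

Real storage engines use disk-aware structures rather than in-memory arrays, but the interface, point updates plus ordered scans, is the same.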
Another main innovation brought by LeanBigData is in the area of data streaming, where the goal has been to produce an efficient, scalable, distributed complex event processing engine.
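As a generic illustration of the kind of continuous query a CEP engine evaluates, and not LeanBigData's actual implementation, the single-node sketch below computes a sliding-window average over an event stream:

```python
from collections import deque

def sliding_window_avg(events, window=3):
    """Emit the average of the last `window` event values each time
    a new event arrives -- a standing query evaluated incrementally
    over an unbounded stream, rather than a one-shot batch query."""
    buf = deque(maxlen=window)   # bounded buffer: old events expire
    out = []
    for value in events:
        buf.append(value)
        out.append(sum(buf) / len(buf))
    return out

assert sliding_window_avg([2, 4, 6, 8]) == [2.0, 3.0, 4.0, 6.0]
```

A distributed CEP engine parallelises such operators across nodes and chains them into dataflows, but each operator follows this incremental, per-event pattern.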
The LeanBigData platform is equipped with a visualization subsystem able to report incremental visualization of results of long analytical queries, and with an advanced anomaly detection and root cause analysis module. The visualization subsystem also supports efficient manipulation of visualizations and query results through hand gestures.
Four use cases have been integrated with the developed infrastructure to demonstrate the value of the LeanBigData platform and validate it.
The main components of the LeanBigData platform are shown in Figure 1.
Figure 1: LeanBigData components.
The project is divided into nine work packages. This deliverable belongs to work package 9 and is the result of task 9.3.
1.1. Relation with other deliverables
The goal of work package 9 is to promote and empower the dissemination, transfer, collaboration, exploitation, assessment, and broad uptake of the LeanBigData project results by the target audience and stakeholders. D9.4 reports on the 2nd Public Project Workshop (“Final Public Workshop from LeanBigData and CoherentPaaS” - RTPBD 2016), organised by the LeanBigData project during M28. It is the second deliverable of Task 9.3 “Public workshops”, following the submission of D9.3, which reported on the 1st Project Workshop.
Moreover, it reports on a 3rd Public Workshop (“POLYGLOT DATABASES”, organised as a separate session of the BOSS 2016 Workshop), co-organised by LeanBigData during M31.
The rest of the document is structured as follows:
• Chapter 2 presents information about the Second Workshop’s committees, the workshop content and the dissemination channels used. It also describes the workshop papers and how they reflect the work done in the context of the project.
• Chapter 3 reports on the Third Workshop in India, its focus and the general outcomes of LeanBigData presented there.
• Chapter 4 presents a summary of the work done in this task.
• Annex I lists the RTPBD Workshop papers’ abstracts and authors.
• Annex II lists the POLYGLOT DATABASES Workshop papers’ abstracts and authors.
2. Second Project Workshop – RTPBD
This chapter provides a full report of the Second Workshop’s information, agenda and results. The RTPBD Workshop was hosted in Heraklion on June 9, 2016, in conjunction with the 11th International Federated Conference on Distributed Computing Techniques (DISCOTEC 2016). The DisCoTec conference series is one of the major events sponsored by the International Federation for Information Processing (IFIP).
Figure 2 : DISCOTEC poster
2.1. Workshop Committee
The Program Committee of the RTPBD workshop is listed in Table 1, and it includes key persons of the LeanBigData consortium:
Name Affiliation
Dr. Ricardo Jimenez-Peris LeanXcale, Spain
Prof. Marta Patiño-Martinez UPM, Spain
Dr. Patrick Valduriez INRIA, France
Table 1: RTPBD Program Committee
2.2. Website, Important Dates and Program
RTPBD workshop was disseminated through a dedicated web page, included in the main event’s (DISCOTEC 2016) website:
http://2016.discotec.org/index.php?MG=60&Mid=34&sub1=1
A screenshot of this page can be observed in the following Figure 3:
Figure 3: RTPBD Workshop page in DISCOTEC website
As defined in the aforementioned page, the important dates set for the Workshop were the following:
• April 17, 2016: Full paper submission
• May 2, 2016: Notification to authors
• June 9, 2016: RTPBD in Heraklion
In total, 18 papers were accepted by the workshop. The workshop chairs, in charge of the reviewing process, ensured that each paper received at least two reviews, following a double-blind process.
Those papers were divided into four sessions, as can be observed in the workshop agenda:

No. | Session | Paper Title | Authors
1 | June 9, Session 1 (09:30 - 10:30) | CoherentPaaS Vision | Ricardo Jiménez Peris
2 | Session 1 | LeanBigData Vision | Ricardo Jiménez Peris
3 | Session 1 | ActivePivot improvements from CoherentPaaS | Francois Sabary
4 | June 9, Session 2 (11:00 - 12:00) | CQE: A middleware to execute queries across heterogeneous databases | Raquel Pau
5 | Session 2 | Prepared Scan: Efficient Retrieval of Structured Data from HBase | Francisco Neves, Ricardo Vilaça, José Pereira and Rui Oliveira
6 | Session 2 | Big Data Stream Clustering Algorithms Empirical Evaluation | Annie Ibrahim Rana, Giovani Estrada and Marc Sole
7 | Session 2 | The LeanBigData Data Collection Framework: An innovate and adaptable framework for collection and normalization of structured data | Luigi Coppolino, Luigi Sgaglione, Gaetano Papale and Ferdinando Campanile
8 | June 9, Session 3 (14:00 - 16:00) | Detecting Performance Degradation with System Level Metrics | Dimitris Ganosis, Yannis Sfakianakis, Manolis Marazakis and Angelos Bilas
9 | Session 3 | Transactional Support for Cloud Data Stores | Ricardo Jiménez Peris, Marta Patiño Martínez, Ivan Brondino
10 | Session 3 | Satisfying Telecom and IoT big data application requirements using multiple data stores in a coherent way | Vassilis Spitadakis, Dimitrios Bouras, Yiorgos Panagiotakis and Apostolos Hatzimanikatis
11 | Session 3 | CoherentPaaS - Real-Time Network Performance Analysis in a Telco Environment Use Case | Luis Cortesão and Diogo Regateiro
12 | Session 3 | A Taxonomy of Multistore Systems | Carlyna Bondiombouy and Patrick Valduriez
13 | Session 3 | Targeted Advertisement case-study: a LeanBigData benchmark | Jorge Teixeira, Miguel Biscaia, Ivan Brondino and Mario Moreira
14 | June 9, Session 4 (16:30 - 18:30) | Sentiment Analysis over politics-related big Twitter datasets | Fotis Aisopos, Vrettos Moulos, Athanasia Evangelinou
15 | Session 4 | STREAM-OPS: a Streaming Operator Library | Ricardo Jiménez Peris, Valerio Vianello, Marta Patiño Martínez
16 | Session 4 | Transactional MongoDB | Pavlos Kranas, Sotiris Stamokostas, George Vafiadis, Athanasia Evangelinou
17 | Session 4 | Distributed Processing and Transaction Replication in MonetDB - Towards a Scalable Analytical Database System in the Cloud | Ying Zhang, Dimitar Nedev, Panagiotis Koutsourakis and Martin Kersten
18 | Session 4 | Gestural Interaction with Data Center 3D Visualizations | Giannis Drossis, Chryssi Birliraki, Nikolaos Patsiouras, George Margetis and Constantine Stephanidis

Table 2: RTPBD Workshop Agenda

A list of all RTPBD paper abstracts can be found in Annex I.

2.3. Content and Results
The workshop overall was a success, attended by over 40 researchers and providing useful contributions in the domains of big data and transactional data stores. In the context of this workshop, LeanBigData researchers had the chance to present their innovations and report the advancements derived from their work on the project use case scenarios.
Figure 4 : Image from RTPBD Workshop
Specifically, three of the workshop papers were directly related to LeanBigData Use Cases.
Paper #7, “The LeanBigData Data Collection Framework: An innovate and adaptable framework for collection and normalization of structured data”, provided a framework for collection and normalization of structured data together with additional features (pre-processing), employed in the context of the LeanBigData Financial Analytics Use Case. Paper #14, “Sentiment Analysis over politics-related big Twitter datasets”, presented a novel sentiment analysis model for Social Media posts (Twitter) using the technique of n-gram graphs. This paper reflected the work done in the context of the Social Media Analytics Use Case, evaluated over a big dataset of tweets posted during the pre-election period in Greece in September 2015.
Finally, Paper #13 provided a Targeted Advertisement case-study as a result of the LeanBigData Targeted Advertisement Use Case, using an AdServer created by PT relying on in-house developed tools for handling the high-throughput stream of data and dealing with analysis and visualisation. Results showed that the current LeanBigData platform outperforms PT’s solution in terms of throughput, although the response time using the LeanBigData platform is still below PT’s results. The LeanBigData roadmap includes several improvements to the key-value data store and SQL engine which are expected to positively impact these results.
3. POLYGLOT DATABASES Workshop
This chapter provides a full report of the Third Workshop’s information, agenda and results. The POLYGLOT DATABASES Workshop was hosted in New Delhi, India on September 9, 2016, in conjunction with the 42nd International Conference on Very Large Data Bases (VLDB 2016). As mentioned above, the workshop had to be merged with the second Workshop on Big Data Open Source Systems (BOSS 2016), so a joint workshop was held in the context of VLDB, where LeanBigData was allocated a plenary session within the BOSS workshop. BOSS 2016 gave a deep-dive introduction into several active, publicly available, open-source systems. A screenshot of the workshop website can be seen in Figure 5:
Figure 5: BOSS website main page
3.1. Workshop Committee
The Program Committee of the POLYGLOT DATABASES workshop is listed in Table 3, and it includes the coordinator of LeanBigData:
Name Affiliation
Prof. Marta Patiño-Martinez UPM, Spain
Dr. Patrick Valduriez INRIA, France
Table 3: POLYGLOT DATABASES Program Committee
3.2. Website and Agenda
POLYGLOT DATABASES workshop was disseminated through a dedicated web page:
http://lsd.ls.fi.upm.es/polyglot2016
A screenshot of this page can be observed in the following Figure 6:
Figure 6: POLYGLOT DATABASES Workshop webpage
During the workshop, four technical presentations were given, in four slots (“tasks”) spread over the day, as can be observed in the workshop agenda:

No. | Slot | Paper Title | Authors
1 | Sep 9, Task 1 (12:00 - 12:30) | Big Data processing using Polybase | Karthik Ramachandra (Microsoft Gray Systems Lab)
2 | Sep 9, Task 2 (14:00 - 14:30) | Multistore Systems: Retrospection on CloudMdsQL | Jose Pereira (Univ. do Minho & INESC)
3 | Sep 9, Task 3 (14:30 - 15:00) | Exploiting the data center in contemporary commodity boxes: The scaling-in approach | Jignesh Patel (Univ. of Wisconsin-Madison)
4 | Sep 9, Task 4 (15:00 - 15:30) | LeanBigData: Blending OLTP and OLAP to Deliver Real-Time Analytical Queries | Ricardo Jimenez-Peris (LeanXcale)

Table 4: POLYGLOT DATABASES Workshop Agenda

A list of all POLYGLOT DATABASES paper abstracts can be found in Annex II.
3.3. Content and Results
The joint workshop was attended by 50+ top researchers interested in polyglot systems and large databases. High-profile participants presented their work, such as Jignesh Patel, from David DeWitt’s group, and Karthik Ramachandra, from Microsoft’s database group.
Presentation #4, “Blending OLTP and OLAP to Deliver Real-Time Analytical Queries”, delivered by Ricardo Jimenez-Peris, presented the main outcomes and advancements achieved by LeanBigData: an elastic parallel-distributed query engine with combined OLTP and OLAP support, efficient use of hardware resources, and a highly scalable and efficient CEP engine.
4. Summary
This deliverable reports on the follow-up workshops organised by LeanBigData, after the success of the 1st Public Workshop.
The 1st LeanBigData joint Workshop, DataDiversityConvergence 2016 [7], took place in Rome, Italy on 24 April 2016, with 11 papers presented in a full-day workshop. The 2nd Public Workshop, RTPBD 2016, took place in Heraklion, Greece on June 9, 2016; 18 papers were presented in a full-day workshop, following and extending the results of the 1st Workshop.
Finally, a 3rd joint Workshop “POLYGLOT DATABASES” was organized in New Delhi, India on September 9, 2016, as a session of the BOSS 2016 workshop of VLDB 2016 conference, with 4 relevant presentations.
The aforementioned workshops illustrated the work done in the LeanBigData project and the validation of its Use Case scenarios. In the context of the papers produced, important open issues in big data analytics were addressed, such as scalability and efficiency, and innovative big data management technologies for processing streaming events and different workloads over stored data were provided. Research work performed on normalization of structured data and sentiment analysis explored new challenges that emerged during the project lifetime, leading to useful outcomes in the area of big data management and visualization.
5. References
[1] RTPBD 2016, http://2016.discotec.org/index.php?MG=60&Mid=34&sub1=1
[2] DISCOTEC 2016, http://2016.discotec.org/
[3] CoherentPaaS project, funded by the European Union's 7th Framework Programme for research, technological development and demonstration under grant agreement no. 611068
[4] Polyglot Databases Session, http://lsd.ls.fi.upm.es/polyglot2016
[5] BOSS 2016, http://boss.dima.tu-berlin.de/
[6] VLDB 2016, http://vldb2016.persistent.com/
[7] DataDiversityConvergence 2016, http://closer.scitevents.org/DataDiversityConvergence.aspx?y=2016
Annex I. RTPBD Workshop Paper Abstracts
In Table 5, a list of the RTPBD paper titles, abstracts and authors is reported.
Paper #
Publication Title
Abstract Author(s)
1 CoherentPaaS
Vision
There has been a blooming of new data stores
addressing new challenges in data management of
structured, semi-structured and unstructured data in
the last years. In the semi-structured segment, many
different kinds of NoSQL data stores have emerged
proposing new data models and associated query
languages and/or APIs appropriate for them, such as
document-oriented data stores, key-value data stores
and graph databases. On the structured data world,
new technologies have emerged such as NewSQL,
columnar data warehouses or in-memory databases.
This adoption of new data stores store has led to a
proliferation of data stores at organizations raising
interesting challenges. On one hand, NoSQL data
stores were born in a world where scalability was
considered a key attribute and many of them attained
this scalability by dismissing the transactional
properties, since it was the main bottleneck in data
management technology preventing from achieving
scalability. This design decision has enabled many
NoSQL data stores to achieve different degrees of
scalability but at the cost of losing consistency in the
advent of failures and/or concurrent access. On the
other hand, having multiple data stores results in the
so-called polyglot persistent environments that create
two interesting problems. The first one happens when
a business action requires updating data across
multiple data stores, which lacks the transactional
consistency provided by OLTP databases, resulting
in a loss of consistency when conflicting accesses
happen or a failure occurs at one or more nodes. The
second one is that exploiting the data in a polyglot
persistence environment is an extremely hard task
unreachable for most organizations. The reason is
that different data stores provide different query
languages or APIs. This force to program a query
engine at the application level that is obviously not
feasible for most organizations or to move all the
data to a data warehouse forcing to put a relational
schema to data that is schema-less such as the one
stored on NoSQL data stores. It turns out that the
motivation to introduce data stores is precisely to
prevent to put under a relational schema data not
suited for such model and requiring the flexibility of
other data models such as documents, key-value or
graphs. CoherentPaaS comes to solve the three above
problems. On one hand, it solves the pains related to
updates across different data sores by providing a
Ricardo Jiménez Peris
D9.2.3
© LeanBigData consortium Page 19 of 26
holistic transactional manager that enriches NoSQL
and other transaction-less data stores with
transactional semantics without affecting their
scalability and to execute holistic or global
transactions across any subset of data stores
integrated into the CoherentPaaS platform. On the
other hand, it solves the problem related to queries in
polyglot persistence environments by enabling the
combination of SQL with the native query languages
and naive APIs to make queries across data stores
with arbitrary data models and their query processing
framework.
2 LeanBigData
Vision
LeanBigData targets at building an ultra-scalable and
ultra-efficient integrated big data platform addressing
important open issues in big data analytics. Current
big data infrastructure scale to large amounts of data
and system sizes, however, in a very inefficient way
consuming disproportionally high resources per data
item processed. Furthermore, the lack of integrated
big data management technologies to process
streaming events and different workloads over stored
data results in the complexity to integrate disparate
big data systems and the overhead of copying data
across systems. What is more, data analysis cycles to
refine queries and identify facts of interest take
hours, days, or weeks, whereas business processes
demand today shorter cycles. LeanBigData is
addressing these issues by delivering ultra-scalable
big data management systems such as NoSQL key-
value data store, a distributed CEP system, and a
distributed SQL query engine, by providing an
integrated big data platform to avoid the
inefficiencies and delays introduced by current ETL-
based integration approaches of disparate
technologies, and by supporting an end-to-end big
data analytics solution removing the main sources of
delays
Ricardo Jiménez Peris
3 ActivePivot
improvements
from
CoherentPaaS
Concurrency control is one of the most important and
performance- critical feature of a database. Up to its
version 4, ActivePivot was using a simple read-write
lock to enforce mutual exclusion between queries
and transactions. While functionally correct this
technique causes various performance issues, the
most visible one being that long-running queries
prevent new data from being inserted in the database.
In this paper, we will see how multi-version
concurrency control has been implemented in
ActivePivot to solve these problems. We will also
show how this new mechanism has been leveraged to
implement both as-of and what-if analysis, two new
defining features of ActivePivot.
Francois Sabary
D9.2.3
© LeanBigData consortium Page 20 of 26
4 CQE: A
middleware to
execute
queries
accross
heterogeneous
databases
The data management world has been evolving
towards a large diversity of databases. This blooming
of data management technologies, where each
technology is specialized and optimal for specific
processing, has led to a no one size fits all situation.
Consequently, software applications usually need to
use different database technologies simultaneously.
In this situation, software developers need to deal
with several data integration issues because these
cannot apply SQL queries across databases. In order
to solve this gap, we present a middleware called
Common Query Engine (CQE) based on an open
source technology called Apache Derby to execute
SQL-like queries to integrate results from different
data management technologies.
Raquel Pau
5 Prepared
Scan:
Efficient
Retrieval of
Structured
Data from
HBase
NoSQL systems are getting more popular and, for
this reason, they are constantly the target of
performance evaluation and optimization. The ability
of these systems to scale better than traditional
relational databases, so much for being schema-less,
motivates a large set of applications to migrate their
data to these systems, even without the intention to
exploit schema flexibility provided by NoSQL
systems. However, accessing structured data in
schema-less systems is costly due to provided
flexibility, incurring in an increase of bandwidth
usage and thus overall performance degradation.
In this paper, we analyse this cost in Apache HBase,
a distributed, opensource and non-relational
database. We propose a new operation named
Prepared Scan that optimizes access to data which
follows a well-defined and regular schema by taking
advantage of this in the context of applications.
We evaluate its performance using an industry
standard benchmark for both non-relational and
relational databases. The evaluation shows that
Prepared Scan improves throughput up to 25% while
reducing network bandwidth consumption up to
20%.
Francisco Neves, José
Pereira, Ricardo
Vilaça and Rui
Oliveira
6 Big Data
Stream
Clustering
Algorithms
Empirical
Evaluation
The nature of the data passing through data-driven
organizations
is changing dramatically. Despite clear technological
advances, analyzing big data and extracting valuable
knowledge is still a great challenge. Big data needs
big storage and highly frequent volumes of data
streams make operations such as analytical
operations, process operations, retrieval operations
real difficult and hugely time consuming. Due to
evolving nature of the data, unsupervised methods
are recommended. Big data summarization requires
lesser storage and extremely shorter time to get
processed and retrieved. Stream clustering is an
efficient strategy against mining of evolving big data.
In this article, we evaluated most popular stream
Annie Ibrahim Rana ,
Giovani Estrada, and
Marc Solé
D9.2.3
© LeanBigData consortium Page 21 of 26
clustering techniques using the simulated data and
the data collected from our test-bed, and presented
our evaluation results.
7 The
LeanBigData
Data
Collection
Framework
An innovate
and adaptable
framework for
collection and
normalization
of structured
data
The data collection for eventual analysis is an old
concept that today receives a revisited interest due to
the emerging of new research trend such Big Data.
Furthermore, considering that a current market trend
is to provide integrated solution to achieve multiple
purposes (such as ISOC, SIEM, CEP, etc.), the data
became very heterogeneous. In this paper is
presented an innovative and adaptable framework for
collection and normalization of structured data,
describing the approach used to collect structured
data and the additional features (pre-processing)
provided with it.
Luigi Coppolino ,
Luigi Sgaglione ,
Gaetano Papale , and
Ferdinando Campanile
8. Detecting Performance Degradation with System Level Metrics
Abstract: With recent consolidation trends, it is today typical to run multiple applications on servers within datacenters. Due to the diversity of applications and the lack of generic models for understanding application behavior, data center (DC) operators and providers cannot easily infer how well an application is behaving with respect to user expectations. Therefore, they tend to resort to application-level metrics that are not always easy to obtain via instrumentation. In this work we examine how system-level metrics can be used to infer application behavior, and specifically performance degradation, when mixed workloads are deployed on consolidated servers. We introduce a lightweight and inexpensive monitoring framework that detects performance degradation of unknown applications using only system-level metrics. We examine more than 700 hardware- and software-level metrics, identify correlations with various application classes and, using this analysis, reduce them to 24 useful metrics. We then use automated profiling to establish acceptable baselines for applications, and we monitor applications dynamically during execution to identify any significant degradation. We evaluate our approach on ten benchmarks of different categories (CPU-, memory-, I/O-, and network-intensive applications) and find that it is able to identify changes in application behavior without any specific knowledge about the applications.
Author(s): Dimitris Ganosis, Yannis Sfakianakis, Manolis Marazakis, and Angelos Bilas
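The baseline-and-monitor idea summarized in this abstract can be illustrated with a minimal sketch. The metric names, sample values, and the simple z-score threshold below are hypothetical assumptions for illustration only, not the framework described in the paper:

```python
import statistics

def build_baseline(samples):
    """Profile a run to get per-metric mean and stdev (the 'acceptable baseline')."""
    return {metric: (statistics.mean(values), statistics.stdev(values))
            for metric, values in samples.items()}

def detect_degradation(baseline, live, threshold=3.0):
    """Flag metrics whose live value deviates from the baseline mean by more
    than `threshold` standard deviations."""
    flagged = []
    for metric, value in live.items():
        mean, stdev = baseline[metric]
        if stdev > 0 and abs(value - mean) / stdev > threshold:
            flagged.append(metric)
    return flagged

# Hypothetical system-level metrics sampled during an isolated profiling run.
profile = {
    "cpu_util": [0.42, 0.45, 0.43, 0.44, 0.41],
    "disk_io_wait_ms": [1.1, 1.3, 1.2, 1.0, 1.2],
}
baseline = build_baseline(profile)
# A live sample taken while a co-located workload interferes with disk I/O.
print(detect_degradation(baseline, {"cpu_util": 0.44, "disk_io_wait_ms": 9.5}))
# → ['disk_io_wait_ms']
```

The key property, as in the paper, is that nothing here requires knowledge of the application itself: only system-level counters are observed.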
9. Transactional Support for Cloud Data Stores
Abstract: In the last decade, an exponential explosion of user-generated data over the Internet has been observed. Traditional data management solutions such as relational databases are simply not able to process this large amount of data in a reasonable time. A new breed of highly scalable data management tools emerged: the cloud data stores. These technologies are able to process petabytes of data, but with an important trade-off: the lack of transactional consistency. The scenario becomes even more complex for those applications whose building blocks sit on top of a hybrid data store ecosystem. This work presents a novel protocol that provides transaction semantics on top of heterogeneous data stores, transparently to applications.
Author(s): Ricardo Jiménez Peris, Marta Patiño Martínez, and Ivan Brondino
10. Satisfying Telecom and IoT big data application requirements using multiple data stores in a coherent way
Abstract: As services move onto the cloud and dataset volumes explode, data management becomes extremely demanding. Multiple data store technologies have emerged (NoSQL, in-memory, columnar) to address specific needs. When all these needs are brought together in an application, big data solutions should respond efficiently. The CoherentPaaS project addresses these issues by providing a rich PaaS with a diversity of data stores and data management technologies optimized for particular tasks and workloads. CoherentPaaS integrates data stores and complex event processing systems with holistic transactional coherence and a common query language that enables data correlation across stores. To assess project results, two use cases are implemented, both addressing different needs of telecoms: a price simulation application and a platform to deploy IoT services such as vehicle monitoring. The combination of different data stores and their integration with real-time stream queries is the focus of development. Benefits and achievements are measured, assessed, and reported.
Author(s): Vassilis Spitadakis, Dimitrios Bouras, Yiorgos Panagiotakis, and Apostolos Hatzimanikatis
11. CoherentPaaS - Real-Time Network Performance Analysis in a Telco Environment Use Case
Abstract: In this paper we present the Real-Time Network Performance Analysis in a Telco Environment use case of the CoherentPaaS project. It is based on the Altice Labs product Altaia, which aims to detect network problems before any degradation or unavailability of services occurs, by actively supervising the network. However, monitoring the whole network implies analyzing large amounts of data in real time, and the current solution does not provide the required degree of scalability. The plan is to use CoherentPaaS to provide a rich Platform-as-a-Service that supports several data stores, accessible via a uniform programming model and language, while complying with demanding delay, throughput, and data volume requirements. All the required transformations are presented, as well as the expected associated benefits.
Author(s): Luis Cortesão and Diogo Regateiro
12. A Taxonomy of Multistore Systems
Abstract: Building cloud data-intensive applications often requires using multiple data stores (NoSQL, HDFS, RDBMS, etc.). However, the wide diversification of data store interfaces makes it difficult to integrate data from multiple data stores. This problem has motivated the design of a new generation of systems, called multistore systems, which provide integrated or transparent access to a number of cloud data stores through one or more query languages. In this paper, we give a taxonomy of multistore systems based on their architecture, data model, query languages, and query processing techniques. To ease comparison, we divide multistore systems based on their level of coupling with the underlying data stores, i.e. loosely coupled, tightly coupled, and hybrid.
Author(s): Carlyna Bondiombouy and Patrick Valduriez
13. Targeted Advertisement case-study: a LeanBigData benchmark
Abstract: In this paper we present the Targeted Advertisement case study, a big data case study of the LeanBigData project. PT (Portugal Telecom) sells multi-platform ads online covering the whole spectrum of web, mobile, and TV, allowing advertisers to define their own campaigns, set their campaign goals and budget, and choose their paid words, as well as many other constraints, including the geographic and demographic profile of the targeted customers. To reliably provide efficient, contextualized, and targeted advertisements to end users, the current architecture of the PT AdServer relies on in-house developed tools for handling the high-throughput stream of data and for performing analysis and visualization. We present a benchmark study performed on the LeanBigData platform, tested against real PT needs in terms of data throughput and scalability.
Author(s): Jorge Teixeira, Miguel Biscaia, Iván Brondino, and Mario Moreira
14. Sentiment Analysis over politics-related big datasets
Abstract: In this paper we propose a novel sentiment analysis technique for Twitter messages related to political debates, using the technique of n-gram graphs in the context of a big data analysis platform (LeanBigData). To this end, we make use of a large dataset of tweets posted during the pre-election period in Greece in September 2015. Tweets are divided among three sentiment classes, according to specific recognized patterns (emoticons), in order to create the respective n-gram graphs and calculate node similarities, so as to train multiple classifiers and compare their behavior. We also compare the performance of various subsets to determine the optimal patterns for aggregating tweets for a concrete analysis. Classification experiments provided promising results compared to the real sentiment of the test dataset, as described in the last section of the paper.
Author(s): Fotis Aisopos, Vrettos Moulos, John Violos, Theodora Varvarigou, Pavlos Kranas, Sotiris Stamokostas, Dimos Kyriazis, Andreas Menychtas, Kleopatra Konstanteli, George Vafiadis, Athanasia Evangelinou, Christina Santzaridou, Vasileios Anagnostopoulos, Anna Gatzioura, and Nikiforos Makrinakis
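The n-gram graph classification idea can be sketched on toy data. The training texts, graph parameters, and containment similarity below are simplified illustrative assumptions, not the paper's implementation: a character n-gram graph is built per sentiment class from labelled texts, and a new text is assigned to the class whose graph contains most of its edges.

```python
from collections import Counter

def ngram_graph(text, n=3, window=2):
    """Character n-gram graph: edges connect n-grams that co-occur within
    `window` positions of each other."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, g in enumerate(grams):
        for j in range(i + 1, min(i + 1 + window, len(grams))):
            edges[(g, grams[j])] += 1
    return edges

def containment_similarity(graph, class_graph):
    """Fraction of the text graph's edges that also appear in the class graph."""
    if not graph:
        return 0.0
    return sum(1 for e in graph if e in class_graph) / len(graph)

def classify(text, class_graphs):
    g = ngram_graph(text)
    return max(class_graphs, key=lambda c: containment_similarity(g, class_graphs[c]))

# Toy labelled texts standing in for emoticon-labelled tweets (hypothetical data).
training = {
    "positive": ["great debate tonight", "great speech, well done"],
    "negative": ["terrible debate tonight", "what a terrible speech"],
}
class_graphs = {label: sum((ngram_graph(t) for t in texts), Counter())
                for label, texts in training.items()}
print(classify("a great night", class_graphs))  # → positive
```

The paper trains multiple classifiers on such graph similarities; this sketch collapses that step into a single nearest-class decision for brevity.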
15. STREAM-OPS: a Streaming Operator Library
Abstract: Nowadays, applications consume huge amounts of live data with real-time processing requirements. Complex Event Processing (CEP) represents a promising technology that allows these applications to process large amounts of information in real time. Some of the available CEP systems, like Apache Storm, provide a programmatic model for the definition of the continuous queries used to process data on the fly. This paper presents STREAM-OPS, a library of streaming operators written in Java that is designed to ease the process of streaming query definition. In the paper we also present an evaluation of the STREAM-OPS library when integrated into two CEP systems developed in the context of the CoherentPaaS and LeanBigData European projects.
Author(s): Ricardo Jiménez-Peris, Valerio Vianello, and Marta Patiño-Martínez
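STREAM-OPS itself is a Java library; as a language-neutral illustration of how a library of composable streaming operators eases continuous-query definition, consider this minimal sketch (the operator names and pipeline API are hypothetical, not the STREAM-OPS interface):

```python
from collections import deque

# Each operator turns one iterator of events into another, so operators
# compose into a continuous-query pipeline.
def op_filter(pred):
    def run(events):
        return (e for e in events if pred(e))
    return run

def op_map(fn):
    def run(events):
        return (fn(e) for e in events)
    return run

def op_window_avg(size):
    """Sliding average over the last `size` events."""
    def run(events):
        win = deque(maxlen=size)
        for e in events:
            win.append(e)
            yield sum(win) / len(win)
    return run

def pipeline(*ops):
    def run(events):
        for op in ops:
            events = op(events)
        return events
    return run

query = pipeline(
    op_filter(lambda e: e >= 0),      # drop invalid readings
    op_map(lambda e: e * 1.8 + 32),   # Celsius → Fahrenheit
    op_window_avg(2),                 # smooth over a 2-event window
)
print(list(query([10, -1, 20])))      # → [50.0, 59.0]
```

Declaring a query as a chain of reusable operators, rather than hand-writing the per-event logic of a CEP topology, is the convenience such a library provides.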
16. Transactional MongoDB
Abstract: The wide adoption of NoSQL data stores, and among them document data stores, owing to their rich functionality and performance, has led to the need for them to also provide transactional semantics. MongoDB, a very popular document data store, requires the application developer to implement his own two-phase commit protocol whenever the application needs to ensure ACID properties. This paper presents an extension of the official MongoDB client that provides transactional semantics and Snapshot Isolation.
Author(s): Pavlos Kranas, Sotiris Stamokostas, George Vafiadis, Athanasia Evangelinou, Christina Santzaridou, Alexandros Psychas, Georgios Palaiokrassas, Achilleas Marinakis, Fotis Aisopos, and Vrettos Moulos
17. Distributed Processing and Transaction Replication in MonetDB - Towards a Scalable Analytical Database System in the Cloud
Abstract: Thanks to its flexibility (i.e., new computing jobs can be set up in minutes, without having to wait for hardware procurement) and elasticity (i.e., more or fewer resources can be allocated to instantly match the current workload), cloud computing has rapidly gained much interest from both academic and commercial users. Moving into the cloud is a clear trend in software development. To provide its users a fast, in-memory optimized analytical database system with all the conveniences of the cloud environment, we embarked upon extending the open-source column-store database MonetDB with new features to make it cloud-ready. In the paper, we elaborate on the new distributed and replicated transaction features in MonetDB. The distributed query processing feature allows MonetDB to horizontally scale out to multiple machines, while the transaction replication schemes increase the availability of the MonetDB database servers.
Author(s): Ying Zhang, Dimitar Nedev, Panagiotis Koutsourakis, and Martin Kersten
18. Gestural Interaction with Data Center 3D Visualizations
Abstract: This paper reports on ongoing work regarding gestural interaction with 3D visualizations of large-scale data centers in the context of data center infrastructure management. The visualization renders a virtual area of real data centers, preserving the actual arrangement of their servers, and displays their current state, including several condition indicators updated in real time, as well as a color-coding scheme for the servers' current condition on a scale from normal to critical. This paper focuses on the interaction requirements for exploring the 3D visualization. To this end, the use of gestures as a means of natural interaction is suggested, in order to provide intuitive, natural, and rich interaction with the demanding environment of a data center.
Author(s): Giannis Drossis, Chryssi Birliraki, Nikolaos Patsiouras, George Margetis, and Constantine Stephanidis
Table 5: RTPBD Workshop Paper Titles and Abstracts
Annex II. POLYGLOT DATABASES Workshop Paper Abstracts
In Table 6, the titles, abstracts, and authors of the POLYGLOT DATABASES papers are reported.
Paper # | Publication Title | Abstract | Author(s)
1. Big Data processing using PolyBase
Abstract: To make good decisions, business decision makers need to analyze both relational data and other data that is not structured into tables, notably data stored in Hadoop and other similar Big Data systems. This is difficult to do unless there exists an efficient way to process queries that access data across these different types of data stores. PolyBase bridges this gap by operating on data that is external to Microsoft SQL Server. PolyBase is a technology that accesses and combines both non-relational and relational data, all from within SQL Server. It allows queries on external data in Hadoop or Azure blob storage, and the queries are optimized to push computation to Hadoop when beneficial. The talk will give an overview of PolyBase and describe its architecture in SQL Server 2016. Some of the key technical challenges and design approaches will also be discussed.
Author(s): Karthik Ramachandra
2. Multistore Systems: Retrospection on CloudMdsQL
Abstract: The blooming of different cloud data management infrastructures has turned multistore systems into a major topic in today's cloud landscape. In this paper, we give an overview of the Cloud Multidatastore Query Language (CloudMdsQL) and the implementation of its query engine. CloudMdsQL is a functional SQL-like language capable of querying multiple heterogeneous data stores (relational, NoSQL, HDFS) within a single query that can contain embedded invocations to each data store's native query interface. The major innovation is that a CloudMdsQL query can exploit the full power of local data stores by simply allowing some local data store native queries (e.g., a breadth-first search query against a graph database) to be called as functions, and at the same time be optimized, e.g., by pushing down select predicates, using bind joins, performing join ordering, or planning intermediate data shipping.
Author(s): Jose Pereira
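The bind join mentioned in this abstract can be illustrated with a toy mediator. The store functions, data, and names below are hypothetical stand-ins, not the CloudMdsQL engine: one store's native query runs first, and only its matching keys are shipped to (bound into) the other store's native call.

```python
# Toy two-store mediator in the spirit of a multistore query engine.
def relational_store(min_age):
    # Stands in for a pushed-down SQL predicate:
    # SELECT id, name FROM users WHERE age >= min_age
    rows = [(1, "ana", 34), (2, "bob", 19), (3, "eve", 40)]
    return [(rid, name) for rid, name, age in rows if age >= min_age]

def graph_store(ids):
    # Stands in for a native graph traversal invoked as a function,
    # bound to the ids produced by the relational side.
    friends = {1: ["bob"], 3: ["ana", "bob"]}
    return {i: friends.get(i, []) for i in ids}

def bind_join(min_age):
    left = relational_store(min_age)               # evaluate left side first
    right = graph_store([rid for rid, _ in left])  # ship only matching keys
    return [(name, right[rid]) for rid, name in left]

print(bind_join(30))  # → [('ana', ['bob']), ('eve', ['ana', 'bob'])]
```

Shipping only the bound keys, instead of materializing the whole graph store, is exactly the intermediate-data-shipping saving the abstract refers to.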
3. Exploiting the data center in contemporary commodity boxes: The scaling-in approach
Abstract: Modern servers pack enough storage and computing power that just a decade ago was spread across a modest-sized cluster. In addition, we are on a technological roadmap in which the storage and compute densities of individual server nodes will continue to increase at a faster rate than the networks that connect the nodes. Thus, we must complement methods that focus on "scaling out" by also developing methods to "scale in", to fully exploit the hardware capabilities packed into each server node. This is especially true for an important class of real-time in-memory analytic data applications. The recent Apache-incubated Quickstep project focuses on this scaling-in aspect. Quickstep uses novel methods for organizing data (including columnar and hybrid storage organization), template metaprogramming for vectorized query execution, and a query execution paradigm that separates control flow from data flow. Collectively, these methods produce more than an order-of-magnitude performance improvement over many existing open-source platforms.
Author(s): Jignesh Patel
4. Blending OLTP and OLAP to Deliver Real-Time Analytical Queries
Abstract: Traditionally, OLTP and OLAP workloads have been served by different kinds of database systems: transactional databases and data warehouses. This separation has resulted in having to organize a process that copies the data from the operational database into the data warehouse, known as extract-transform-load (ETL). This process is estimated to account for 80% of the budget of doing business analytics. LeanXcale is a NewSQL database that scales transactional processing linearly to hundreds of nodes, thereby providing an ultra-scalable OLTP database. Thanks to its ability to scale the OLTP engine as much as needed, an OLAP engine has been built that works directly over the operational data, delivering real-time analytical queries.
Author(s): Ricardo Jimenez-Peris
Table 6: POLYGLOT DATABASES Workshop Paper Titles and Abstracts