© LeanBigData Consortium Page 1 of 26
Project Acronym: LeanBigData
Project Title: Ultra-Scalable and Ultra-Efficient Integrated and Visual Big Data Analytics
Project Number: 619606
Instrument: STREP
Call Identifier: ICT-2013-11
D9.4 Second Workshop Report
Work Package: WP9 – Exploitation, Industrial Awareness, Dissemination
Due Date: 31/01/2017
Submission Date: 31/01/2017
Start Date of Project: 01/02/2014
Duration of Project: 36 Months
Organisation Responsible for Deliverable: ICCS
Version: 1.0
Status: Final
Author(s): Fotis Aisopos (ICCS)
Contributing partners: CA, SyncLab, PT, UPM, LeanXcale, FORTH, Intel, INESC, Atos
Reviewer(s): Marta Patiño
Nature: R – Report | P – Prototype | D – Demonstrator | O – Other
Dissemination level: PU – Public | CO – Confidential, only for members of the consortium (including the Commission Services) | RE – Restricted to a group specified by the consortium (including the Commission Services)
Project co-funded by the European Commission within the Seventh Framework Programme (2007-2013)
Revision history

Version | Date | Modified by | Comments
0.1 | 06/12/2016 | Fotis Aisopos (ICCS/NTUA) | ToC
0.2 | 15/12/2016 | Fotis Aisopos (ICCS/NTUA) | First Draft Version
0.3 | 23/12/2016 | Vrettos Moulos (ICCS/NTUA) | Full Version
0.4 | 26/01/2017 | Marta Patiño (UPM) | Introduction Review
1.0 | 30/01/2017 | Fotis Aisopos (ICCS/NTUA) | Final Version
Executive Summary
LeanBigData targets the integration of polyglot persistence and the blending of the OLTP and OLAP worlds. Task 9.3 (“Public Workshops”) aimed at the organisation of two open workshops for the key players in this domain. This is the second deliverable of the task, reporting on the second public workshop (RTPBD 2016 [1]), held in Heraklion, Greece, in conjunction with DISCOTEC 2016 [2]. It was a joint workshop with another Big Data project, the CoherentPaaS EU project [3]. RTPBD presented and demonstrated both projects’ results to a wider public.
The consortium also decided to co-organise a third public workshop (POLYGLOT DATABASES), held in conjunction with the VLDB conference in New Delhi, India, which is also reported in this deliverable. The POLYGLOT DATABASES workshop [4] was allocated a session within BOSS’16 (Big Data Open Source Systems) [5]: both workshops had been accepted at VLDB [6] but, due to space constraints, accepted workshops had to be merged, so they were joined into a single workshop. VLDB is the most prestigious database-related scientific conference worldwide, so the joint workshop that LeanBigData was part of was a great opportunity to highlight the project results to an expert audience.
Table of Contents
Executive Summary ...................................................................................................... 3
Abbreviations and acronyms ....................................................................................... 6
1. Introduction ............................................................................................................ 7
1.1. Relation with other deliverables ........................................................................ 8
2. Second Project Workshop – RTPBD..................................................................... 9
2.1. Workshop Committee ........................................................................................ 9
2.2. Website, Important Dates and Program ............................................................ 9
2.3. Content and Results ........................................................................................ 11
3. POLYGLOT DATABASES Workshop .................................................................. 13
3.1. Workshop Committee ...................................................................................... 13
3.2. Website and Agenda ....................................................................................... 13
3.3. Content and Results ........................................................................................ 15
4. Summary ............................................................................................................... 16
5. References ............................................................................................................ 17
Annex I. RTPBD Workshop Paper Abstracts ......................................................... 18
Annex II. POLYGLOT DATABASES Workshop Paper Abstracts ............................. 25
Index of Figures
Figure 1: LeanBigData components ............................................................................................... 8
Figure 2: DISCOTEC poster .......................................................................................................... 9
Figure 3: RTPBD Workshop page in DISCOTEC website .......................................................... 10
Figure 4: Image from RTPBD Workshop .................................................................................... 12
Figure 5: BOSS website main page ............................................................................................... 13
Figure 6: POLYGLOT DATABASES Workshop webpage ......................................................... 14
Index of Tables
Table 1: RTPBD Program Committee ............................................................................................ 9
Table 2: RTPBD Workshop Agenda ............................................................................................ 11
Table 3: POLYGLOT DATABASES Program Committee ......................................................... 13
Table 4: POLYGLOT DATABASES Workshop Agenda ............................................................ 14
Table 5: RTPBD Workshop Paper Titles and Abstracts ............................................................... 24
Table 6: POLYGLOT DATABASES Workshop Paper Titles and Abstracts .............................. 26
Abbreviations and acronyms
BOSS Big Data Open Source Systems
CEP Complex Event Processing
CoherentPaaS Coherent and Rich PaaS with a Common Programming Model
DoW Description of Work
EU European Union
FP Framework Programme
OLAP On-line Analytical Processing
OLTP On-line Transaction Processing
PT Portugal Telecom
RTPBD integRation of polygloT Persistence and BlenDing
VLDB Very Large Data Bases
WP Work Package
1. Introduction
LeanBigData is an ultra-scalable and ultra-efficient big data platform integrating in one product the three main big data technologies: a novel transactional NoSQL key-value data store, a distributed complex event processing (CEP) system, and a distributed SQL database. The platform is designed to achieve scalability in a very efficient way avoiding the inefficiencies and delays introduced by current Extract-Transform-Load-based (ETL) approaches. Currently, one of the main issues in data management at enterprises and other organizations is the fact that databases are either operational (OLTP-OnLine Transactional Processing) or analytical (OLAP-OnLine Analytical Processing). This leads to a separation of the management of the operational data performed at operational databases, and the management of analytical queries performed at analytical databases or data warehouses. This separation results in having to copy the data periodically from the operational database into the data warehouse. This copy process is termed Extract-Transform-Load (ETL). ETLs are estimated to consume 75-80% of the budget for business analytics.
LeanBigData solves this issue by providing a single database, LeanXcale, that combines both capabilities, operational and analytical.
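The contrast between the ETL cycle described above and a single hybrid (OLTP + OLAP) database can be sketched in a few lines. The sketch below is purely illustrative: it uses Python's built-in sqlite3 as a stand-in for both the operational store and the warehouse, and is not LeanXcale code.

```python
import sqlite3

# Operational (OLTP) store: receives transactional writes.
op = sqlite3.connect(":memory:")
op.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
op.execute("INSERT INTO orders VALUES (1, 10.0), (2, 25.5)")

# Classic ETL cycle: periodically Extract rows from the operational
# store, Transform them, and Load them into a separate warehouse
# table before any analytics can see them.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE orders_fact (id INTEGER, amount REAL)")
rows = op.execute("SELECT id, amount FROM orders").fetchall()   # Extract
dw.executemany("INSERT INTO orders_fact VALUES (?, ?)", rows)   # Load

# Analytics then run on copied, potentially stale data:
total = dw.execute("SELECT SUM(amount) FROM orders_fact").fetchone()[0]

# The hybrid approach: the same analytical query runs directly on the
# operational store, with no copy step and no staleness window.
total_htap = op.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
assert total == total_htap == 35.5
```

With the hybrid approach, the periodic copy step, and the ETL budget it consumes, simply disappears.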
Another aspect in which LeanBigData innovates lies in the efficiency of the transactional processing and the storage engine. The transactional processing has been re-architected and re-implemented to be an order of magnitude more efficient than the initial version at the beginning of the project. A new storage engine, KiVi, has been architected and implemented from scratch. It is based on a new data structure to be efficient both for range queries and updates.
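KiVi's internal data structure is not detailed in this report. As a generic illustration of the requirement it must meet, a single structure serving both in-place updates and ordered range scans, consider the minimal sorted key-value map below. All names are hypothetical and the sketch says nothing about KiVi's actual design:

```python
import bisect

class SortedKV:
    """Toy sorted key-value map: keeping keys ordered makes range
    scans a simple slice, while binary search keeps point lookups
    and updates at O(log n) comparisons."""

    def __init__(self):
        self._keys, self._vals = [], []

    def put(self, key, value):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._vals[i] = value        # in-place update
        else:
            self._keys.insert(i, key)    # insert, preserving order
            self._vals.insert(i, value)

    def range(self, lo, hi):
        """All (key, value) pairs with lo <= key < hi, in key order."""
        i = bisect.bisect_left(self._keys, lo)
        j = bisect.bisect_left(self._keys, hi)
        return list(zip(self._keys[i:j], self._vals[i:j]))

kv = SortedKV()
for k in (5, 1, 3, 9):
    kv.put(k, str(k))
assert kv.range(2, 9) == [(3, "3"), (5, "5")]
```

Real storage engines use disk-aware structures rather than in-memory arrays, but the interface, point updates plus ordered scans, is the same.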
Another main innovation brought by LeanBigData is in the area of data streaming, where the goal has been to produce an efficient, scalable, distributed complex event processing engine.
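As a generic illustration of the kind of continuous query a CEP engine evaluates, and not LeanBigData's actual implementation, the single-node sketch below computes a sliding-window average over an event stream:

```python
from collections import deque

def sliding_window_avg(events, window=3):
    """Emit the average of the last `window` event values each time
    a new event arrives -- a standing query evaluated incrementally
    over an unbounded stream, rather than a one-shot batch query."""
    buf = deque(maxlen=window)   # bounded buffer: old events expire
    out = []
    for value in events:
        buf.append(value)
        out.append(sum(buf) / len(buf))
    return out

assert sliding_window_avg([2, 4, 6, 8]) == [2.0, 3.0, 4.0, 6.0]
```

A distributed CEP engine parallelises such operators across nodes and chains them into dataflows, but each operator follows this incremental, per-event pattern.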
The LeanBigData platform is equipped with a visualization subsystem able to report incremental visualization of results of long analytical queries, and with an advanced anomaly detection and root cause analysis module. The visualization subsystem also supports efficient manipulation of visualizations and query results through hand gestures.
Four use cases have been integrated with the developed infrastructure to demonstrate the value of the LeanBigData platform and validate it.
The main components of the LeanBigData platform are shown in Figure 1.
Figure 1: LeanBigData components.
The project is divided into nine work packages. This deliverable belongs to work package 9 and is the result of task 9.3.
1.1. Relation with other deliverables
The goal of work package 9 is to promote and empower the dissemination, transfer, collaboration, exploitation, assessment, and broad uptake of the LeanBigData project results by the target audience and stakeholders. D9.4 reports on the 2nd Public Project Workshop (“Final Public Workshop from LeanBigData and CoherentPaaS” - RTPBD 2016), organised by the LeanBigData project during M28. It is the second deliverable of Task 9.3 “Public workshops”, following the submission of D9.3, which reported on the 1st Project Workshop.
Moreover, it reports on a 3rd Public Workshop (“POLYGLOT DATABASES”, organised as a separate session of the BOSS 2016 Workshop), co-organised by LeanBigData during M31.
The rest of the document is structured as follows:
• Chapter 2 presents information about the Second Workshop’s committees, the workshop content and the dissemination channels used. It also describes the workshop papers and how they reflect the work done in the context of the project.
• Chapter 3 reports on the Third Workshop in India, its focus and the general outcomes of LeanBigData presented there.
• Chapter 4 presents a summary of the work done in this task.
• Annex I lists the RTPBD Workshop papers’ abstracts and authors.
• Annex II lists the POLYGLOT DATABASES Workshop papers’ abstracts and authors.
2. Second Project Workshop – RTPBD
This chapter provides a full report of the Second Workshop’s information, agenda and results. The RTPBD Workshop was hosted in Heraklion on June 9, 2016, in conjunction with the 11th International Federated Conference on Distributed Computing Techniques (DISCOTEC 2016). The DisCoTec conference series is one of the major events sponsored by the International Federation for Information Processing (IFIP).
Figure 2 : DISCOTEC poster
2.1. Workshop Committee
The Program Committee of the RTPBD workshop is listed in Table 1, and it includes key persons of the LeanBigData consortium:
Name Affiliation
Dr. Ricardo Jimenez-Peris LeanXcale, Spain
Prof. Marta Patiño-Martinez UPM, Spain
Dr. Patrick Valduriez INRIA, France
Table 1: RTPBD Program Committee
2.2. Website, Important Dates and Program
RTPBD workshop was disseminated through a dedicated web page, included in the main event’s (DISCOTEC 2016) website:
http://2016.discotec.org/index.php?MG=60&Mid=34&sub1=1
A screenshot of this page can be observed in the following Figure 3:
Figure 3: RTPBD Workshop page in DISCOTEC website
As defined in the aforementioned page, the important dates set for the Workshop were the following:
• April 17, 2016: Full paper submission
• May 2, 2016: Notification to authors
• June 9, 2016: RTPBD in Heraklion
In total, 18 papers were accepted by the workshop. The workshop chairs, in charge of the reviewing process, ensured that each paper received at least two reviews, following a double-blind process.
Those papers were divided into four sessions, as can be observed in the workshop agenda:

No. | Session | Paper Title | Authors
1 | June 9, Session 1 (09:30 - 10:30) | CoherentPaaS Vision | Ricardo Jiménez Peris
2 | Session 1 | LeanBigData Vision | Ricardo Jiménez Peris
3 | Session 1 | ActivePivot improvements from CoherentPaaS | Francois Sabary
4 | June 9, Session 2 (11:00 - 12:00) | CQE: A middleware to execute queries across heterogeneous databases | Raquel Pau
5 | Session 2 | Prepared Scan: Efficient Retrieval of Structured Data from HBase | Francisco Neves, Ricardo Vilaça, José Pereira and Rui Oliveira
6 | Session 2 | Big Data Stream Clustering Algorithms Empirical Evaluation | Annie Ibrahim Rana, Giovani Estrada and Marc Sole
7 | Session 2 | The LeanBigData Data Collection Framework: An innovate and adaptable framework for collection and normalization of structured data | Luigi Coppolino, Luigi Sgaglione, Gaetano Papale and Ferdinando Campanile
8 | June 9, Session 3 (14:00 - 16:00) | Detecting Performance Degradation with System Level Metrics | Dimitris Ganosis, Yannis Sfakianakis, Manolis Marazakis and Angelos Bilas
9 | Session 3 | Transactional Support for Cloud Data Stores | Ricardo Jiménez Peris, Marta Patiño Martínez, Ivan Brondino
10 | Session 3 | Satisfying Telecom and IoT big data application requirements using multiple data stores in a coherent way | Vassilis Spitadakis, Dimitrios Bouras, Yiorgos Panagiotakis and Apostolos Hatzimanikatis
11 | Session 3 | CoherentPaaS - Real-Time Network Performance Analysis in a Telco Environment Use Case | Luis Cortesão and Diogo Regateiro
12 | Session 3 | A Taxonomy of Multistore Systems | Carlyna Bondiombouy and Patrick Valduriez
13 | Session 3 | Targeted Advertisement case-study: a LeanBigData benchmark | Jorge Teixeira, Miguel Biscaia, Ivan Brondino and Mario Moreira
14 | June 9, Session 4 (16:30 - 18:30) | Sentiment Analysis over politics-related big Twitter datasets | Fotis Aisopos, Vrettos Moulos, Athanasia Evangelinou
15 | Session 4 | STREAM-OPS: a Streaming Operator Library | Ricardo Jiménez Peris, Valerio Vianello, Marta Patiño Martínez
16 | Session 4 | Transactional MongoDB | Pavlos Kranas, Sotiris Stamokostas, George Vafiadis, Athanasia Evangelinou
17 | Session 4 | Distributed Processing and Transaction Replication in MonetDB - Towards a Scalable Analytical Database System in the Cloud | Ying Zhang, Dimitar Nedev, Panagiotis Koutsourakis and Martin Kersten
18 | Session 4 | Gestural Interaction with Data Center 3D Visualizations | Giannis Drossis, Chryssi Birliraki, Nikolaos Patsiouras, George Margetis and Constantine Stephanidis

Table 2: RTPBD Workshop Agenda

A list of all RTPBD paper abstracts can be found in Annex I.

2.3. Content and Results
The workshop overall was a success, attended by over 40 researchers and providing useful contributions in the domains of big data and transactional data stores. In the context of this workshop, LeanBigData researchers had the chance to present their innovations and report the advancements derived from their work on the project use case scenarios.
Figure 4 : Image from RTPBD Workshop
Specifically, three of the workshop papers were directly related to LeanBigData Use Cases.
Paper #7, “The LeanBigData Data Collection Framework: An innovate and adaptable framework for collection and normalization of structured data”, provided a framework for collection and normalization of structured data together with additional features (pre-processing), employed in the context of the LeanBigData Financial Analytics Use Case. Paper #14, “Sentiment Analysis over politics-related big Twitter datasets”, presented a novel sentiment analysis model for Social Media posts (Twitter) using the technique of n-gram graphs. This paper reflected the work done in the context of the Social Media Analytics Use Case, evaluated over a big dataset of tweets posted during the pre-election period in Greece in September 2015.
Finally, Paper #13 provided a Targeted Advertisement case-study as a result of the LeanBigData Targeted Advertisement Use Case, using an AdServer created by PT relying on in-house developed tools for handling the high-throughput stream of data and dealing with analysis and visualisation. Results showed that the current LeanBigData platform outperforms PT’s solution in terms of throughput, although the response time using the LeanBigData platform is still below PT’s results. The LeanBigData roadmap includes several improvements to the key-value data store and SQL engine which are expected to positively impact these results.
3. POLYGLOT DATABASES Workshop
This chapter provides a full report of the Third Workshop’s information, agenda and results. The POLYGLOT DATABASES Workshop was hosted in New Delhi, India on September 9, 2016, in conjunction with the 42nd International Conference on Very Large Data Bases (VLDB 2016). As mentioned above, the workshop had to be merged with the second Workshop on Big Data Open Source Systems (BOSS 2016), so a joint workshop was held in the context of VLDB, where LeanBigData was allocated a plenary session within the BOSS workshop. BOSS 2016 gave a deep-dive introduction into several active, publicly available, open-source systems. A screenshot of the workshop website can be seen in Figure 5:
Figure 5: BOSS website main page
3.1. Workshop Committee
The Program Committee of the POLYGLOT DATABASES workshop is listed in Table 3, and it includes the coordinator of LeanBigData:
Name Affiliation
Prof. Marta Patiño-Martinez UPM, Spain
Dr. Patrick Valduriez INRIA, France
Table 3: POLYGLOT DATABASES Program Committee
3.2. Website and Agenda
POLYGLOT DATABASES workshop was disseminated through a dedicated web page:
http://lsd.ls.fi.upm.es/polyglot2016
A screenshot of this page can be observed in the following Figure 6:
Figure 6: POLYGLOT DATABASES Workshop webpage
During the workshop, four technical presentations were given, in four slots (“tasks”) spread over the day, as can be observed in the workshop agenda:

No. | Slot | Paper Title | Authors
1 | Sep 9, Task 1 (12:00 - 12:30) | Big Data processing using Polybase | Karthik Ramachandra (Microsoft Gray Systems Lab)
2 | Sep 9, Task 2 (14:00 - 14:30) | Multistore Systems: Retrospection on CloudMdsQL | Jose Pereira (Univ. do Minho & INESC)
3 | Sep 9, Task 3 (14:30 - 15:00) | Exploiting the data center in contemporary commodity boxes: The scaling-in approach | Jignesh Patel (Univ. of Wisconsin-Madison)
4 | Sep 9, Task 4 (15:00 - 15:30) | LeanBigData: Blending OLTP and OLAP to Deliver Real-Time Analytical Queries | Ricardo Jimenez-Peris (LeanXcale)

Table 4: POLYGLOT DATABASES Workshop Agenda

A list of all POLYGLOT DATABASES paper abstracts can be found in Annex II.
3.3. Content and Results
The joint workshop was attended by 50+ top researchers interested in polyglot systems and large databases. High-profile participants presented their work, such as Jignesh Patel, from David DeWitt’s group, and Karthik Ramachandra, from Microsoft’s database group.
Presentation #4, “Blending OLTP and OLAP to Deliver Real-Time Analytical Queries”, delivered by Ricardo Jimenez-Peris, presented the main outcomes and advancements achieved by LeanBigData: an elastic parallel-distributed query engine with combined OLTP and OLAP support, efficient use of hardware resources, and a highly scalable and efficient CEP engine.
4. Summary
This deliverable reports on the follow-up workshops organised by LeanBigData, after the success of the 1st Public Workshop.
The 1st LeanBigData joint Workshop, DataDiversityConvergence 2016 [7], took place in Rome, Italy on 24 April 2016, with 11 papers presented in a full-day workshop. The 2nd Public Workshop, RTPBD 2016, took place in Heraklion, Greece on June 9, 2016; 18 papers were presented in a full-day workshop, following and extending the results of the 1st Workshop.
Finally, a 3rd joint Workshop “POLYGLOT DATABASES” was organized in New Delhi, India on September 9, 2016, as a session of the BOSS 2016 workshop of VLDB 2016 conference, with 4 relevant presentations.
The aforementioned workshops illustrated the work done in the LeanBigData project and the validation of its Use Case scenarios. In the context of the papers produced, important open issues in big data analytics were addressed, such as scalability and efficiency, and innovative big data management technologies for processing streaming events and different workloads over stored data were provided. Research work performed on normalization of structured data and sentiment analysis explored new challenges that emerged during the project lifetime, leading to useful outcomes in the area of big data management and visualization.
5. References
[1] RTPBD 2016, http://2016.discotec.org/index.php?MG=60&Mid=34&sub1=1
[2] DISCOTEC 2016, http://2016.discotec.org/
[3] CoherentPaaS project, funded by the European Union's 7th Framework Programme for research, technological development and demonstration under grant agreement no. 611068
[4] Polyglot Databases Session, http://lsd.ls.fi.upm.es/polyglot2016
[5] BOSS 2016, http://boss.dima.tu-berlin.de/
[6] VLDB 2016, http://vldb2016.persistent.com/
[7] DataDiversityConvergence 2016, http://closer.scitevents.org/DataDiversityConvergence.aspx?y=2016
Annex I. RTPBD Workshop Paper Abstracts
In Table 5, a list of the RTPBD paper titles, abstracts and authors is reported.
Paper #
Publication Title
Abstract Author(s)
1 CoherentPaaS
Vision
There has been a blooming of new data stores
addressing new challenges in data management of
structured, semi-structured and unstructured data in
the last years. In the semi-structured segment, many
different kinds of NoSQL data stores have emerged
proposing new data models and associated query
languages and/or APIs appropriate for them, such as
document-oriented data stores, key-value data stores
and graph databases. On the structured data world,
new technologies have emerged such as NewSQL,
columnar data warehouses or in-memory databases.
This adoption of new data stores store has led to a
proliferation of data stores at organizations raising
interesting challenges. On one hand, NoSQL data
stores were born in a world where scalability was
considered a key attribute and many of them attained
this scalability by dismissing the transactional
properties, since it was the main bottleneck in data
management technology preventing from achieving
scalability. This design decision has enabled many
NoSQL data stores to achieve different degrees of
scalability but at the cost of losing consistency in the
advent of failures and/or concurrent access. On the
other hand, having multiple data stores results in the
so-called polyglot persistent environments that create
two interesting problems. The first one happens when
a business action requires updating data across
multiple data stores, which lacks the transactional
consistency provided by OLTP databases, resulting
in a loss of consistency when conflicting accesses
happen or a failure occurs at one or more nodes. The
second one is that exploiting the data in a polyglot
persistence environment is an extremely hard task
unreachable for most organizations. The reason is
that different data stores provide different query
languages or APIs. This force to program a query
engine at the application level that is obviously not
feasible for most organizations or to move all the
data to a data warehouse forcing to put a relational
schema to data that is schema-less such as the one
stored on NoSQL data stores. It turns out that the
motivation to introduce data stores is precisely to
prevent to put under a relational schema data not
suited for such model and requiring the flexibility of
other data models such as documents, key-value or
graphs. CoherentPaaS comes to solve the three above
problems. On one hand, it solves the pains related to
updates across different data sores by providing a
Ricardo Jiménez Peris
D9.2.3
© LeanBigData consortium Page 19 of 26
holistic transactional manager that enriches NoSQL
and other transaction-less data stores with
transactional semantics without affecting their
scalability and to execute holistic or global
transactions across any subset of data stores
integrated into the CoherentPaaS platform. On the
other hand, it solves the problem related to queries in
polyglot persistence environments by enabling the
combination of SQL with the native query languages
and naive APIs to make queries across data stores
with arbitrary data models and their query processing
framework.
2 LeanBigData
Vision
LeanBigData targets at building an ultra-scalable and
ultra-efficient integrated big data platform addressing
important open issues in big data analytics. Current
big data infrastructure scale to large amounts of data
and system sizes, however, in a very inefficient way
consuming disproportionally high resources per data
item processed. Furthermore, the lack of integrated
big data management technologies to process
streaming events and different workloads over stored
data results in the complexity to integrate disparate
big data systems and the overhead of copying data
across systems. What is more, data analysis cycles to
refine queries and identify facts of interest take
hours, days, or weeks, whereas business processes
demand today shorter cycles. LeanBigData is
addressing these issues by delivering ultra-scalable
big data management systems such as NoSQL key-
value data store, a distributed CEP system, and a
distributed SQL query engine, by providing an
integrated big data platform to avoid the
inefficiencies and delays introduced by current ETL-
based integration approaches of disparate
technologies, and by supporting an end-to-end big
data analytics solution removing the main sources of
delays
Ricardo Jiménez Peris
3 ActivePivot
improvements
from
CoherentPaaS
Concurrency control is one of the most important and
performance- critical feature of a database. Up to its
version 4, ActivePivot was using a simple read-write
lock to enforce mutual exclusion between queries
and transactions. While functionally correct this
technique causes various performance issues, the
most visible one being that long-running queries
prevent new data from being inserted in the database.
In this paper, we will see how multi-version
concurrency control has been implemented in
ActivePivot to solve these problems. We will also
show how this new mechanism has been leveraged to
implement both as-of and what-if analysis, two new
defining features of ActivePivot.
Francois Sabary
D9.2.3
© LeanBigData consortium Page 20 of 26
4 CQE: A
middleware to
execute
queries
accross
heterogeneous
databases
The data management world has been evolving
towards a large diversity of databases. This blooming
of data management technologies, where each
technology is specialized and optimal for specific
processing, has led to a no one size fits all situation.
Consequently, software applications usually need to
use different database technologies simultaneously.
In this situation, software developers need to deal
with several data integration issues because these
cannot apply SQL queries across databases. In order
to solve this gap, we present a middleware called
Common Query Engine (CQE) based on an open
source technology called Apache Derby to execute
SQL-like queries to integrate results from different
data management technologies.
Raquel Pau
5 Prepared
Scan:
Efficient
Retrieval of
Structured
Data from
HBase
NoSQL systems are getting more popular and, for
this reason, they are constantly the target of
performance evaluation and optimization. The ability
of these systems to scale better than traditional
relational databases, so much for being schema-less,
motivates a large set of applications to migrate their
data to these systems, even without the intention to
exploit schema flexibility provided by NoSQL
systems. However, accessing structured data in
schema-less systems is costly due to provided
flexibility, incurring in an increase of bandwidth
usage and thus overall performance degradation.
In this paper, we analyse this cost in Apache HBase,
a distributed, opensource and non-relational
database. We propose a new operation named
Prepared Scan that optimizes access to data which
follows a well-defined and regular schema by taking
advantage of this in the context of applications.
We evaluate its performance using an industry
standard benchmark for both non-relational and
relational databases. The evaluation shows that
Prepared Scan improves throughput up to 25% while
reducing network bandwidth consumption up to
20%.
Francisco Neves, José
Pereira, Ricardo
Vilaça and Rui
Oliveira
6 Big Data
Stream
Clustering
Algorithms
Empirical
Evaluation
The nature of the data passing through data-driven
organizations
is changing dramatically. Despite clear technological
advances, analyzing big data and extracting valuable
knowledge is still a great challenge. Big data needs
big storage and highly frequent volumes of data
streams make operations such as analytical
operations, process operations, retrieval operations
real difficult and hugely time consuming. Due to
evolving nature of the data, unsupervised methods
are recommended. Big data summarization requires
lesser storage and extremely shorter time to get
processed and retrieved. Stream clustering is an
efficient strategy against mining of evolving big data.
In this article, we evaluated most popular stream
Annie Ibrahim Rana ,
Giovani Estrada, and
Marc Solé
D9.2.3
© LeanBigData consortium Page 21 of 26
clustering techniques using the simulated data and
the data collected from our test-bed, and presented
our evaluation results.
7 The
LeanBigData
Data
Collection
Framework
An innovate
and adaptable
framework for
collection and
normalization
of structured
data
The data collection for eventual analysis is an old
concept that today receives a revisited interest due to
the emerging of new research trend such Big Data.
Furthermore, considering that a current market trend
is to provide integrated solution to achieve multiple
purposes (such as ISOC, SIEM, CEP, etc.), the data
became very heterogeneous. In this paper is
presented an innovative and adaptable framework for
collection and normalization of structured data,
describing the approach used to collect structured
data and the additional features (pre-processing)
provided with it.
Luigi Coppolino ,
Luigi Sgaglione ,
Gaetano Papale , and
Ferdinando Campanile
8. Detecting Performance Degradation with System Level Metrics
Abstract: With recent consolidation trends, it is today typical to run multiple applications on servers within datacenters. Due to the diversity of applications and the lack of generic models for understanding application behavior, data center (DC) operators and providers cannot easily infer how well an application is behaving with respect to user expectations. Therefore, they tend to resort to application-level metrics that are not always easy to obtain via instrumentation. In this work we examine how system-level metrics can be used to infer application behavior, and specifically performance degradation, when mixed workloads are deployed on consolidated servers. We introduce a lightweight and inexpensive monitoring framework that detects performance degradation of unknown applications using only system-level metrics. We examine more than 700 hardware- and software-level metrics, identify correlations with various application classes and, using this analysis, reduce them to 24 useful metrics. We then use automated profiling to establish acceptable baselines for applications, and we monitor applications dynamically during execution to identify any significant degradation. We evaluate our approach on ten benchmarks of different categories (CPU-, memory-, I/O-, and network-intensive applications) and find that it is able to identify changes in application behavior without any specific knowledge about the applications.
Author(s): Dimitris Ganosis, Yannis Sfakianakis, Manolis Marazakis, and Angelos Bilas
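The baseline-and-monitor idea summarized in this abstract can be illustrated with a minimal sketch. The metric names, sample values, and the simple z-score threshold below are hypothetical assumptions for illustration only, not the framework described in the paper:

```python
import statistics

def build_baseline(samples):
    """Profile a run to get per-metric mean and stdev (the 'acceptable baseline')."""
    return {metric: (statistics.mean(values), statistics.stdev(values))
            for metric, values in samples.items()}

def detect_degradation(baseline, live, threshold=3.0):
    """Flag metrics whose live value deviates from the baseline mean by more
    than `threshold` standard deviations."""
    flagged = []
    for metric, value in live.items():
        mean, stdev = baseline[metric]
        if stdev > 0 and abs(value - mean) / stdev > threshold:
            flagged.append(metric)
    return flagged

# Hypothetical system-level metrics sampled during an isolated profiling run.
profile = {
    "cpu_util": [0.42, 0.45, 0.43, 0.44, 0.41],
    "disk_io_wait_ms": [1.1, 1.3, 1.2, 1.0, 1.2],
}
baseline = build_baseline(profile)
# A live sample taken while a co-located workload interferes with disk I/O.
print(detect_degradation(baseline, {"cpu_util": 0.44, "disk_io_wait_ms": 9.5}))
# → ['disk_io_wait_ms']
```

The key property, as in the paper, is that nothing here requires knowledge of the application itself: only system-level counters are observed.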
9. Transactional Support for Cloud Data Stores
Abstract: In the last decade, an exponential explosion of user-generated data over the Internet has been observed. Traditional data management solutions such as relational databases are simply not able to process this large amount of data in a reasonable time. A new breed of highly scalable data management tools emerged: the cloud data stores. These technologies are able to process petabytes of data, but with an important trade-off: the lack of transactional consistency. The scenario becomes even more complex for those applications whose building blocks sit on top of a hybrid data store ecosystem. This work presents a novel protocol that provides transaction semantics on top of heterogeneous data stores, transparently to applications.
Author(s): Ricardo Jiménez Peris, Marta Patiño Martínez, and Ivan Brondino
10. Satisfying Telecom and IoT big data application requirements using multiple data stores in a coherent way
Abstract: As services move onto the cloud and dataset volumes explode, data management becomes extremely demanding. Multiple data store technologies have emerged (NoSQL, in-memory, columnar) to address specific needs. When all these needs are brought together in an application, big data solutions should respond efficiently. The CoherentPaaS project addresses these issues by providing a rich PaaS with a diversity of data stores and data management technologies optimized for particular tasks and workloads. CoherentPaaS integrates data stores and complex event processing systems with holistic transactional coherence and a common query language that enables data correlation across stores. To assess project results, two use cases are implemented, both addressing different needs of telecoms: a price simulation application and a platform to deploy IoT services such as vehicle monitoring. The combination of different data stores and their integration with real-time stream queries is the focus of development. Benefits and achievements are measured, assessed, and reported.
Author(s): Vassilis Spitadakis, Dimitrios Bouras, Yiorgos Panagiotakis, and Apostolos Hatzimanikatis
11. CoherentPaaS - Real-Time Network Performance Analysis in a Telco Environment Use Case
Abstract: In this paper we present the Real-Time Network Performance Analysis in a Telco Environment use case of the CoherentPaaS project. It is based on the Altice Labs product Altaia, which aims to detect network problems before any degradation or unavailability of services occurs, by actively supervising the network. However, monitoring the whole network implies analyzing large amounts of data in real time, and the current solution does not provide the required degree of scalability. The plan is to use CoherentPaaS to provide a rich Platform-as-a-Service that supports several data stores, accessible via a uniform programming model and language, while complying with demanding delay, throughput, and data volume requirements. All the required transformations are presented, as well as the expected associated benefits.
Author(s): Luis Cortesão and Diogo Regateiro
12. A Taxonomy of Multistore Systems
Abstract: Building cloud data-intensive applications often requires using multiple data stores (NoSQL, HDFS, RDBMS, etc.). However, the wide diversification of data store interfaces makes it difficult to integrate data from multiple data stores. This problem has motivated the design of a new generation of systems, called multistore systems, which provide integrated or transparent access to a number of cloud data stores through one or more query languages. In this paper, we give a taxonomy of multistore systems based on their architecture, data model, query languages, and query processing techniques. To ease comparison, we divide multistore systems based on their level of coupling with the underlying data stores, i.e. loosely coupled, tightly coupled, and hybrid.
Author(s): Carlyna Bondiombouy and Patrick Valduriez
13. Targeted Advertisement case-study: a LeanBigData benchmark
Abstract: In this paper we present the Targeted Advertisement case study, a big data case study of the LeanBigData project. PT (Portugal Telecom) sells multi-platform ads online covering the whole spectrum of web, mobile, and TV, allowing advertisers to define their own campaigns, set their campaign goals and budget, and choose their paid words, as well as many other constraints, including the geographic and demographic profile of the targeted customers. To reliably provide efficient, contextualized, and targeted advertisements to end users, the current architecture of the PT AdServer relies on in-house developed tools for handling the high-throughput stream of data and for performing analysis and visualization. We present a benchmark study performed on the LeanBigData platform, tested against real PT needs in terms of data throughput and scalability.
Author(s): Jorge Teixeira, Miguel Biscaia, Iván Brondino, and Mario Moreira
14. Sentiment Analysis over politics-related big datasets
Abstract: In this paper we propose a novel sentiment analysis technique for Twitter messages related to political debates, using the technique of n-gram graphs in the context of a big data analysis platform (LeanBigData). To this end, we make use of a large dataset of tweets posted during the pre-election period in Greece in September 2015. Tweets are divided among three sentiment classes, according to specific recognized patterns (emoticons), in order to create the respective n-gram graphs and calculate node similarities, so as to train multiple classifiers and compare their behavior. We also compare the performance of various subsets to determine the optimal patterns for aggregating tweets for a concrete analysis. Classification experiments provided promising results compared to the real sentiment of the test dataset, as described in the last section of the paper.
Author(s): Fotis Aisopos, Vrettos Moulos, John Violos, Theodora Varvarigou, Pavlos Kranas, Sotiris Stamokostas, Dimos Kyriazis, Andreas Menychtas, Kleopatra Konstanteli, George Vafiadis, Athanasia Evangelinou, Christina Santzaridou, Vasileios Anagnostopoulos, Anna Gatzioura, and Nikiforos Makrinakis
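The n-gram graph classification idea can be sketched on toy data. The training texts, graph parameters, and containment similarity below are simplified illustrative assumptions, not the paper's implementation: a character n-gram graph is built per sentiment class from labelled texts, and a new text is assigned to the class whose graph contains most of its edges.

```python
from collections import Counter

def ngram_graph(text, n=3, window=2):
    """Character n-gram graph: edges connect n-grams that co-occur within
    `window` positions of each other."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, g in enumerate(grams):
        for j in range(i + 1, min(i + 1 + window, len(grams))):
            edges[(g, grams[j])] += 1
    return edges

def containment_similarity(graph, class_graph):
    """Fraction of the text graph's edges that also appear in the class graph."""
    if not graph:
        return 0.0
    return sum(1 for e in graph if e in class_graph) / len(graph)

def classify(text, class_graphs):
    g = ngram_graph(text)
    return max(class_graphs, key=lambda c: containment_similarity(g, class_graphs[c]))

# Toy labelled texts standing in for emoticon-labelled tweets (hypothetical data).
training = {
    "positive": ["great debate tonight", "great speech, well done"],
    "negative": ["terrible debate tonight", "what a terrible speech"],
}
class_graphs = {label: sum((ngram_graph(t) for t in texts), Counter())
                for label, texts in training.items()}
print(classify("a great night", class_graphs))  # → positive
```

The paper trains multiple classifiers on such graph similarities; this sketch collapses that step into a single nearest-class decision for brevity.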
15. STREAM-OPS: a Streaming Operator Library
Abstract: Nowadays, applications consume huge amounts of live data with real-time processing requirements. Complex Event Processing (CEP) represents a promising technology that allows these applications to process large amounts of information in real time. Some of the available CEP systems, like Apache Storm, provide a programmatic model for the definition of the continuous queries used to process data on the fly. This paper presents STREAM-OPS, a library of streaming operators written in Java that is designed to ease the process of streaming query definition. In the paper we also present an evaluation of the STREAM-OPS library when integrated into two CEP systems developed in the context of the CoherentPaaS and LeanBigData European projects.
Author(s): Ricardo Jiménez-Peris, Valerio Vianello, and Marta Patiño-Martínez
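STREAM-OPS itself is a Java library; as a language-neutral illustration of how a library of composable streaming operators eases continuous-query definition, consider this minimal sketch (the operator names and pipeline API are hypothetical, not the STREAM-OPS interface):

```python
from collections import deque

# Each operator turns one iterator of events into another, so operators
# compose into a continuous-query pipeline.
def op_filter(pred):
    def run(events):
        return (e for e in events if pred(e))
    return run

def op_map(fn):
    def run(events):
        return (fn(e) for e in events)
    return run

def op_window_avg(size):
    """Sliding average over the last `size` events."""
    def run(events):
        win = deque(maxlen=size)
        for e in events:
            win.append(e)
            yield sum(win) / len(win)
    return run

def pipeline(*ops):
    def run(events):
        for op in ops:
            events = op(events)
        return events
    return run

query = pipeline(
    op_filter(lambda e: e >= 0),      # drop invalid readings
    op_map(lambda e: e * 1.8 + 32),   # Celsius → Fahrenheit
    op_window_avg(2),                 # smooth over a 2-event window
)
print(list(query([10, -1, 20])))      # → [50.0, 59.0]
```

Declaring a query as a chain of reusable operators, rather than hand-writing the per-event logic of a CEP topology, is the convenience such a library provides.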
16. Transactional MongoDB
Abstract: The wide adoption of NoSQL data stores, and among them document data stores, owing to their rich functionality and performance, has led to the need for them to also provide transactional semantics. MongoDB, a very popular document data store, requires the application developer to implement his own two-phase commit protocol whenever the application needs to ensure ACID properties. This paper presents an extension of the official MongoDB client that provides transactional semantics and Snapshot Isolation.
Author(s): Pavlos Kranas, Sotiris Stamokostas, George Vafiadis, Athanasia Evangelinou, Christina Santzaridou, Alexandros Psychas, Georgios Palaiokrassas, Achilleas Marinakis, Fotis Aisopos, and Vrettos Moulos
17. Distributed Processing and Transaction Replication in MonetDB - Towards a Scalable Analytical Database System in the Cloud
Abstract: Thanks to its flexibility (i.e., new computing jobs can be set up in minutes, without having to wait for hardware procurement) and elasticity (i.e., more or fewer resources can be allocated to instantly match the current workload), cloud computing has rapidly gained much interest from both academic and commercial users. Moving into the cloud is a clear trend in software development. To provide its users a fast, in-memory optimized analytical database system with all the conveniences of the cloud environment, we embarked upon extending the open-source column-store database MonetDB with new features to make it cloud-ready. In the paper, we elaborate on the new distributed and replicated transaction features in MonetDB. The distributed query processing feature allows MonetDB to horizontally scale out to multiple machines, while the transaction replication schemes increase the availability of the MonetDB database servers.
Author(s): Ying Zhang, Dimitar Nedev, Panagiotis Koutsourakis, and Martin Kersten
18. Gestural Interaction with Data Center 3D Visualizations
Abstract: This paper reports on ongoing work regarding gestural interaction with 3D visualizations of large-scale data centers in the context of data center infrastructure management. The visualization renders a virtual area of real data centers, preserving the actual arrangement of their servers, and displays their current state, including several condition indicators updated in real time, as well as a color-coding scheme for the servers' current condition on a scale from normal to critical. This paper focuses on the interaction requirements for exploring the 3D visualization. To this end, the use of gestures as a means of natural interaction is suggested, in order to provide intuitive, natural, and rich interaction with the demanding environment of a data center.
Author(s): Giannis Drossis, Chryssi Birliraki, Nikolaos Patsiouras, George Margetis, and Constantine Stephanidis
Table 5: RTPBD Workshop Paper Titles and Abstracts
Annex II. POLYGLOT DATABASES Workshop Paper Abstracts
In Table 6, the titles, abstracts, and authors of the POLYGLOT DATABASES papers are reported.
Paper # | Publication Title | Abstract | Author(s)
1. Big Data processing using PolyBase
Abstract: To make good decisions, business decision makers need to analyze both relational data and other data that is not structured into tables, notably data stored in Hadoop and other similar Big Data systems. This is difficult to do unless there exists an efficient way to process queries that access data across these different types of data stores. PolyBase bridges this gap by operating on data that is external to Microsoft SQL Server. PolyBase is a technology that accesses and combines both non-relational and relational data, all from within SQL Server. It allows queries on external data in Hadoop or Azure blob storage, and the queries are optimized to push computation to Hadoop when beneficial. The talk will give an overview of PolyBase and describe its architecture in SQL Server 2016. Some of the key technical challenges and design approaches will also be discussed.
Author(s): Karthik Ramachandra
2. Multistore Systems: Retrospection on CloudMdsQL
Abstract: The blooming of different cloud data management infrastructures has turned multistore systems into a major topic in today's cloud landscape. In this paper, we give an overview of the Cloud Multidatastore Query Language (CloudMdsQL) and the implementation of its query engine. CloudMdsQL is a functional SQL-like language capable of querying multiple heterogeneous data stores (relational, NoSQL, HDFS) within a single query that can contain embedded invocations to each data store's native query interface. The major innovation is that a CloudMdsQL query can exploit the full power of local data stores by simply allowing some local data store native queries (e.g., a breadth-first search query against a graph database) to be called as functions, and at the same time be optimized, e.g., by pushing down select predicates, using bind joins, performing join ordering, or planning intermediate data shipping.
Author(s): Jose Pereira
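The bind join mentioned in this abstract can be illustrated with a toy mediator. The store functions, data, and names below are hypothetical stand-ins, not the CloudMdsQL engine: one store's native query runs first, and only its matching keys are shipped to (bound into) the other store's native call.

```python
# Toy two-store mediator in the spirit of a multistore query engine.
def relational_store(min_age):
    # Stands in for a pushed-down SQL predicate:
    # SELECT id, name FROM users WHERE age >= min_age
    rows = [(1, "ana", 34), (2, "bob", 19), (3, "eve", 40)]
    return [(rid, name) for rid, name, age in rows if age >= min_age]

def graph_store(ids):
    # Stands in for a native graph traversal invoked as a function,
    # bound to the ids produced by the relational side.
    friends = {1: ["bob"], 3: ["ana", "bob"]}
    return {i: friends.get(i, []) for i in ids}

def bind_join(min_age):
    left = relational_store(min_age)               # evaluate left side first
    right = graph_store([rid for rid, _ in left])  # ship only matching keys
    return [(name, right[rid]) for rid, name in left]

print(bind_join(30))  # → [('ana', ['bob']), ('eve', ['ana', 'bob'])]
```

Shipping only the bound keys, instead of materializing the whole graph store, is exactly the intermediate-data-shipping saving the abstract refers to.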
3. Exploiting the data center in contemporary commodity boxes: The scaling-in approach
Abstract: Modern servers pack enough storage and computing power that just a decade ago was spread across a modest-sized cluster. In addition, we are on a technological roadmap in which the storage and compute densities of individual server nodes will continue to increase at a faster rate than the networks that connect the nodes. Thus, we must complement methods that focus on "scaling out" by also developing methods to "scale in", to fully exploit the hardware capabilities packed into each server node. This is especially true for an important class of real-time in-memory analytic data applications. The recent Apache-incubated Quickstep project focuses on this scaling-in aspect. Quickstep uses novel methods for organizing data (including columnar and hybrid storage organization), template metaprogramming for vectorized query execution, and a query execution paradigm that separates control flow from data flow. Collectively, these methods produce more than an order-of-magnitude performance improvement over many existing open-source platforms.
Author(s): Jignesh Patel
4. Blending OLTP and OLAP to Deliver Real-Time Analytical Queries
Abstract: Traditionally, OLTP and OLAP workloads have been served by different kinds of database systems: transactional databases and data warehouses. This separation has resulted in having to organize a process that copies the data from the operational database into the data warehouse, known as extract-transform-load (ETL). This process is estimated to account for 80% of the budget of doing business analytics. LeanXcale is a NewSQL database that scales transactional processing linearly to hundreds of nodes, thereby providing an ultra-scalable OLTP database. Thanks to its ability to scale the OLTP engine as much as needed, an OLAP engine has been built that works directly over the operational data, delivering real-time analytical queries.
Author(s): Ricardo Jimenez-Peris
Table 6: POLYGLOT DATABASES Workshop Paper Titles and Abstracts