
D5.2 Integration of DITAS and case studies validation report

Project Acronym: DITAS
Project Title: Data-intensive applications Improvement by moving daTA and computation in mixed cloud/fog environmentS
Project Number: 731945
Instrument: Collaborative Project
Start Date: 01/01/2017
Duration: 36 months
Thematic Priority: ICT-06-2016 Cloud Computing
Website: http://www.ditas-project.eu
Dissemination level: Public
Work Package: WP5 Real world case studies and integration
Due Date: M18
Submission Date: 31/07/2018
Version: 1.0
Status: Final
Author(s): Aitor Fernández (IDEKO), Borja Tornos (IDEKO), Javier Escartín (IDEKO), Eleonora Ciceri (OSR), Mariet Nouri Janian (OSR), Paola Aurucci (OSR), Andrea Micheletti (OSR), Grigor Pavlov (CS)
Reviewer(s): Vrettos Moulos (ICCS), Pierluigi Plebani (POLIMI), David García Pérez (ATOS)

This project has received funding by the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 731945


Version History

Version  Date        Comments, Changes, Status                  Authors, contributors, reviewers
0.1      05/05/2018  Initial version                            Aitor Fernández (IDEKO)
0.2      09/05/2018  Changes to the document structure          Borja Tornos (IDEKO)
0.3      12/05/2018  Added CI system                            Borja Tornos (IDEKO)
0.4      19/05/2018  Added OSR use case                         Eleonora Ciceri (OSR)
0.5      19/05/2018  Some more details to CI system             Borja Tornos (IDEKO)
0.6      08/06/2018  Changes to IDEKO’s use case structure      Aitor Fernández (IDEKO)
0.7      12/06/2018  IDEKO’s use case first version             Aitor Fernández (IDEKO)
0.8      09/07/2018  IDEKO’s use case second version            Aitor Fernández (IDEKO)
0.9      10/07/2018  Fixed style issues, figures and tables     Aitor Fernández (IDEKO)
0.10     11/07/2018  Execution Environment testbed              Grigor Pavlov (CS)
0.11     19/07/2018  Format improvements                        Borja Tornos (IDEKO)
0.12     20/07/2018  Added executive summary                    Borja Tornos (IDEKO), Eleonora Ciceri (OSR)
0.13     22/07/2018  Ready for review version                   Aitor Fernández (IDEKO)
0.14     30/07/2018  Reviewed version and updated from users    Pierluigi Plebani (POLIMI), Vrettos Moulos (ICCS), Borja Tornos (IDEKO), Eleonora Ciceri (OSR), Aitor Fernández (IDEKO), Grigor Pavlov (CS)
1.0      31/07/2018  Final version for submission               David García Pérez (ATOS)


Contents

Version History ... 2
List of Figures ... 4
List of tables ... 5
Executive Summary ... 6
1 Introduction ... 7
1.1 Purpose ... 7
1.2 Glossary of Acronyms ... 8
2 DITAS components ... 10
2.1 Logical diagram (SDK components) ... 10
2.2 Integration Diagram ... 10
2.3 Deployment diagram (Execution Environment) ... 11
3 Continuous integration system ... 13
3.1 Architecture ... 13
3.2 Jenkins pipeline ... 14
3.3 Building stage ... 15
3.4 Artifacts ... 16
3.5 Automated API testing ... 17
3.6 Integration Tests ... 17
4 Testbeds ... 19
4.1 Core Components Testbed ... 19
4.1.1 Introduction ... 19
4.1.2 Architecture and execution flow ... 19
4.2 Execution Environment testbed ... 20
4.2.1 IDEKO use case Execution Environment testbed ... 20
4.2.2 OSR use case Execution Environment testbed ... 21
5 Case studies and DITAS framework ... 23
5.1 OSR use case ... 23
5.1.1 Scenario 1 ... 23
5.1.2 Scenario 2 ... 27
5.1.3 Scenario 3 ... 34
5.1.4 Application ... 41
5.2 IDEKO use case ... 41
5.2.1 Contextualization ... 41
5.2.2 Machine simulator ... 42
5.2.3 Hardware architecture ... 44
5.2.4 Data sources ... 45


5.2.5 Virtual Data Containers for the use case ... 46
5.2.6 Application ... 48
6 Conclusions ... 56
References ... 57

List of Figures

Figure 1 - Logical diagram of core components ... 10
Figure 2 – Integration diagram of core components ... 11
Figure 3 – Deployment diagram ... 12
Figure 4 – Continuous Integration system architecture ... 13
Figure 5 – Jenkins pipeline stages ... 14
Figure 6 – Building using Docker ... 16
Figure 7 – DITAS Docker HUB ... 17
Figure 8 – From development to testbed process ... 19
Figure 9 – Architecture and execution flow ... 20
Figure 10 – IDEKO Execution Environment testbed diagram ... 21
Figure 11 – OSR Execution Environment testbed diagram ... 22
Figure 12 – Scenario 1 actors ... 24
Figure 13 – Doctor may be in either of the hospitals ... 28
Figure 14 – Mapping secret IDs and anonymized data in Cloud ... 32
Figure 15 – Scenario 2 actors ... 34
Figure 16 – Soraluce FS milling boring machine ... 42
Figure 17 – IDEKO’s use case hardware architecture ... 45
Figure 18 – IDEKO’s use case data sources ... 45
Figure 19 – Data flow for the Cloud API ... 46
Figure 20 – IDEKO’s use case VDC strategy approaches ... 47
Figure 21 – Segmentation process flow diagram ... 49
Figure 22 – Anomaly detection system process flow diagram ... 50
Figure 23 – Anomaly detection system process flow diagram ... 51
Figure 24 – Anomaly trigger cause ... 52
Figure 25 – Spindle trajectory during operation ... 52
Figure 26 – Operation position in piece outline ... 52
Figure 27 – IDEKO’s use case components architecture ... 53
Figure 28 – IDEKO’s use case app components deployment diagram ... 55


List of tables

Table 1 - Acronyms ... 9
Table 2 - Use case 1.1 - Collect patient’s clinical information, for treating an emergency ... 26
Table 3 - Use case 1.2 - Collect average value for research purposes ... 26
Table 4 - Use case 2 - Collect patient’s clinical information, for emergency treatment purposes ... 31
Table 5 - Use case 3 - Collect BMI-cholesterol distributions of individuals, for research purposes ... 38
Table 6 - Indicators for a Soraluce FS machine ... 44
Table 7 - IDEKO’s use case blueprints methods ... 48


Executive Summary

The objective of this deliverable is twofold: on the one side, to give a thorough description of the integration system used for the components of the project, and on the other side, to demonstrate the integration of the DITAS framework with the two case studies: the e-Health use case with OSR and the Industry 4.0 case study with IDEKO.

This document covers the strategy selected for integrating the DITAS components, explaining the two logical groups of framework components (SDK and runtime) and detailing the Jenkins-based continuous integration (CI) model. The CI system is one of the main pillars of the development; in this document we show the architecture of the CI system introduced in deliverable 5.1 [2] and a detailed view of how the whole process works, where every stage runs and where everything gets deployed.

Two testbeds have been deployed to test the framework: a simple Core Components testbed, where the SDK runs and allows framework users to publish, search and use VDC blueprints, and a runtime testbed, where the Kubernetes cluster is deployed and which the use cases can use to demonstrate the framework.

Focusing on the case studies, the e-Health use case with OSR supports the activities of the hospital staff (both on the diagnostic side and on the research side) with the retrieval of structured and unstructured information coming from the patients’ EHRs. Here, importance is given to privacy and security, as patients’ data can be processed only according to the consent patients gave to the physicians and researchers. Thus, we will study two case studies: i) the case in which patients’ data have to be retrieved (fast and with high availability) to treat an emergency scenario; ii) the case in which patients’ data have to be retrieved (properly anonymized) for research purposes. These very different scenarios (both in terms of what the hospital can offer as a data administrator and in terms of requirements that users would like to achieve as data consumers) show how DITAS is a key framework for handling health data in different contexts and for exploitation in different applications.

The Industry 4.0 use case with IDEKO uses the DITAS framework to develop an advanced technical service application that helps to address common technical service drawbacks. Using three different machine data sources, it is shown how IDEKO, along with the DITAS framework, manages different VDC methods to serve these data, and how an application decoupled from the machine uses these data for three different services: segmentation, anomaly detection and anomaly analysis.

The DITAS framework has been tested on both use cases, and the ability of the VDCs to abstract the application developer from the data sources has been successfully demonstrated. However, it is important to point out that this deliverable does not include a quantified validation for each use case. A quantified validation will be provided as an annex in September.


1 Introduction

1.1 Purpose

One of the aims of the DITAS project is to ensure the quality of the product along with homogeneous development work across the different components. Having a continuous integration system ensures more accurate results and detects deficiencies early in development, when defects are typically smaller, less complex and easier to resolve. The DITAS CI system was introduced in D5.1 [2]; this document gives a more thorough explanation of how the pipeline was built, how each stage of the pipeline works and how the different types of tests are performed, as well as a deeper specification of the CI architecture. This deliverable also puts the focus on the case studies, where the DITAS framework is demonstrated in two different real environments.

The OSR use case focuses on the e-Health context, where data coming from the hospital patients’ Electronic Health Records are used for two purposes:

i) answering the needs of a medical doctor who has to treat a patient for a specific pathology/condition;

ii) answering the needs of a medical researcher who has to conduct a statistical study on the patients’ population to extract some metrics.

The eHealth case studies arise from the contribution in Deliverable D5.1 [2] and iterate over three different scenarios, from a very simple one in which the medical doctor and the researcher are both located in the hospital, to a quite complex one in which several environments (untrusted clouds, trusted clouds, edge in hospitals and edge in untrusted research centers) are considered. When such different environments are taken into account, it becomes clear how a data management procedure that safeguards the privacy and security guarantees is vital for the use case itself. Indeed, health data are amongst the most sensitive, and they have to be treated in accordance with the consent given by patients and the most recent regulations (see GDPR, i.e., the General Data Protection Regulation, currently in force in all EU countries). Patients’ data can, for instance, be fully processed by doctors in an ER when a patient is in an emergency situation, but have to be fully anonymized when used by external researchers (i.e., researchers that are not co-holders of the data) so that the re-identification of patients is not possible. In this picture, the DITAS framework becomes vital, as:

i) Data coming from/getting to the hospital are of different types and in most cases unstructured.

ii) There are performance, data utility, privacy and security guarantees that have to be put in place to handle patients’ sensitive data properly.

iii) Large volumes of data are generated in large hospitals like Ospedale San Raffaele and an automatic way of extracting knowledge from them is very important.

This deliverable discusses in detail the development of the three scenarios presented in D5.1 [2], together with their case studies (describing how actors are supposed to interact with the system) and the expected methods for accessing data.

The IDEKO use case focuses on the Industry 4.0 context, where managing the technical service so as to have a detailed view of the state of the machine anytime and anywhere is a growing challenge for manufacturing companies around the


globe. Rising response and diagnostics times adversely affect customer satisfaction, and travelling costs directly impact the vendor company’s profitability. Developing a technical-service-oriented application with the DITAS framework will help to address the common technical service drawbacks explained above. This deliverable covers this application in detail, showing a low-level architecture of the application, detailing the Edge and Cloud data sources used, and explaining the methods of the DITAS framework used by the application, among other things.

We have a different runtime testbed for each use case: IDEKO’s use case provides 3 Edge devices associated with 3 machines, and OSR provides two different data sources (answering the needs of Scenario 1 - Running example, see section 5.1.1) describing blood tests and patients’ biographical data. Having a separate testbed for each use case was agreed only for the first release.

This deliverable is structured as follows. Section 2 details the DITAS components with different diagrams to help understand their integration; section 3 describes the continuous integration system we use for developing and integrating the components described in section 2. Section 4 covers the testbeds, where the core components of the SDK reside and where the Execution Environment components for each of the case studies run. Finally, section 5 details the IDEKO and OSR case studies.

1.2 Glossary of Acronyms

Acronym   Definition
API       Application Programming Interface
BMI       Body Mass Index
BP        Blueprint
CI        Continuous Integration
CNC       Computer Numerical Control
CPU       Central Processing Unit
D         Deliverable
DAS       Dynamics Active Stabiliser
DNS       Domain Name System
EHR       Electronic Health Record
ER        Emergency Room
GB        Gigabyte
GDPR      General Data Protection Regulation
GHZ       Gigahertz
HDD       Hard Disk Drive
JRE       Java Runtime Environment
OAS       OpenAPI Specification
OS        Operating System
PAR       Periodic Activity Report
PC        Project Coordinator
PM        Project Manager
PMB       Project Management Board
PO        Project Officer
RAM       Random-Access Memory
REST      Representational State Transfer
SDK       Software Development Kit


SSN       Social Security Number
STM       Scientific and Technical Manager
SPR       Semester Progress Report
UC        Use Case
UI        User Interface
VDC       Virtual Data Container
VDM       Virtual Data Manager
VM        Virtual Machine

Table 1 - Acronyms


2 DITAS components

As stated in deliverable 4.2 [3], the DITAS framework can be divided into two logical groups: the one that encompasses the SDK components (core components), and the one that manages the Execution Environment components. The logical and integration diagrams focus on the first logical group, the core components, and the deployment diagram focuses on the second logical group, the Execution Environment components.

2.1 Logical diagram (SDK components)

The following image shows a logical diagram of the DITAS core components and the relations among them:

Figure 1 - Logical diagram of core components

The core components of the DITAS SDK were explained in deliverable 3.2 [4] and deliverable 4.2 [3]. Please refer to those documents for more details. As a reminder, the components in Figure 1 can be shortly explained as follows:

● Data Utility Resolution Engine: Ranks the abstract VDCs in the BP repository.
● Data Utility Refinement: Evaluates the relevance of the data utility dimensions.
● Data Utility Evaluator: Evaluates the data utility related to a data set.
● Privacy Security Evaluator: Filters and ranks blueprints according to non-functional security requirements.
● VDC Repository Engine: Stores blueprints.
● VDC Resolution Engine: Searches and ranks blueprints.
● Deployment Engine: Deploys the Execution Environment components.

2.2 Integration Diagram

Having an integration diagram is essential when building the integration tests. The integration diagram in the next image shows the


interaction between the individual core components. An arrow means a call (results are also considered) from one component to another. For example, if we have A → B, this means that component A calls component B. Each arrow represents 1...N calls to the destination component.

Figure 2 – Integration diagram of core components

It is important to point out that several databases are used to manage the data of the different components. All the databases run as containers on the Staging and Production machines (see subsection 3.1), as sketched after the list below.

● Deployment engine: Uses a MySQL database to store data for the deployment of a blueprint.
● VDC Repository Engine: Uses a MongoDB database to store the blueprints and an Elasticsearch instance to index the blueprints so they can be searched afterwards.
● VDC Resolution Engine: Uses Elasticsearch to search indexed blueprints.
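A minimal sketch of how these backing stores can be started as containers on the Staging or Production machine is shown below; the container names, image versions, ports and credentials are assumptions for illustration, not the actual DITAS configuration.

# Hypothetical example; names, versions and credentials are placeholders.
docker run -d --name deployment-engine-db \
  -e MYSQL_ROOT_PASSWORD=changeme -p 3306:3306 mysql:5.7

docker run -d --name vdc-repository-db -p 27017:27017 mongo:3.6

docker run -d --name blueprint-index \
  -e "discovery.type=single-node" -p 9200:9200 \
  docker.elastic.co/elasticsearch/elasticsearch:6.3.0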

2.3 Deployment diagram (Execution Environment)

The Deployment Engine is in charge of deploying the Execution Environment. The deployment order of the components is the following:

1. The Deployment Engine receives an Abstract Blueprint.
2. The Deployment Engine deploys a Kubernetes cluster. If a Kubernetes cluster has already been deployed for the same Abstract Blueprint, the Deployment Engine re-uses it.
3. The Deployment Engine deploys a Virtual Data Manager (VDM). As with the Kubernetes cluster, if a VDM has already been deployed for the same Abstract Blueprint, the Deployment Engine re-uses it.


4. The Deployment Engine deploys the containers that make up the Virtual Data Container (VDC).
5. The Deployment Engine updates the blueprint with the DNS names of all deployed services and creates the concrete blueprint.

Figure 3 – Deployment diagram

For more details about the Deployment Engine, refer to deliverable 4.2 [3].
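To make step 1 above concrete, triggering a deployment boils down to sending the Abstract Blueprint to the Deployment Engine over its REST interface. The call below is only a sketch: the host, port, path and payload file name are assumptions, not the component's documented API.

# Hypothetical call; host, port, path and file name are placeholders.
curl -X POST \
  -H "Content-Type: application/json" \
  -d @abstract-blueprint.json \
  http://staging-machine:8080/deployments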


3 Continuous integration system

In this section, a deeper specification of the architecture and of the pipeline definition is given. It is based on the Continuous Integration system introduced in deliverable 5.1 [2].

3.1 Architecture

The DITAS Continuous Integration (CI) system architecture consists of five different machines deployed over CloudSigma’s Cloud infrastructure.

The CI system itself consists of a master/slave architecture with two slaves. The master node serves the Jenkins web application and acts as the common database. The slaves are used for building the core components of the DITAS framework. Having two slave nodes allows different building processes to take place at the same time.

The CI system deploys the components on the Staging machine, where the API validation and integration tests run. If both tests succeed, the component is deployed on the Production machine.

Figure 4 – Continuous Integration system architecture

● Machine #1 – Jenkins Master
  ○ 3GHz, 4GB RAM, 50GB HDD
  ○ Where the Jenkins application is installed and running. The application is served on port 8080 and is accessible via browser.


● Machine #2 – Jenkins Slave 1
  ○ 3GHz, 8GB RAM, 100GB HDD
  ○ Where the code of the pipeline runs. As explained in the following section, Docker is installed on this machine, as every developer needs to set up their own environment to build and run each component.
● Machine #3 – Jenkins Slave 2
  ○ 3GHz, 8GB RAM, 100GB HDD
  ○ An exact copy of Jenkins Slave 1. This is a fallback for the Jenkins Slave 1 machine.
● Machine #4 – Staging machine
  ○ 8GHz, 8GB RAM, 80GB HDD
  ○ Where the components run before going to production. This step is necessary as we need to run the API validation and integration tests here before deploying to production.
● Machine #5 – Production machine
  ○ 8GHz, 8GB RAM, 80GB HDD
  ○ Where the component is finally deployed.

3.2 Jenkins pipeline

The Jenkins pipeline of each component is divided into six stages. A stage is a primary building block in the pipeline, dividing the steps of a pipeline into explicit units and helping to visualize the progress using the Jenkins UI.

For the DITAS continuous integration system, and to ensure the quality of the product along with homogeneous development work, the Jenkins pipeline of each component has the following form:

Figure 5 – Jenkins pipeline stages

● Build and test
  ○ Using a Dockerfile named Dockerfile.build, the developer creates the necessary environment to run the unit tests and build the component using Docker containers (see subsection 3.3).
● Image creation
  ○ Using a Dockerfile named Dockerfile.artifact, the developer creates the necessary environment to run the component. Everything is packed into a Docker container (the component artifact) and pushed to the DITAS Docker Hub [5].


● Image deploy (Staging machine)
  ○ As the component is already on the DITAS Docker Hub, the Docker image of the component is simply pulled from the hub and executed on the Staging machine.
● API validation
  ○ Using Dredd [6], the component’s API definition is automatically validated. A further explanation can be found in subsection 3.5, Automated API testing.
● Integration tests
  ○ Integration tests between the components that are deployed on the Staging machine. See subsection 3.6, Integration tests.
● Image deploy (Production machine)
  ○ After passing the API validation and integration tests, the component is deployed on the Production machine.

Simplifying the Jenkins Pipeline just to focus on the stages, the structure is as follows:

pipeline {
    agent none
    stages {
        stage('Build - test') { [...] }
        stage('Image creation') { [...] }
        stage('Image deploy - Staging') { [...] }
        stage('API validation') { [...] }
        stage('Integration tests') { [...] }
        stage('Image deploy - Production') { [...] }
    }
}

In order to ease the development experience, email notifications are sent to developers every time the pipeline fails, detailing the event and the error.

3.3 Building stage

For a project to be built on a Jenkins node, the node must have every project dependency installed. For example, for a Java project, the JRE, Maven and/or other tools must be installed on the node. For a web application, PHP, NodeJS or other dependencies must be in place. On the other hand, not every project has to use the same versions of the dependencies. Moreover, the developer would have to indicate the project dependencies and the Jenkins administrator would have to ensure that those are installed on every node in order to be able to build on it.

This is a rather complex scenario. The nodes would end up with a vast amount of dependencies and versions installed, and this would lead to problems sooner or later.

In order to avoid this, and to allow very different projects with very different dependencies (or different versions of them) to be built on the same node, the building


process must take place inside a Docker container. The container will have every dependency the project needs, and the building, test execution, and every other stage will take place inside it.

The following image illustrates this philosophy:

Figure 6 – Building using Docker

The Jenkins slaves are running Docker. When the pipeline is executed, the indicated Docker container is deployed onto the slave. The source code is then automatically copied into the container. As the container includes all the dependencies needed to run the project, the rest of the pipeline can be safely executed.

For this to happen, the Jenkinsfile [7] must specify that the project must be built inside a container, providing a Dockerfile in the root of the repository. We are using two different Dockerfiles for each project/component:

● Dockerfile.build: The Dockerfile used for the building process in the “Build and Test” stage, which must contain all the necessary dependencies to run the unit tests and build the component.
● Dockerfile.artifact: The Dockerfile used for the image creation process in the “Image creation” stage, which must contain only the dependencies needed to run the application.

The image generated from the Dockerfile.artifact is lighter than the image generated from the Dockerfile.build, as it only contains the elements needed to run the application. For example, if we have a Java application, in the Dockerfile.build we will need Maven to build and test the application, while in the Dockerfile.artifact we only need Java to run it.
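A minimal sketch of how these two Dockerfiles are typically exercised is shown below; the image names and the Maven command are assumptions for a hypothetical Java component, not the commands of a specific DITAS repository.

# Hypothetical commands; image names and build tool are placeholders.

# Build-and-test image: contains the JDK and Maven, used to run the unit
# tests and produce the packaged component.
docker build -f Dockerfile.build -t example-component:build .
docker run --rm example-component:build mvn test

# Artifact image: contains only what is needed to run the component
# (e.g., a JRE) and is pushed to the DITAS Docker Hub.
docker build -f Dockerfile.artifact -t ditas/example-component:latest .
docker push ditas/example-component:latest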

3.4 Artifacts

The artifacts for the DITAS core components are the Docker images that will then be deployed on the testbed. As stated in the previous section, and to maintain the open source philosophy of the project, these core components are also pushed to the DITAS Docker Hub. Docker Hub repositories let you share images with co-workers, customers, or the Docker community at large.


Figure 7 – DITAS Docker HUB

3.5 Automated API testing

For the automated API testing we are using Dredd, a language-agnostic command-line tool for validating an API description against its back-end implementation. Dredd reads the API description from the definition file and, step by step, validates whether the API implementation replies with the responses described in the documentation. If the Dredd command fails, that is, if for example the API returns something that does not match the API description, the pipeline will fail and stop.

For the development of the components’ APIs, we are using Swagger [8] for the definition file, which is the most widely used tooling ecosystem for developing APIs with the OpenAPI Specification (OAS) [9]. As Dredd only supports Swagger 2.0, we are currently using that version.

A Dredd call is as follows:

dredd swaggerComponentDefinition.yaml http://ip:component_port

3.6 Integration Tests

Unit tests in the Build and Test stage focus on one particular unit of code. Often, this is a specific method or function of a bigger component. These tests are done in isolation, where all external dependencies are typically stubbed or mocked, using third-party tools like Maven, for example.

Once our core components are deployed on the Staging machine and their API definitions have been validated, we can focus on integration tests, where multiple, bigger units are tested in interaction. The purpose of this level of testing is to expose faults in the interaction between integrated units.

The approach to the integration tests for each component depends on the code and the developer. As we do not have a common programming language for the components, developers use any strategy that works for their code, such as Maven, Gradle or direct curl calls, as sketched below.
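A minimal sketch of such a curl-based integration test is shown below; the hostnames, ports, paths and payload are assumptions used only to illustrate how a failed interaction makes the stage (and therefore the pipeline) fail.

#!/bin/bash
# Hypothetical integration check between two components on the Staging machine.
# Endpoints and ports are placeholders, not the actual DITAS APIs.
set -e

STAGING=staging-machine

# Publish a sample blueprint through the VDC Repository Engine (assumed endpoint).
curl -sf -X POST -H "Content-Type: application/json" \
  -d @sample-blueprint.json \
  "http://${STAGING}:8080/blueprints"

# Ask the VDC Resolution Engine for it; grep exits non-zero if it is missing,
# which makes the script (and the Jenkins stage) fail.
curl -sf "http://${STAGING}:8081/blueprints/search?name=sample-blueprint" \
  | grep -q "sample-blueprint"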


If the integration tests fail, the pipeline will also fail and, thus, the component will not be deployed to production.


4 Testbeds

This section describes the two testbeds used in the project to (i) host the core components of the framework and (ii) deploy the use cases.

4.1 Core Components Testbed

In this subsection we introduce the core components testbed, where the core components of the DITAS SDK reside.

4.1.1 Introduction

In order to run and test the core components of the DITAS platform, an initial version of the testbed has been deployed over CloudSigma’s Cloud infrastructure. It consists of two servers:

● Staging
● Production

As detailed in previous sections, each machine has 8GB of RAM and an 8GHz CPU and runs an Ubuntu distribution. In order to run the DITAS platform’s containerized components, both of them have Docker Community Edition [10] installed.

It is important to mention that the main purpose of this testbed is to host and test the core components of the DITAS platform; the rest of the components, the Execution Environment components, run on the Execution Environment testbed detailed later in this document.

As described in deliverable 5.1 [2], every component developed in DITAS has its own repository on GitHub [11]. Along with the code, every repository has a Jenkinsfile defining the pipeline for the Continuous Integration process (see section 3 for more details). The Jenkins-based Continuous Integration system is in charge of monitoring commits on each GitHub repository of the project and running the associated Jenkins pipeline. Those jobs compile, build and perform unit tests on the component and, if every stage of the pipeline runs successfully, the component is deployed to the Staging server. The Staging server acts as a testing ground for each component. If the tests that will be detailed in the next section succeed, the component is moved to the Production server.

End users from the case studies can test and run production-ready core components of the platform. The following image shows the high-level process.

Figure 8 – From development to testbed process

4.1.2 Architecture and execution flow

The containerized core component runs on the Staging machine, where the API validation and the integration tests run. These two processes are explained in section 3. If both processes succeed, the component is deployed on the Production machine. If either process fails, the component is not deployed.


Figure 9 – Architecture and execution flow

4.2 Execution Environment testbed

For the first release of the project (at month 18) it was agreed to provide a testbed for each of the use case scenarios - IDEKO and OSR.

The DITAS project shall allow both users to add their Edge devices to different Kubernetes [12] clusters as separate nodes and, therefore, to use the resources of the virtual machines on the cloud.

Each of the testbeds is described below in separate sections.

4.2.1 IDEKO use case Execution Environment testbed

The IDEKO use case provides 3 different machines as Edge devices. Each of them is an Ubuntu Server host with an x86_64 architecture. More information about the hardware can be found in section 5.2.3, Hardware architecture. They are added as nodes in the Kubernetes cluster, as sketched below.
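As an illustration of how an Edge device can be attached to the cluster as a worker node, a kubeadm-based join could look as follows; the master address, token and hash are placeholders, since the exact bootstrap procedure of the testbed is not detailed here.

# Run on each Edge device (edge1, edge2, edge3); all values are placeholders.
sudo kubeadm join <master-ip>:6443 \
  --token <bootstrap-token> \
  --discovery-token-ca-cert-hash sha256:<ca-cert-hash>

# From the master node, verify that the Edge nodes appear next to the cloud VMs.
kubectl get nodes -o wide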

Besides the Edge nodes, 3 VMs representing the Cloud part of the fog environment have been added to the cluster. They are located in 3 different locations in CloudSigma’s facilities. Each VM runs Ubuntu 16.04. For simplicity the machines are named node1, node2 and node3. Node1 is chosen as the master node of the Kubernetes cluster. Node1 is located in the CloudSigma data center in MIA (Miami, USA), Node2 is located in the CloudSigma data center in ZRH (Zurich, Switzerland) and Node3 is located in SJC (San Jose, USA). As storage for the Kubernetes cluster, a Gluster cluster based on the Gluster File System is used. GlusterFS has many advantages, such as the possibility to scale data up to several petabytes; it provides replication, quotas, geo-replication, snapshots and bitrot detection, and it allows optimization for different workloads. In the same locations there are Gluster storage nodes named gluster1 (MIA), gluster2 (ZRH) and gluster3 (SJC); each has 300GB of storage and runs Ubuntu 16.04, which uses 15GB of that storage.

Due to the specific networking implementation in IDEKO’s facilities, it was necessary to configure port forwarding on the router for all Edge devices. In the following diagram IDEKO’s Edge devices are named edge1, edge2 and edge3.


Figure 10 – IDEKO Execution Environment testbed diagram

The testbed, containing the master node and the other nodes as part of the fog computing environment, shall provide the possibility to run all DITAS components as services in order to build the necessary environment.

The Deployment Engine shall automatically deploy the cluster and the components so that they are working when the deployment finishes.

4.2.2 OSR use case Execution Environment testbed

For the first release, the OSR use case provides an externally accessible MySQL database with patients’ data. It can be accessed from outside the Kubernetes cluster and can be synced, using master-slave replication, with a MySQL database inside a container running in the Kubernetes cluster as a pod, as sketched below.
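A minimal sketch of how the in-cluster MySQL replica could be pointed at the external OSR master is shown below; the pod name, hostname, credentials and binary log coordinates are placeholders, not the values used in the actual testbed.

# Hypothetical commands; all identifiers and credentials are placeholders.
kubectl exec mysql-replica-0 -- mysql -uroot -p'<root-password>' -e "
  CHANGE MASTER TO
    MASTER_HOST='osr-external-db.example.org',
    MASTER_USER='replication',
    MASTER_PASSWORD='<replication-password>',
    MASTER_LOG_FILE='mysql-bin.000001',
    MASTER_LOG_POS=4;
  START SLAVE;"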

As the environment selected within the DITAS project is Kubernetes, a Kubernetes cluster with 3 VMs is created, which represents the Cloud part of the fog environment for this use case. In the same way as in the IDEKO use case, the VMs are located in 3 different locations in CloudSigma’s facilities. Each VM runs Ubuntu 16.04. For simplicity the machines are named node1, node2 and node3. Node1 is chosen as the master node of the Kubernetes cluster. Node1 is located in the CloudSigma data center in MIA (Miami, USA), Node2 is located in the CloudSigma data center in ZRH (Zurich, Switzerland) and Node3 is located in SJC (San Jose, USA). GlusterFS is also used here for storage purposes. The GlusterFS nodes are deployed in the same locations as the former nodes and are named gluster1 (MIA), gluster2 (ZRH) and gluster3 (SJC); each has 300GB of storage and runs Ubuntu 16.04, which uses 15GB of that storage.

The database can be accessed through an open port, with connections allowed only from Kubernetes network IPs.


Figure 11 – OSR Execution Environment testbed diagram

This testbed contains the master node and the other nodes as part of the fog computing environment, which shall provide the possibility to run all DITAS utilities as services in order to build the necessary environment.

The Deployment Engine shall automatically deploy the cluster and the components so that they are working when the deployment finishes.


5 Case studies and DITAS framework

This section covers the two case studies for the DITAS project: the Industry 4.0 use case with IDEKO’s smart box, with one single scenario, and the e-Health use case with OSR, with three different scenarios. For each use case, we define a contextualization (mainly for ease of understanding), the data sources used for the use case, the VDC strategy and the application paired with the VDCs.

NOTE: the VDC blueprints for this use case can be found at https://github.com/DITAS-Project/blueprint/tree/master/ehealth (work in progress).

5.1 OSR use case

Based on what was presented in D5.1 (Specification of integration and validation testbed and cloud platform for DITAS) [2], we specify three different scenarios, of growing complexity, and their related use cases.

5.1.1 Scenario 1

In the following, we present the contextualization, data sources and VDC strategy for Scenario 1 of the eHealth use case, as presented in deliverable D5.1 [2].

5.1.1.1 Contextualization

In this scenario, we revive the work done for the running example mentioned in D2.2 (DITAS data management) [14], which fits the version of Scenario 1 reported in deliverable D5.1 [2].

Notice that while the infrastructure is the same (i.e., a Cloud environment and a hospital Edge environment are considered for both the original Scenario 1 in D5.1 [2] and the running example in D2.2 [14]), there is a difference in the actors considered:

● while D5.1 [2] talks about hospital doctors, the running example in D2.2 [14] talks about family doctors;
● the running example in D2.2 [14] considers an additional figure, i.e., the researcher.

As the running example in D2.2 [14] works in the same context as D5.1 [2] and expands it in terms of actors, we will consider this example in the following description.


Figure 12 – Scenario 1 actors

In this scenario, there is a single hospital involved, which has a trusted Cloud space used to store clinical data and to provide access for researchers.

Patients’ data are distributed over:

● The Edge infrastructure of the hospital (both from the clinician side and from the researcher side);
● The Cloud environment of the hospital.

As introduced in D2.2 [14], clinical data and biographical information of the patients are stored in the cloud storage, and data can be moved between the cloud environment and the edge environment when needed by the actors (i.e., doctors and researchers).

In the following, we introduce the case study and related use case that could take advantage of the DITAS framework to help clinicians and researchers to do their work.

5.1.1.1.1 Case study

In this section, we report the case study associated with Scenario 1. It expands the description of the very same scenario reported in Deliverables D2.2 [14], D3.2 [4] and D4.2 [3], adding more details.

5.1.1.1.1.1 Case study 1: ER doctor

Claudio Gialli, an ER doctor in Plainsboro Hospital, is treating an emergency situation involving Fiona Carrelli, a 50-year-old patient who had a stroke. Claudio is interested in knowing more about Fiona and in checking her blood tests. He thus accesses the system so as to visualize her biographical data, the last values of her blood tests, and a time series of all the collected values for specific blood test components (e.g., antithrombin) which are correlated with the stroke pathology.

5.1.1.1.1.2 Case study 2: Researcher

To improve its internal processes and to provide additional services to its patients and employed doctors, the Plainsboro Hospital decides to digitize the results of all the blood tests that had been taken in its laboratories from the 1960s to the 1990s.

In this way, as all the exams produced from the 1990s onwards are already digitized, Plainsboro Hospital can enlarge its information heritage and make it available to


researchers who work on the hospital premises (and are thus trusted) and want to perform their research on blood tests. As several blood tests per year are performed in the hospital, they constitute a good basis for research, since the high volume of produced data may allow researchers to obtain a proper estimate of the distribution of blood test components over the general population.

For instance, Luigi Rossi, a researcher in Plainsboro Hospital, performs studies on the cholesterol distribution over specific age ranges. To perform his research, Luigi accesses the system and asks to visualize the average distribution of cholesterol over a specific age range (e.g., 18-65). Luigi is a researcher, and thus he is not allowed to visualize the biographical data of a patient. Instead, he will receive only the information he needs for his research, complying with the data minimization principle (Article 5 GDPR [14]).

5.1.1.1.2 Use case

As two actors are involved in this scenario, we are going to introduce two separate use cases.

5.1.1.1.2.1 ER doctor use case

ID and name UC-1.1 Collect patient’s clinical information, for treating an emergency

Description A patient has to be treated in the ER for an emergency. The doctor needs to retrieve the biographical data and blood tests of the patient to understand how to treat him

Primary actor ER doctor

Secondary actors Patient

Trigger Emergency happens

Precondition PRE-1: Doctor is authenticated
PRE-2: Doctor is authorized to request patient’s data

Postconditions -

Normal flow 1.1.0 Collect patient’s info of a patient with given SSN
1. Doctor enters the patient’s SSN
2. System retrieves all the information available for the specified patient (biographical data, blood tests)
3. System displays data
4. Doctor can browse through the retrieved material

Alternative flows -

Exceptions 1.1.E1 Patient is not registered to the hospital, since he never visited it
1. Doctor enters the patient’s SSN
2. System displays message: no data for the selected patient
3. System terminates the use case


Other information -

Assumptions -

Table 2 - Use case 1.1.- Collect patient’s clinical information, for treating an emergency

5.1.1.1.2.2 Researcher use case

ID and name UC-1.2 Collect average values of specific blood test components for a specific age range, for research purposes

Description A researcher wants to study the distribution of a specific blood test component over age ranges. He asks to retrieve the pseudonymized patients’ data

Primary actor Researcher

Secondary actors -

Trigger Research request is issued

Precondition PRE-1: Researcher is authenticated
PRE-2: Researcher is authorized to request (pseudonymized) patients’ data

Postconditions -

Normal flow 1.2.0 Collect blood test components
1. Researcher enters the name of the blood test component (e.g., cholesterol) and the age range (e.g., 35-60)
2. System retrieves the average value of the specified blood test component
3. System displays data

Alternative flows -

Exceptions 1.2.E1 No data available for the selected age range or blood test component
1. Researcher enters the name of the blood test component (e.g., cholesterol) and the age range (e.g., 35-60)
2. System displays message: no data for the selected parameters
3. System terminates the use case

Other information -

Assumptions -

Table 3 - Use case 1.2.- Collect average value for research purposes


5.1.1.2 Data sources

In this scenario we consider the hospital data sources that are useful for:

● clinicians, to perform their diagnostic analyses;
● researchers affiliated with the hospital, to perform their research activities.

The considered data sources are two, as follows:

● Patients’ biographical data. Biographical data are stored in a relational database (e.g., a MySQL instance) and contain the profile information of the patient (name, surname, address etc.);
● Patients’ blood tests. Blood tests are stored in an object storage (e.g., a Minio instance) and contain the measured values for the identified blood parameters. Not all the blood tests measure the same blood parameters, as doctors prescribe several forms of blood tests based on the type of pathology one wants to investigate.

These data are currently generated synthetically, based on the distributions of the Italian population (geographic distribution, age distribution, gender distribution etc.), so that in this very first experimentation phase real sensitive data remain untouched and realistic data can still be processed by the DITAS framework in a compliant manner.

5.1.1.3 VDC blueprint strategy

A single VDC based on Spark is used in this scenario.

The developer of the medical application has to implement the client of the VDC CAF for the following methods:

1. Given the SSN, return all the biographical data of the patient
2. Given the SSN, return the last values for all the blood test components
3. Given the SSN and a specific blood test component (e.g., cholesterol), return the time series of the values of this component for the given patient

The developer of the researcher application has to implement the client of the VDC CAF for the following method:

1. Given a specific blood test component (e.g., cholesterol) and a range of age (e.g., 35-60), return the average value of this component.
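As an illustration, the four methods above map naturally onto simple HTTP calls against the VDC CAF; the host, port, paths and parameter names below are assumptions made for the example, not the blueprint’s actual exposed interface.

# Medical application (hypothetical endpoints; <host> and <ssn> are placeholders):
curl "http://<host>:3000/patient-details?SSN=<ssn>"
curl "http://<host>:3000/blood-tests/last?SSN=<ssn>"
curl "http://<host>:3000/blood-tests/timeseries?SSN=<ssn>&component=antithrombin"

# Researcher application (hypothetical endpoint):
curl "http://<host>:3000/blood-tests/average?component=cholesterol&minAge=35&maxAge=60"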

5.1.2 Scenario 2

In the following, we present the contextualization, data sources and VDC strategy for Scenario 2 of the eHealth use case, as presented in deliverable D5.1 [2].

5.1.2.1 Contextualization

In this scenario, two hospitals belonging to the same hospital group are considered. We assume that the hospital group owns a trusted cloud environment, where patients' data can be stored safely whenever there is the need to save space in the hospital environment.


Figure 13 – Doctor may be in either of the hospitals

Patients' data are distributed over:

● The Edge infrastructure of hospital 1 and hospital 2, depending on whether the patient is hospitalized in either of them;

● The Cloud environment of the hospital group, whenever data are moved to the cloud to save space in the hospital environment.

In the following, we introduce the case study and related use case that could take advantage of the DITAS framework to help the medical doctor treat an emergency situation.

5.1.2.1.1 Case study

In this section, we discuss the case study in which a patient has an emergency and, in order to treat it, the doctor needs to collect data from both hospitals.

Paolo Galli has a stroke. He is being brought by ambulance to Plainsboro Hospital (which belongs to the Kelso Group of hospitals and research centers in Italy, also containing Sacro Cuore Hospital).

Giacomo Galimberti, a doctor from the Plainsboro Hospital ER, is waiting for the ambulance to arrive, and decides to look into Paolo's EHR (as collected from all the hospitals in Kelso Group and the remote storage in the hospital group cloud). Since the emergency team already visited the patient and already reported some data, Giacomo has an idea about Paolo's situation (suspecting a stroke), and thus he would like to visualize the data in Paolo's EHR that he thinks are related to the "stroke" pathology. Specifically, he decides to visualize:

● biographical data;
● blood tests for the last 60 days;
● echocardiograms for the last 60 days;
● details on previous surgeries related to the "stroke" pathology.

Giacomo starts to browse through the retrieved information. By first consulting the description of previous surgeries, Giacomo discovers that there could be a reason for a stroke to happen, related to a health problem from the past: two years ago, a cardiac mechanical valve was implanted in Paolo and he had to undergo exams (e.g., blood tests, echocardiogram) and a surgery in Sacro Cuore Hospital. The stroke could be related to a blood clot dislodged from Paolo's cardiac mechanical valve. Thus, Giacomo looks at additional information, composed of:

1. The last echocardiogram that Paolo took before the surgery;


2. The first echocardiogram that Paolo took after the surgery.

Moreover, Giacomo looks thoroughly at Paolo's blood test history, being specifically interested in comparing several analyses that were performed while Paolo was hospitalized in Sacro Cuore Hospital:

1. The status of blood tests 5 days ago;
2. The status of blood tests 60 days ago;
3. The status of blood tests at the time of the surgery.

Paolo arrives at Plainsboro Hospital, and some routine checks are performed (e.g., generic blood tests [structured textual data], echocardiogram [image] and clinical assessment [unstructured textual report]). The analysis confirms what Giacomo suspected: Paolo has had an ischemic stroke.

Giacomo treats Paolo with appropriate drugs (i.e., anticoagulants). After that, Paolo is hospitalized.

5.1.2.1.2 Use case

In this section, we discuss the use case associated with the presented case study.

ID and name UC-2 Collect patient's clinical information, for emergency treatment purposes

Description A patient has to be treated in the ER for an emergency. The doctor needs to retrieve all the available data for the selected patient

Primary actor ER doctor from hospital in a group of hospitals

Secondary actors Patient

Trigger Diagnostic request is issued

Precondition PRE-1: Doctor is authenticated
PRE-2: Doctor is authorized to request patient's data

Postconditions -

Normal flow 2.0 Collect patient's info (based on SSN)
1. Doctor enters the patient's SSN
2. System retrieves all the information available in the EHR for the specified patient (biographical data, blood tests, echocardiograms, past surgeries for the specified pathology), collecting it from all the hospitals of the group and the hospital group cloud
3. System displays data
4. Doctor can either browse through the retrieved material, or select two exams of the same type (e.g., blood tests or echocardiograms) and visually compare their outcome

Alternative flows 2.1 Collect patient's info (based on SSN) and filter over exams collected during a specific time period
1. Doctor enters the patient's SSN and the time period of interest


2. System retrieves all the information available in the EHR for the specified patient (biographical data, blood tests, echocardiograms, past surgeries for the specified pathology) produced during the specified time period, collecting it from all the hospitals of the group and the hospital group cloud
3. System displays data
4. Doctor can either browse through the retrieved material, or select two exams of the same type (e.g., blood tests or echocardiograms) and visually compare their outcome

2.2 Collect patient's info (based on SSN) at a certain date
1. Doctor enters the patient's SSN and the date of interest
2. System retrieves all the information available in the EHR, collecting data from all the hospitals of the group and the hospital group cloud and reproducing the status of blood tests and echocardiograms at the specified date
3. System displays data
4. Doctor can either browse through the retrieved material, or select two exams of the same type (e.g., blood tests or medical images of the same type) and visually compare their outcome

2.3 Collect patient's info (based on SSN) and visualize the ones collected around a clinical event (e.g., a surgery)
1. Doctor enters the patient's SSN and the date of the clinical event
2. System retrieves the last exams (blood test and echocardiogram) before the clinical event and the first exams (blood test and echocardiogram) after the clinical event
3. System displays data

Exceptions 2.1.E1 Patient is not registered at any of the hospitals, since he never visited them and the emergency team did not register him
1. Doctor enters the patient's SSN
2. System displays message: no data for the selected patient
3. System terminates the use case

Other information -

Assumptions Since the hospitals are in the same group, a doctor employed by the hospital where the patient is admitted can access the patient's data stored in the other hospitals of the group. This can be done because, when an emergency occurs, the doctor can access the EHR [15] without problems. To handle data lawfully, it is thus necessary:

● to prepare a privacy policy for patients stating that the co-holders ["co-titolari"] of data and treatment are all hospitals and research centers in the group: in this way, data can be moved freely throughout the group and the different data sources. If the patient is unconscious, the privacy policy can still be applied


● to apply end-to-end encryption to data upon transfer (based on WP29, Data Portability), to guarantee data safety when data is transferred over an insecure channel

Table 4 - Use case 2: Collect patient's clinical information, for emergency treatment purposes

5.1.2.2 Data sources

In this scenario we consider the data sources in the hospital group that are useful for the medical doctor to treat an emergency situation, as discussed in the case study above.

The considered data sources are four, as follows:

● Patients’ biographical data. Biographical data are stored in a relational database (e.g., a MySQL instance) and contain the profile information of the patient (name, surname, address etc.);

● Patients' blood tests. Blood tests are stored in an object storage (e.g., a Minio instance) and contain the measured values for the identified blood parameters. Not all the blood tests measure the same blood parameters, as doctors prescribe several forms of blood tests based on the type of pathology one wants to investigate.

● Reports. Reports are unstructured documents reporting the outcome of a clinical event, e.g., a surgery.

● Clinical images. A clinical image is a “photograph” of some organs, taken for diagnostic purposes.

Moreover, in a separate location, the secrets linking patients' data in the hospital group cloud and the hospitals are stored (encrypted) in a relational database. This is discussed in more detail in the following sub-section (Secret sharing and linking data between hospitals and trusted cloud).

5.1.2.2.1 Secret sharing and linking data between hospitals and trusted cloud

European data protection law applies to “personal data” which is defined, in part, as “any information relating to an identified or identifiable natural person.” Data which has been anonymized is no longer “personal data” and is therefore not subject to the requirements of data protection law. Personal data that has been de-identified, encrypted or pseudonymized but can be used to re-identify a person remains personal data and falls within the scope of the law.

Personal data is subject to the protection requirements set out in the GDPR.

Article 4 of GDPR1 defines pseudonymization as:

"the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person"

1 http://data.consilium.europa.eu/doc/document/ST-5419-2016-INIT/en/pdf


To ensure compliance with this article, we adopt the following strategy. To simplify the presentation, we suppose the group of hospitals to be composed of two hospitals, namely Hospital A and Hospital B.

Figure 14 – Mapping secret IDs and anonymized data in Cloud

The adopted data storage strategy ensures that patients' identifiers are stored separately from what is replicated in the cloud environment:

1. Patients’ biographical data (that comprise SSN, name, surname, address etc.) are stored in hospital environment only, and not replicated in the cloud environment

2. Patients' clinical data (blood tests, clinical images and unstructured documents) are moved to the cloud environment after a while, to save space in the hospital premises

To identify patients in the hospital environments and the cloud environments:

1. A patient is identified in a hospital A in the group with his patient ID (say, id_A) and his SSN (say, SSN_p)

2. A patient is identified in a hospital B in the group with his patient ID (say, id_B) and his SSN (say, SSN_p)

3. A patient is identified in the cloud environment with his identifier (say, fake_id)

The exchange of information about the IDs of entries in the edge and cloud environments is shown in Figure 14.

In the group, we identify a specific hospital (say, hospital A) as Main Hospital.

The Main Hospital maintains on premises the secret binding the hospital entries and the cloud entries, that is:

1. Hospital A maintains the mapping between <SSN_p, fake_id> on premises, encrypted


2. Hospital B shares the key to that encryption, so that every time it needs to access a patient's data in the cloud environment, it:

a. Authenticates and accesses the dictionary containing the mappings between SSNs and cloud IDs

b. Queries the dictionary with SSN_p and retrieves fake_id

c. Accesses the cloud, retrieving all entries with identifier fake_id
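A minimal sketch of steps (a)-(c) is shown below, assuming the mapping is kept as an encrypted JSON dictionary and that a symmetric cipher (Fernet) is used. Both the storage format and the cipher are assumptions made only to illustrate the flow, and cloud_client is a placeholder for whatever client accesses the cloud entries.

```python
# Minimal sketch of the SSN -> fake_id lookup described above.
# The JSON dictionary, the Fernet cipher and cloud_client are assumptions
# made for illustration; the project may use different storage and ciphers.
import json
from cryptography.fernet import Fernet


def retrieve_cloud_entries(ssn_p: str, encrypted_mapping: bytes, shared_key: bytes, cloud_client):
    """Hospital B resolves the patient's cloud identifier and fetches the entries."""
    # (a) access the (encrypted) dictionary of <SSN, fake_id> mappings
    #     maintained on premises by the Main Hospital
    mapping = json.loads(Fernet(shared_key).decrypt(encrypted_mapping))
    # (b) query the dictionary with SSN_p and retrieve fake_id
    fake_id = mapping[ssn_p]
    # (c) access the cloud and retrieve all entries stored under fake_id
    return cloud_client.get_entries(identifier=fake_id)  # placeholder client call
```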

5.1.2.3 VDC blueprint strategy

A single VDC based on Spark is used in this scenario.

The secret mapping between patients' SSNs, IDs in the trusted cloud and IDs in the untrusted cloud (as mentioned in the Data sources section) is handled by the Main Hospital in the group, as all hospitals are co-holders of data.

The VDC will handle data movement across Edge and Cloud environments, in the following points:

1. Retrieving data from other hospitals and the trusted cloud whenever a patient comes with a request to treat an emergency (ensuring re-identification and encryption upon transferring);

2. Outsourcing data to the trusted cloud (ensuring pseudonymization and encryption upon transferring) when the hospital needs to free some space.

The developer of the medical application has to implement the client of the VDC CAF for the following exemplary methods:

1. Given the patient's SSN, return all the performed blood tests
2. Given the patient's SSN and a date, return the status of blood tests at that date
3. Given the patient's SSN and a period of time, return the blood tests executed in this period
4. Given the patient's SSN and a date, return the last blood test performed before that date
5. Given the patient's SSN and a date, return the first blood test performed after that date
6. Given the patient's SSN, return all the performed echocardiograms
7. Given the patient's SSN and a date, return the status of echocardiograms at that date
8. Given the patient's SSN and a period of time, return the echocardiograms executed in this period
9. Given the patient's SSN and a date, return the last echocardiogram performed before that date
10. Given the patient's SSN and a date, return the first echocardiogram performed after that date
11. Given the patient's SSN, return the list of surgeries performed in the past that are compliant with the tag "pathology" (e.g., "stroke")


12. Given the patient's SSN and a period of time, return the list of surgeries performed in the past that are compliant with the tag "pathology" (e.g., "stroke") and were performed in the specified period
13. Given the patient's SSN, return the patient's biographical data
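Many of the methods above are variations of the same date-based filtering (e.g., methods 3-5 and 8-10). The following Python helpers sketch that filtering, assuming exams are available as (date, record) pairs once they have been retrieved through the VDC; the data shape is an assumption made for illustration.

```python
# Illustrative helpers for the date-based methods above, assuming each exam is a
# (date, record) pair; the data shape is an assumption, not the VDC's actual model.
from datetime import date


def exams_in_period(exams, start: date, end: date):
    """Exams executed in a given period (methods 3 and 8)."""
    return [(d, r) for d, r in exams if start <= d <= end]


def last_exam_before(exams, when: date):
    """Last exam performed before a given date (methods 4 and 9)."""
    earlier = [(d, r) for d, r in exams if d < when]
    return max(earlier, key=lambda e: e[0], default=None)


def first_exam_after(exams, when: date):
    """First exam performed after a given date (methods 5 and 10)."""
    later = [(d, r) for d, r in exams if d > when]
    return min(later, key=lambda e: e[0], default=None)
```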

5.1.3 Scenario 3

In the following, we present the contextualization, data sources and VDC strategy for Scenario 3 of the eHealth use case, as presented in deliverable D5.1 [2].

5.1.3.1 Contextualization

This scenario extends Scenario 2: there are still two hospitals belonging to the same hospital group which can share data about their patients (as they are co-holders of data). Moreover, a third actor comes into play: a research center that reuses the data from the hospitals (previously anonymized so as to be compliant with data protection regulations) to perform some research.

Figure 15 – Scenario 3 actors

Patients' data are distributed over:

● The edge infrastructure of hospital 1 and hospital 2, depending on whether the patient is hospitalized in either of them;

● The edge infrastructure of the research center;
● The cloud environment of the hospital group, whenever the patient was discharged more than 30 days ago and data are moved to the cloud to save space in the hospital environment;
● The cloud environment of the research center;
● The untrusted cloud environment for researchers.

As this scenario shares part of the configuration of Scenario 2, the same case study (and related use case) presented for Scenario 2 can be used for Scenario 3 as well. Nevertheless, as a third actor is introduced in the picture, an additional case study (plus related use case) is introduced in the following.

5.1.3.1.1 Case study

Several research groups perform statistical research on the available clinical data, either coming from public sources or distributed by hospitals.


Research data are shared with research groups using an untrusted cloud, named the ResearchInCloud environment. Data are stored there in an encrypted, pseudonymized and minimized way.

Data in ResearchInCloud are produced and distributed as follows:

● Hospitals can decide to subscribe to the ResearchInCloud environment, and to share data for research purposes;

● Research centers can connect their private (trusted) cloud to ResearchInCloud, subscribe to some set of data (based on their research topic), and mirror data from ResearchInCloud to their private cloud.

The hospitals of Kelso Group are actually affiliated with ResearchInCloud.

Stroke may have a correlation with nutritional factors, e.g., high cholesterol levels or obesity2.

Thus, a research group called San Basilio (which is not affiliated with the group of hospitals Kelso Group) wants to perform a study on the impact that cholesterol has on stroke, to understand to which extent high cholesterol levels and obesity are correlated with stroke cases. The researcher assigned to this project is Vittoria Croci.

To perform the study, Vittoria would like to sample the distribution of Body Mass Index (BMI) values versus the cholesterol level from the following sub-populations:

● Overall population, people that did not have a stroke
● Patients with stroke

Vittoria can access:

● The data from ResearchInCloud (with authentication). The blood test values (cholesterol) and anthropometric indices (weight, height) of those hospitals that are affiliated with the ResearchInCloud provider, which are minimized for data protection purposes, and thus restricted only to: gender, age, height, weight, an indication of stroke/no-stroke presence (automatically computed from the ICD codes found in the untrusted clouds for each entry), and cholesterol values;

● The San Basilio cloud. The NutriChol dataset (stored there by San Basilio and not coming from ResearchInCloud), which describes the overall healthy population based on: gender, age, BMI, cholesterol values.

These data are useful for Vittoria to perform her study. Specifically, she would like to study separately:

● Stroke-related data. Vittoria wants to access the history of patients' exams (ResearchInCloud), to extrapolate the results of the most recently performed blood tests (specifically, cholesterol values) and anthropometric indices (BMI computation, based on gender) for patients that had a stroke.

● Non-stroke-related data. Vittoria wants to access the history of patients' exams (ResearchInCloud), to extrapolate the results of the most recently performed blood tests (specifically, cholesterol values) and anthropometric indices (BMI computation, based on gender) for patients that did NOT have a stroke and still had a cholesterol analysis done. However, cholesterol analysis is not something that all patients do, and Vittoria would like to have a larger sample of data to support the study on the overall population. Thus, she decides to integrate the population of non-stroke patients with the data in the NutriChol dataset.

2 Correlation between high cholesterol, obesity and stroke: https://www.mayoclinic.org/diseases-conditions/stroke/symptoms-causes/syc-20350113

To perform her research, Vittoria:

1. Queries the system to get from the hospitals in the group the BMIs and cholesterol values of patients with stroke, and computes
   a. The distribution "BMI-vs-cholesterol" over the population with stroke
   b. The count of people with "obese" BMI classification over the population with stroke
2. Queries the system to get the BMIs and cholesterol values of healthy people (as a combination of the hospitals' healthy patients and the NutriChol dataset), and computes
   a. The distribution "BMI-vs-cholesterol" over the healthy population
   b. The count of people with "obese" BMI classification over the healthy population
3. Performs the following analysis
   a. Comparison of the obese people count over the two populations
   b. Comparison of the distributions "BMI-vs-cholesterol" of the two populations, to see if there is a correlation (e.g., linear correlation3) between the two

5.1.3.1.2 Use case

ID and name UC-3 Collect BMI-cholesterol distributions of individuals, for research purposes

3 It may be computed, e.g., via the Pearson correlation coefficient: https://en.wikipedia.org/wiki/Pearson_correlation_coefficient


Description A researcher wants to collect BMI-cholesterol values for a population of individuals, given some filter (e.g., only patients that have/do not have a specific pathology)

Primary actor Researcher

Secondary actors -

Trigger A new research study is requested

Precondition PRE-1: Researcher is authenticated

Postconditions -

Normal flow 3.0 Collect BMI-vs-cholesterol data of healthy sub-population
1. Researcher enters the filter "healthy"
2. System retrieves data (height, weight, cholesterol value, gender, age) from the cloud of researchers (patients without stroke) and the cloud of the research center (nutritional dataset describing the overall healthy population)
3. System computes the BMI for each retrieved entry
4. System returns the data of the patients in the form [gender, age, BMI, cholesterol]

Alternative flows 3.1 Collect BMI-vs-cholesterol data of "stroke" sub-population
1. Researcher enters the filter "stroke"
2. System retrieves data (height, weight, cholesterol value, gender, age) from the cloud of researchers (patients with stroke)
3. System computes the BMI for each retrieved entry
4. System returns the data of the patients in the form [gender, age, BMI, cholesterol]

Exceptions 3.0.E1 Data are not available for healthy patients
1. Researcher enters the filter "healthy"
2. System displays message: no data for healthy patients
3. System terminates the use case

3.1.E1 Data are not available for stroke patients
1. Researcher enters the filter "stroke"
2. System displays message: no data for the selected pathology
3. System terminates the use case

Other information -

Assumptions We assume that every time a patient undergoes a new blood test, data is pseudonymized and replicated on the private cloud of the hospital group, and then further replicated in the untrusted cloud dedicated to research. Generally speaking:


● Based on "art. 5, comma 4, lettera b, considerando 50 GDPR", the usage of data for research purposes is always compatible
● The GDPR says that, for security reasons, it is enough to perform pseudonymization if the cloud is secured behind the hospital group premises, while it is required to fully mask data to ensure safety when the storage premises are untrusted
● Based on Italian legislation ("art 110 bis codice privacy italiano", as of 27 February 2018), data used for research are required to be anonymized. This article will probably be deleted when the GDPR is put in place in Italy too, since it does not conform with the GDPR

Table 5 - Use case 3: Collect BMI-cholesterol distributions of individuals, for research purposes

5.1.3.2 Data sources

In this scenario we consider the same resources and data sources as in Scenario 2.

In addition, we consider:

● Public nutritional database. This dataset contains information (i.e., gender, age, cholesterol value, height, weight) about healthy patients, and can be used to perform nutritional research on the cholesterol distribution in populations.

● Minimized dataset of patients. This dataset, in the cloud of researchers, is a view of patients’ data, encrypted and restricted to: height, weight, cholesterol value, gender, age.

5.1.3.2.1 Secret sharing and linking data between hospitals and untrusted cloud

Article 5 of GDPR4 requires that:

"Personal data shall be adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed ('data minimisation')"

Moreover, Recital 26 of GDPR requires that

"The principles of data protection should [...] not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable"

4 http://data.consilium.europa.eu/doc/document/ST-5419-2016-INIT/en/pdf

Regulators and courts interpreting these terms have set a high bar for what qualifies as fully anonymized data under current data protection law based on the 1995 Data Protection Directive. In the following, we describe the strategy used to comply with these articles. However, the GDPR provides an opportunity to examine this scenario anew. While the GDPR appears to retain a similarly high standard for anonymity, it also suggests an openness to a more flexible approach that puts more focus on context and reasonableness. The GDPR provides some additional guidance in its recitals. For instance, Recital 26 in the GDPR is more expansive than the equivalent recital in the 1995 Data Protection Directive. It reads, in part:

"To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly. To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments."

The second sentence in this language is new and suggests that many factors must be considered in determining the likelihood that an anonymization method would be reversed. In particular, the reference to "all objective factors" must be read as including the context of the processing. And in real-world scenarios, that context necessarily includes factors such as the methodology employed, whether the data is closely held within a data controller or is publicly released, and the additional safeguards designed to prevent identification of individuals from the anonymized data set. Collectively, the consideration of "all" such factors suggests a "reasonableness standard" rather than the "impossibility standard" that seems to have taken hold under current law.

Further, the GDPR contains new provisions that recognize differing intermediate levels of de-identification. Several provisions include an explicit recognition of pseudonymization as a method of reducing risk. Additionally, Articles 11 and 12 refer to a level of de-identification that falls short of full anonymization, but enables the data controller to "demonstrate that it is not in a position to identify the data subject." Collectively, the provisions of the GDPR reflect a recognition that there is a spectrum of de-identification. These updates to the law provide an opportunity for a more flexible and nuanced approach across the full spectrum of de-identification, including where to draw the line between personal data and anonymous data, taking into account context and safeguards.

Under such an approach, it should be possible to conclude that in at least some contexts, data anonymized and used for research purposes can still be considered anonymous even when the controller retains the original data set. The anonymized data set is not released publicly or widely shared, a robust anonymization method is used that has been vetted by experts in the field, and strong safeguards are in place to keep the data set separate and otherwise prevent the identification of data subjects from the anonymized data set. Such an interpretation and approach will encourage research that will inevitably result in enormous benefits to public health and welfare5.

To comply with data minimization, we remove every identifier from the public cloud where data for researchers are stored, since identifiers are not needed by researchers to carry out their work.

5 European Legal Requirements for Use of Anonymized Health Data for Research Purposes by a Data Controller with Access to the Original (Identified) Data Sets [white paper]: https://iapp.org/media/pdf/resource_center/PA_WP1-Anonymized-and-original-data-sets.pdf
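As an illustration of the minimization step described above, the following sketch keeps only the attributes needed by researchers and drops every identifier before an entry is replicated to the untrusted cloud; the field names are assumptions made for illustration.

```python
# Sketch of the minimization step; field names are illustrative assumptions.
RESEARCH_FIELDS = {"gender", "age", "height", "weight", "cholesterol", "stroke"}


def minimize_for_research(entry: dict) -> dict:
    """Return a copy of the entry restricted to the research attributes only."""
    return {key: value for key, value in entry.items() if key in RESEARCH_FIELDS}


# Example: SSN, name and address are dropped, the research attributes survive.
record = {"ssn": "XYZ123", "name": "Paolo", "gender": "M", "age": 63,
          "height": 178, "weight": 92, "cholesterol": 240, "stroke": True}
print(minimize_for_research(record))
```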


To guarantee that research data in the public (untrusted) cloud are maintained as fresh as possible, we ought to update them every time new data are available in the hospital group (trusted) cloud.

Thus, as in the case of the trusted cloud of the group, we appoint a hospital in the group to be the Main Hospital that maintains (encrypted) the mapping between the patient's ID in the hospital and the fake ID in the cloud environment. For details, see the previous sub-section ("Scenario 2").

In the end, the information stored by the Main Hospital is as follows: <id_hospital, id_trusted_cloud, id_untrusted_cloud>

5.1.3.3 VDC blueprint strategy

A single VDC based on Spark is used in this scenario.

The hospital group participates in diagnostic activities and research activities. Thus, the VDC is supposed to control all data in the hospital group itself, and outsource some of the data to the untrusted cloud for researchers.

The secret mapping between patients' SSNs, IDs in the trusted cloud and IDs in the untrusted cloud (as mentioned in the Data sources section) is handled by the Main Hospital in the group, as all hospitals are co-holders of data.

The VDC will handle data movement across Edge and Cloud environments, in the following points:

1. Retrieving data from other hospitals and the trusted cloud whenever a patient comes with a request to treat an emergency (ensuring re-identification and encryption upon transferring);

2. Outsourcing data to the trusted cloud (ensuring pseudonymization and encryption upon transferring) when the hospital needs to free some space;

3. Outsourcing data to the untrusted cloud (ensuring minimization, anonymization, encryption upon transferring and storage of encrypted data) for research purposes.

The research center uses publicly available data to perform its research activities. Thus, the VDC is supposed to control all data useful for its research.

The VDC will handle data movement across cloud environments, in the following points:

1. Retrieving data from the untrusted cloud for researchers (ensuring encryption upon transferring and decryption in loco);

2. Merging the data retrieved from the untrusted cloud with some nutritional datasets in the trusted cloud of the center (ensuring aggregation to compute aggregate fields and integration of the datasets).

5.1.3.3.1 Methods

The developer of the medical application has to implement the client of the VDC CAF for the same methods presented for scenario 2.

The developer of the researcher application has to implement the client of the VDC CAF for the following method:


1. Return a set of entries in the form [gender, age, BMI, cholesterol], one for each patient tagged with a certain label (stroke/no-stroke)
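A possible shape of this method is sketched below, assuming the minimized records carry gender, age, height (cm), weight (kg), cholesterol and a stroke/no-stroke label; the field names and units are assumptions made for illustration.

```python
# Sketch of the researcher-side method; record fields and units are assumptions.
def bmi(weight_kg: float, height_cm: float) -> float:
    """Body Mass Index: weight divided by the squared height in metres."""
    height_m = height_cm / 100.0
    return round(weight_kg / (height_m ** 2), 1)


def entries_for_label(records, label: str):
    """Return [gender, age, BMI, cholesterol] for every patient tagged with the label."""
    return [
        [r["gender"], r["age"], bmi(r["weight"], r["height"]), r["cholesterol"]]
        for r in records
        if r.get("label") == label  # e.g., "stroke" or "no-stroke"
    ]
```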

5.1.4 Application

The application(s) that could be paired with the VDCs designed for each of the three scenarios are the ones addressing the specified use cases (UC-1.1 and UC-1.2 for Scenario 1, UC-2 for Scenario 2, UC-3 for Scenario 3).

5.2 IDEKO use case

Based on what was presented in D5.1 (Specification of integration and validation testbed and cloud platform for DITAS) [2], we will give further details on what is going to be developed to demonstrate the framework.

5.2.1 Contextualization

The use case will develop an advanced technical service application that improves the way machine incidents are handled when they occur. The use case will put the focus at plant level. Basically, the application will retrieve and process data from a customer plant and, in principle, the data will be isolated from other customer plants. The use case will define a scenario where the plant has 3 machines.

Currently, when an incident occurs, the customer gets in touch with the technical service of the machine manufacturer. The machine manufacturer tries to identify the failure based on the following aspects:

● The description the customer gives of the incident
● The values of key indicators he gets from the machine:
    ○ Remotely connecting to the machine using TeamViewer-like tools
    ○ Asking the customer to look at them
● Previous experience with similar failures

Once the origin is detected, if the fix cannot be applied remotely, an engineer must travel to the customer's shop floor, where he inspects the machine, confirms the fault and tries to repair it. If that person's knowledge is not suitable for making the repair, usually because the initial data for failure evaluation was not enough to select the proper person, another person must travel. This strategy results in delays both in detecting the origin of the incident and in repairing it.

Customers are demanding a more real-time approach, where the manufacturer can detect critical events in real time.

On the other hand, at the moment, every machine sold by IDEKO's industrial partner DANOBATGROUP is sold with the Smart Box attached. The box automatically enables cloud data storage and edge computing. As such, this is a suitable scenario for developing high added-value data-based applications.

Moreover, some critical components like spindles and axes are present in every machine on the market, and there are only a handful of different models of each. So, it seems to be a promising scenario for machines to learn from the experience of others.

In summary, the present-future scenario has the following characteristics:

● Lots of machines with Smart Boxes attached


● Indicators (data) currently being compiled
● Local storage and computation enabled in every box
● Quite similar critical components in every machine

This is an appropriate scenario for developing high added-value applications. The proposed technical service application will help address common technical service drawbacks. The app will make it possible for the technical service to have a detailed view of the state of the machine when it is needed. The main objectives to achieve are the following:

● Reduce diagnostics time
● Decrease response time
● Decrease travel costs
● Prevent failures of other machines

5.2.2 Machine simulator

In order to control the data the machine sends and demonstrate the validation of the use case, we cannot rely on real machines working while waiting for an incident to occur. To control the scenario, a Java-based Machine Data Simulator has been developed for this project and use case. This simulator behaves like a real machine, but in a self-controlled scenario where data anomalies can be triggered on demand. The simulator replays previously recorded data from a real machine, so its output closely matches that of a real machine.
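The real simulator is implemented in Java; the following Python sketch only illustrates the replay idea: previously recorded samples are re-emitted with their original cadence and an anomaly can be injected on demand. The file format (CSV with a timestamp column) and field names are assumptions made for illustration.

```python
# Illustrative replay sketch; the actual simulator is Java-based and its recording
# format is not shown here. A CSV file with a "timestamp" column is assumed.
import csv
import time


def replay(recording_path: str, emit, speedup: float = 1.0, anomaly=None):
    """Replay recorded rows; 'emit' receives each row, 'anomaly' may alter rows on demand."""
    previous_ts = None
    with open(recording_path, newline="") as f:
        for row in csv.DictReader(f):
            ts = float(row["timestamp"])
            if previous_ts is not None:
                time.sleep(max(ts - previous_ts, 0) / speedup)  # keep the original cadence
            previous_ts = ts
            if anomaly is not None:
                row = anomaly(row)  # e.g., raise a vibration indicator above its threshold
            emit(row)
```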

In manufacturing, each machine has different types of indicators, as each numerical control (CNC [17]) manufacturer defines its own standard. So, to ease the understanding and the implementation of the use case, a simulator for a SORALUCE FS 6801 machine will be used with a fixed set of indicators.

Figure 16 – Soraluce FS milling boring machine

The following table contains the indicator list configured for this machine:


Indicators

System_Pilz_StopService  System_IOLINK_CoolerTemp  Spindle_speedCommanded_rpm_d1000
System_Pilz_InternalStop  System_IOLINK_UnlockingPressure  Axis_FeedRate_actual
System_Pilz_SlidingDoor  System_Pilz_XAxis  System_DAS_EnableGeneral
System_Pilz_Bumper  System_IOLINK_CoolerFlow  System_DAS_EnableActuadorX
System_Pilz_EmergencyPushbutton  System_IOLINK_HeadNeumaticPressure  System_DAS_EnableActuadorY
System_Pilz_FrontDoor  Machine_isON  Cnc_IsInternalCoolantActive
System_Pilz_ZAxis  Cnc_MachineRuningTime  Cnc_IsExternalCoolantActive
System_Pilz_SPAxis  Cnc_NcUpTime  System_DAS_isInterpolacionGanancias
System_isConveyor3on  System_Sensors_TemperatureAtmosphere_degreeCelsius_d10  System_DAS_PotenciaSpindle
System_Pilz_YAxis  Spindle_IsUniversal  System_DAS_FrecuenciaChatter
System_Pilz_BackDoor  System_IOLINK_PneumaticPressure  System_DAS_Chatter
System_isConveyor2on  Machine_Environment_temp  System_DAS_ErrorDriverX
System_DAS_GananciaActualY  System_IOLINK_Counterweight  Cnc_IsAnyCoolantActive
System_DAS_SeveridadActualX  Axis_Z_positionActualMCS_mm_d10000  System_DAS_ErrorDriverY
System_DAS_ConsignaActualX  Axis_Y_positionActualMCS_mm_d10000  Cnc_Tool_Index
System_DAS_GananciaActualX  Axis_Table1B_deg_d10000  Spindle_Caña_W_mm_d10000
System_DAS_TemperaturaY  Axis_X_positionActualMCS_mm_d10000  Alarms_IsAnyAlarmActivated_TR
System_DAS_ConsignaActualY  Spindle_Temperature_degreeCelsius_d10  Axis_X2_power_percent
System_isPneumaticsOn  Axis_Table1W_position_mm_d10000  Spindle_IsTurning
System_isConveyor1on  Axis_Z_power_percent  Spindle_IsAutomatic
System_isChillerOn  Spindle_C_deg  Cnc_IsCyclePaused
System_DAS_SeveridadActualY  Axis_Y_temperature_degreeCelsius_d10  Spindle_A_Deg
System_IOLINK_HydraulicPressure  Axis_Y_power_percent  Cnc_IsAutomaticModeActive
System_isHydraulicsOn  Axis_X1_power_percent  Axis_Table1_U
Cnc_MachineUpTime  Axis_Z_temperature_degreeCelsius_d10  Cnc_Tool_Life
Machine_In_Head_Positioning_RT  Cnc_Override_Spindle  Spindle_IsDirect
Axis_X_FeedRate_actual  Cnc_Override_Axis  System_GavazziCabinet_VAR_Instantaneous
Machine_In_Head_Change_RT  Cnc_Tool_Number_RT  System_GavazziCabinet_WT_Apparent_Power_total
Machine_In_Tool_Call_RT  Cnc_Program_BlockNumber_RT  System_GavazziCabinet_WT_Instantaneous_active_power_total
Axis_Y_FeedRate_actual  Cnc_Program_Name_RT  System_GavazziCabinet_Hours_counter
System_IOLINK_HeadCoolerWaterTemperature  Spindle_Power_percent  System_GavazziCabinet_Total_kWh_plus
Cnc_OperationMode  Spindle_speedActual_rpm_d1  System_GavazziCabinet_PF_Instantaneous
Cnc_SpindleRuningTime  System_DAS_AmplitudChatter  System_DAS_TemperaturaX
System_IOLINK_HeadCoolerWaterFlow  Axis_FeedRate_commanded  Cnc_IsManualModeActive
Axis_Z_FeedRate_actual  Cnc_IsCycleOn_RT

Table 6 - Indicators for a Soraluce FS machine.

As the use case will be demonstrated using 3 machines, three different two-month periods of data are going to be simulated by the Machine Simulator. This way the use case will simulate a single machine model operating in different ways.

The Machine Simulator will be deployed inside the Smart Box. See the next sections for more details.

5.2.3 Hardware architecture

The Smart Boxes are the edge nodes for IDEKO's day-to-day application deployments. However, since the Smart Boxes are production-ready devices that are fully managed by a global infrastructure, 3 real Smart Boxes and 3 industrial PCs with a 1-to-1 relation will be used in order to have full control over the edge devices in this project and to enable installing and moving components on demand without affecting the global infrastructure.

Each Smart Box will read data from a machine simulator and will have an industrial PC associated with it, that is, any DITAS component will be deployed on the associated PC instead of the machine itself. In summary, the edge nodes for the use case will be 3 industrial PCs. The following image describes the architecture in detail.


Figure 17 – IDEKO’s use case hardware architecture

The diagram represents the 1 to 1 relation between the Smart Boxes and the edge devices with some more details.

5.2.4 Data sources

First of all, the following high-level diagram shows the data sources for the use case:

Figure 18 – IDEKO’s use case data sources6

The data sources are represented in green, blue and orange. Note that there is a common data source in the cloud and there are also per-box data sources. The next sections provide further details for each data source.

6 NOTE: M2C stands for “Machine 2 Cloud”, a Savvy protocol to send data from the box to the cloud database.


5.2.4.1 InfluxDB

Every box will have an InfluxDB instance that collects data for a predefined time window. Due to the limited disk space available in the box, the InfluxDB instance will automatically delete old data to keep the amount of data stored within a controlled size.

For every machine we are saving the values of all the indicators, so a time series database fits this scenario perfectly. Time series are simply measurements or events that are tracked, monitored, downsampled or aggregated over time.
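One way the bounded time window could be enforced is through an InfluxDB retention policy. The sketch below uses the InfluxDB 1.x Python client; the host, database name and the two-week duration are assumptions for illustration, not the values actually configured in the Smart Box.

```python
# Sketch only: enforcing the in-box time window with an InfluxDB 1.x retention policy.
# Host, database name and duration are assumptions, not the Smart Box configuration.
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="machine_indicators")
client.create_retention_policy(
    "two_weeks",                      # policy name
    duration="14d",                   # keep two weeks of samples
    replication="1",
    database="machine_indicators",
    default=True,                     # older points are dropped automatically
)
```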

5.2.4.2 Cloud API

JSON REST endpoints served from the cloud to get both historical and near real-time data. Data produced in the machines is stored in Savvy's private cloud database. This API gives consumers access to this data. IDEKO has no direct access to the cloud database, so the API is the only way to get data out of it.

Note that the database stores machines' historical data since the box was configured. Data is not removed.

The following illustrates the data flow from the machine to the cloud database and the data gathering through the API.

Figure 19 – Data flow for the Cloud API

5.2.4.3 In-box API

REST API served directly from the box that allows gathering real-time data and some other details right at the edge.

5.2.5 Virtual Data Containers for the use case

The following sections define everything related to the VDCs for the use case: the strategy, the exposed methods and some other important information.

5.2.5.1 VDC blueprints strategy

When defining the VDC blueprint strategy, two suitable possibilities arise:

● One single VDC blueprint for the whole manufacturing plant where the BP manages all the data sources


● One VDC BP per machine where each BP has the three data sources defined in the Data sources section above.

The following diagram represents the two approaches side by side:

Figure 20 – IDEKO’s use case VDC strategy approaches

For this use case, the proposed strategy is based on one VDC blueprint per machine where each VDC blueprint will have the 3 data sources attached.

- Why not the other approach, one VDC blueprint with all the data sources on it?

Because we are working at plant level. The application will focus on the machines that belong to a single plant. The number of machines in the plant may vary over time. If there were a single VDC, a new machine would have to be added to an already deployed VDC. While this is perhaps technically possible, several VDCs offer more flexibility and simplicity in the integration of new machines. With the strategy proposed in this document, if a new machine enters the shop floor, a new VDC blueprint with the 3 data sources is created and the corresponding VDC can be deployed in a standalone way.

NOTE: the VDC blueprints for this use case can be found at https://github.com/DITAS-Project/blueprint/tree/master/ideko (work in progress)

5.2.5.2 VDC Methods

The following are the tentative methods to be exposed by the Data Administrator and defined in the blueprints:

1. GetHistoricalData (data origin: Cloud API)
Parameters: location - the location id; machine - the machine id; [group] - optional capture group id; [indicator] - optional indicator id; from - from timestamp in milliseconds; to - to timestamp in milliseconds
Returns the corresponding data values for the given parameters. The from and to parameters are timestamps in milliseconds.
Example call: /caf/GetHistoricalData?location=E3L3&machine=GXS_DSFSEJ_2&group=C5FE6D&from=1531982561113&to=1531982621113

2. GetStreamingData (data origin: Cloud API)
Parameters: machines - comma-separated list of machine ids
Returns the streaming data for the given parameters. machines can be a comma-separated list of machine ids.
Example call: /caf/GetStreamingData?machines=GXS_DSFSEJ_2

3. GetIndicatorsData (data origin: InfluxDB / In-box API)
Parameters: indicators - comma-separated list of indicator ids; from - from timestamp in milliseconds; to - to timestamp in milliseconds
Note: as we are assuming a 1:1 relation between a box and a machine, the machine id does not need to be passed as an argument.
Returns the data and the human-readable names for the given indicator list in the given date range. The from and to values must be timestamps in milliseconds. Indicator values are gathered from the InfluxDB instance deployed in the Smart Box. The human-readable names are gathered from the in-box API.
Example call: /caf/GetIndicatorsData?from=1531811400000&to=1531811700000&indicators=I_CNK_X1Z8SG_BT67AW,I_CNK_X1Z8SG_LPTD3K,I_CNK_X1Z8SG_VN7858

Table 7 - IDEKO's use case blueprint methods

The above method list does not yet make full use of the 3 available data sources. As the development of the use case moves on, these methods may be modified to include other data sources.
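For illustration, a consumer calling the GetHistoricalData method through the VDC CAF could look like the sketch below; the base URL and the response shape are assumptions, while the path and parameters follow the table above.

```python
# Sketch of a consumer calling GetHistoricalData via the VDC CAF.
# The base URL and the JSON response shape are assumptions for illustration.
import requests

CAF_BASE_URL = "http://vdc-machine-1.example.ideko.es"  # hypothetical CAF address

params = {
    "location": "E3L3",
    "machine": "GXS_DSFSEJ_2",
    "group": "C5FE6D",               # optional capture group id
    "from": 1531982561113,           # timestamps in milliseconds
    "to": 1531982621113,
}
response = requests.get(f"{CAF_BASE_URL}/caf/GetHistoricalData", params=params)
response.raise_for_status()
for sample in response.json():       # response assumed to be a list of samples
    print(sample)
```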

5.2.6 Application

As stated before, the application will be focused on the Technical Service team, aiming to give this team better insights into what is happening in the machine. The initial objective is to be able to identify anomalous behaviors by analyzing machine data. Once an anomaly is identified, the objective is to provide a clear, dashboard-like view of the status of the machine at that very moment in order to debug the origin of the anomaly.

The above scenario depicts a decoupled-from-machine application, that is, something happens on the machine, it is identified in near real time and then some analysis is triggered. A technical service operator can debug the related event and machine status post-mortem. However, detecting anomalies in real time is very powerful and, for example, the machine operator could be notified if an event like this happens. Notifying the operator in real time is tricky, as there are many different CNC7 control brands and the application integration mechanisms for each differ a lot.

In summary, the application will be composed of the following main pillars:

● Segmentation process
● Anomaly detection system
● Anomaly analysis system
● Dashboard for DANOBATGROUP's Technical Service team

7 https://en.wikipedia.org/wiki/Numerical_control


5.2.6.1 Segmentation process

To analyze the machine behavior when several operations are performed for the same manufacturing process, it is mandatory to divide the whole process into operations and compare them across different process executions.

This task will segment the incoming machine data into machine operations. This segmentation is based on the following idea: machine-tool production is determined by CNC programs being executed by the machine user. The data sensing infrastructure captures the program being executed at each instant. These programs are composed of a sequence of orders called blocks, which are identified by their position in the program source code. The number of the block being executed at each instant is also available. Unfortunately, these blocks usually represent very short processes, so studying the machine data for each block does not usually provide much information. This use case proposes an algorithm that segments the machine data into operations, by grouping together execution blocks that belong to the same specific operation (drilling, milling-boring, roughing, ...).

The following flow diagram describes the process steps:

Figure 21 – Segmentation process flow diagram

The segmentation operates on the configured machine. The segmentation process will run inside the edge nodes (see subsection 5.2.6.6 - Deployment diagram), so there is effectively a 1-to-1 relation between segmentation processes and machines.
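A much simplified sketch of the segmentation idea is shown below: consecutive samples whose CNC block number falls inside the same block range are grouped into one operation. The block ranges, operation names and sample fields are illustrative assumptions; the actual algorithm developed for the use case is richer than this.

```python
# Simplified segmentation sketch; block ranges, operation names and sample fields
# are assumptions, not the real algorithm or data model of the use case.
OPERATION_RANGES = [                 # (first_block, last_block, operation name)
    (10, 120, "roughing"),
    (121, 300, "drilling"),
    (301, 480, "milling-boring"),
]


def operation_of(block: int):
    """Map a CNC block number to the operation it (hypothetically) belongs to."""
    for first, last, name in OPERATION_RANGES:
        if first <= block <= last:
            return name
    return None


def segment(samples):
    """Group a stream of {'program', 'block', ...} samples into operation segments."""
    segments, current = [], None
    for s in samples:
        op = operation_of(s["block"])
        if current is None or current["operation"] != op or current["program"] != s["program"]:
            current = {"program": s["program"], "operation": op, "samples": []}
            segments.append(current)
        current["samples"].append(s)
    return segments
```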

5.2.6.2 Anomaly detection system

Along with this data segmentation pipeline, another process will be constantly executing, checking whether some specific conditions are satisfied. These conditions are defined based on machine variables, such as electric power, torque, temperature, load and vibration, being above or below certain thresholds. Depending on the severity of these conditions, they will be labeled as alarm (high priority) or warning (medium priority). Each time an alarm or warning is generated, a notification will be sent to a pub/sub queue, allowing:

1. The helpdesk assistant: to receive an email notification that will include the time and type of incident, as well as the option to access the application to obtain information regarding the incident.

2. The anomaly notification system: to enable a notification service allowing third-party developers to get real-time notifications.

3. Database storage for future access through the application.


The criteria used to determine whether an anomaly has occurred are the following:

● Vibration too high: The vibration of the spindle head is captured by the DAS (Dynamics Active Stabiliser) system. The vibration severity is measured in mm/s. The DAS system captures the vibration of the spindle in both the X and Y directions, and stores it in real time in several variables. It will be considered anomalous when at least one of these variables rises above the 8 mm/s threshold.

● Spindle temperature too high for too long: The temperature of the spindle will occasionally rise above 100 ºC. It is not advisable that it stays above this threshold for too long. It will be considered anomalous when the temperature variable remains above this threshold for an extended period of time.

The following flow diagram describes the process steps:

Figure 22 – Anomaly detection system process flow diagram

The process will check for anomalies as new data is generated, sending a notification message to a queue to enable actions on the detected anomalies.
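The following sketch illustrates the two criteria above as threshold checks over incoming samples. The indicator names come from Table 6, but the interpretation of the _d10 suffix as a value scaled by 10, the "too long" window of 600 seconds and the publish() callback are all illustrative assumptions.

```python
# Threshold-check sketch for the two criteria above. The _d10 scaling assumption,
# the 600 s window and the publish() callback are illustrative, not project values.
VIBRATION_LIMIT_MM_S = 8.0
TEMPERATURE_LIMIT_C = 100.0
TEMPERATURE_WINDOW_S = 600           # assumed "too long" window


def check_sample(sample: dict, state: dict, publish) -> None:
    """Evaluate one indicator sample and publish a notification when a criterion is met."""
    # Vibration too high: any DAS severity variable above 8 mm/s
    if max(sample.get("System_DAS_SeveridadActualX", 0.0),
           sample.get("System_DAS_SeveridadActualY", 0.0)) > VIBRATION_LIMIT_MM_S:
        publish({"type": "alarm", "cause": "vibration too high", "at": sample["timestamp"]})

    # Spindle temperature too high for too long (indicator assumed scaled by 10)
    if sample.get("Spindle_Temperature_degreeCelsius_d10", 0.0) / 10.0 > TEMPERATURE_LIMIT_C:
        state.setdefault("temp_high_since", sample["timestamp"])
        if sample["timestamp"] - state["temp_high_since"] > TEMPERATURE_WINDOW_S:
            publish({"type": "warning",
                     "cause": "spindle temperature too high for too long",
                     "at": sample["timestamp"]})
    else:
        state.pop("temp_high_since", None)
```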

5.2.6.3 Anomalies analysis

At runtime, two types of computations will be performed:

1. Computations that need data from the Intervals table
2. Computations that need data from the Intervals table and raw data from VDCs

In the first case, the application implemented in this use case will make use of the pre-computed statistics as a base to run data analysis via statistical tests and machine learning algorithms to find patterns and trends in machine data, computed at application execution time. Some of these algorithms will focus on studying the evolution of specific statistics along different executions of the same operation. Having these statistics pre-computed in the Intervals table allows for higher efficiency.

In the second case, there are some other results that are not practical to pre-compute and store, so they will be computed each time they are requested. These operations are excluded from this module. Some examples of these operations are:

● Position of the operation being performed inside the outline of the whole workpiece.

● Evolution of the axis positions for the selected operations.

● Comparison of patterns between different executions of the operation.

The first two points of the above list will not be pre-computed because they make use of very detailed information, and it would not be practical to have the necessary information pre-stored. The last point will also make use of detailed, unaggregated information and of more elaborate data analysis techniques, such as Dynamic Time Warping to evaluate dissimilarity between time series and unsupervised machine learning models to detect anomalous patterns (a sketch of such a comparison is given after the list below). For these operations to take place it is required to have access to potentially large amounts (hundreds of thousands of records) of raw data, filtered based on the following criteria:

● Filtering by program name.

● Filtering by a set of time intervals given by their begin and end times.
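The following sketch shows the Dynamic Time Warping comparison mentioned above in its classic dynamic-programming form; the two example series are invented and only illustrate how dissimilarity between two executions of the same operation could be evaluated.

```python
import numpy as np


def dtw_distance(series_a, series_b):
    """Dynamic Time Warping distance between two 1-D time series.

    Classic O(n*m) dynamic-programming formulation, used here only to
    illustrate how dissimilarity between two executions is evaluated.
    """
    a = np.asarray(series_a, dtype=float)
    b = np.asarray(series_b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]


# Illustrative use: compare the spindle load signal of two executions.
reference_run = [0.1, 0.5, 1.2, 1.1, 0.4, 0.1]
current_run = [0.1, 0.6, 1.3, 1.3, 1.0, 0.5, 0.1]
print(f"DTW dissimilarity: {dtw_distance(reference_run, current_run):.3f}")
```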

The following flow diagram describes the process steps:

Figure 23 – Anomalies analysis process flow diagram

The process will be subscribed to a queue topic. When it gets a notification from the queue, it runs an analysis, whose results are stored and sent by email.
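A minimal consumer along these lines is sketched below. The queue name, the placeholder analysis routine and the mail addresses and SMTP host are illustrative assumptions, not the actual configuration.

```python
import json
import smtplib
from email.message import EmailMessage

import pika


def run_analysis(anomaly):
    """Placeholder for the actual statistical analysis of the anomaly."""
    return {"anomaly": anomaly,
            "verdict": "vibration pattern deviates from previous executions"}


def send_report(report):
    """E-mail the analysis results to the helpdesk assistant (illustrative settings)."""
    msg = EmailMessage()
    msg["Subject"] = "Anomaly analysis report"
    msg["From"] = "monitoring@example.com"
    msg["To"] = "helpdesk@example.com"
    msg.set_content(json.dumps(report, indent=2))
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)


def on_message(channel, method, properties, body):
    anomaly = json.loads(body)
    report = run_analysis(anomaly)
    # The results would also be stored in the computation database here.
    send_report(report)
    channel.basic_ack(delivery_tag=method.delivery_tag)


if __name__ == "__main__":
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="anomalies", durable=True)
    channel.basic_consume(queue="anomalies", on_message_callback=on_message)
    channel.start_consuming()
```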

5.2.6.4 Application interface

The application will be a dashboard-like web application fully focused on easing the work of the technical service team. It will display the information sent by email by the previous process and will provide some other interesting insights on the anomaly.

The information displayed by the application has two objectives:

● Provide the assistant with information about the operation being performed at the time of the incident.

● Compare that execution of the operation with previous executions of the same operation, so that anomalies and the evolution over time can be observed.

The application will allow users to navigate between tabs to obtain the relevant information. Some examples of what will be displayed in the app are shown below.

The next figure shows the reason why the anomaly was triggered.

Figure 24 – Anomaly trigger cause

The following figure shows the spindle trajectory during operation, where the technical service can see the position of the spindle at every moment during the process.

Figure 25 – Spindle trajectory during operation

The next figure shows the operation position in the piece outline and highlights in red the position of the tool at the precise moment when the anomaly was triggered.

Figure 26 – Operation position in piece outline

This application will be developed in Shiny, an R package that allows building interactive web applications powered by R.

5.2.6.5 Components diagram

The following diagram represents the component architecture. The four main pillars described above are clearly identifiable.

Figure 27 – IDEKO’s use case components architecture

An introduction to the components not mentioned above is given below:

The machine
At the top of the diagram there is the machine generating data. For the use case the machine will be simulated; see the sections above for further details.

Computation database
Persistent MySQL storage for computation data. Segmentation data, anomaly definition data and anomaly analysis results are all stored in this database.

Anomaly detection feeder
An agent that reads real-time data from a VDC method and writes it to a queue, decoupling the data generation system from the anomaly detection system (a minimal sketch of such an agent is given after these component descriptions).

Queue system
RabbitMQ queues that enable decoupling among components.

anom2db
Stores the anomaly event in the database.
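As an illustration of the feeder, the sketch below polls a hypothetical VDC REST method and forwards each reading to the queue; the endpoint URL, queue name and polling interval are assumptions made for the example.

```python
import json
import time

import pika
import requests

VDC_ENDPOINT = "http://vdc.example.com/machine/realtime"  # hypothetical VDC method

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="machine-data", durable=True)

while True:
    # Read the latest real-time sample exposed by the VDC method...
    response = requests.get(VDC_ENDPOINT, timeout=5)
    response.raise_for_status()
    sample = response.json()
    # ...and hand it over to the queue, decoupling data generation
    # from the anomaly detection system.
    channel.basic_publish(exchange="", routing_key="machine-data",
                          body=json.dumps(sample))
    time.sleep(1)
```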

5.2.6.6 Deployment diagram

The following diagram represents the places where the components are expected to be deployed. The VDCs, as they can run either in the cloud or at the edge, are represented with no specific deployment place.

Figure 28 – IDEKO’s use case app components deployment diagram

The upper part of the diagram represents the Smart Box, where the InfluxDB database resides. The machine simulator system, which will enable custom data generation for demonstrating the use case, will also be deployed there.

In the lower part, the three main components (segmentation, anomaly detection and analysis) are represented inside every edge device. The data feeder, along with the queue, is also deployed there.

6 Conclusions

Developing the DITAS project components with a cross-functional, geographically distributed team can be hard work. This document has made clear that having a continuous integration system helps DITAS partners to work more efficiently and in a more organized way. The developers of the project benefit from the CI system, responding to rapid changes while at the same time ensuring that the software under development stays in constant sync. Every partner knows that their contributions to the project are integrated and that the component parts work together, and if something does not integrate, it is quickly discovered.

Developing the case studies taught us that the DITAS approach can be really valuable for the development of complex scenarios: using virtual data containers makes the data sources completely transparent to the application developer, who only needs to call the VDC methods that serve the data, without knowing their real origin. Specifically, an in-depth analysis of the case studies brought us to these conclusions:

● eHealth scenario. The usage of a VDC masks different data sources (containing several types of data and with different privacy and security requirements) so that it becomes very easy for the end user to access data in a compliant way and through a single access mechanism. Indeed, hospital environments currently have several ways of accessing data (e.g., via several software solutions), and this could lead to inconsistent accesses and data leaks. The VDC abstracts the data sources and enforces (privacy/performance/security) rules on them, so that we can ensure data protection for data subjects on one side, and compliance with the application requirements on the other. The three presented scenarios, which show how the situation abstracted by the VDC can become increasingly complicated (adding external actors, adding different types of data sources, considering the GDPR), provide a good playground for the application of such a methodology, as the two considered actors (medical doctor and medical researcher) want to access the very same data sources (i.e., the ones containing patients' data) but with very different requirements (both in privacy/security terms and in performance/data quality terms).

● Industry scenario. Using VDCs to make the data sources transparent has been a great step forward in developing an industrial application. Configuring different data sources has been a continuous problem for application developers in the sector, where every application has to configure its own source, whether edge or cloud. Configuring each machine and remembering every data endpoint has been a well-known tough job in the industrial sector. The abstraction that the VDCs provide, with the developer only switching between VDC endpoints to get data from different sources, has greatly facilitated the development of the application. Finally, having the VDC centralized on the Node-RED based CAF has eased the debugging of any problems that the data sources suffer.
