PROTEUS Deliverable D2.6
687691

PROTEUS: Scalable online machine learning for predictive analytics and real-time interactive visualization

D2.6 Scenario development and KPI definition for the PROTEUS solution

Lead Author: Tao Cao
With contributions from: Rachel Finn
Reviewer: Daniel Toimil Martín
Deliverable nature: Report (R)
Dissemination level (Confidentiality): Public
Contractual delivery date: 30/11/2016
Actual delivery date: 25/11/2016
Version: 1.0
Total number of pages: 31
Keywords: Scenario; Benchmarking; Key Performance Indicators (KPIs); Performance metrics




Abstract

This deliverable sets the scene for a full-scale evaluation and impact assessment of the PROTEUS solution. The evaluation of the solution will consist of three scenarios aligned with the project milestones, KPIs and benchmarking activities. Based on the scenarios and KPIs, benchmarking of the PROTEUS solution will be undertaken using specific metrics for the ArcelorMittal user case and generic indicators for scalable machine learning, hybrid computation and real-time interactive visual analytics. The report also outlines potential applications in other domains leveraging PROTEUS prototypes and incremental improvements gained through each iteration of the PROTEUS solution in the ArcelorMittal user case. This includes the use of generic indicators which demonstrate advancements beyond the state-of-the-art in online distributed analytics and flexibility in the solution.


Executive Summary

PROTEUS is focused on developing data analytics prototypes for the leading steel company ArcelorMittal (industrial user) to solve challenging issues in its steelmaking process. ArcelorMittal's requirements are to be able to better control and track defects in the steel coils it produces using big data technologies and, thus, realize continuous improvements in the steelmaking process. Following an industrial user-defined approach, PROTEUS aims to maximise the alignment between technological development and end-user industrial requirements. Furthermore, PROTEUS is intended to deliver effective solutions through innovative data analytics which can be applied in many other domains. Specific scenarios and concrete Key Performance Indicators (KPIs) will be used to demonstrate the outputs and ensure the exploitation of PROTEUS in the steelmaking industry and other domains with equivalent requirements. This will be followed by an ongoing validation process in the actual ArcelorMittal facilities to obtain valuable feedback and provide guidance for the next development steps.

Based on the PROTEUS project mission, the objectives for this report are:

• Review big data industrial applications and identify existing big data benchmarks (Sections 1.1, 1.2)

• Understand the context of the ArcelorMittal case: underlying issues; data (availability, properties); requirements (PROTEUS prototypes) (Section 2)

• Define scenarios based on the requirements from ArcelorMittal (Section 3)

• Define KPIs of the solutions/prototypes for the scenarios (Section 4)

• Plan for evaluation and demonstration of the solutions which represent an improvement beyond the state-of-the-art in online distributed machine learning and the user case scenario (Section 5)


Document Information

IST Project Number: 687691
Acronym: PROTEUS
Full Title: Scalable online machine learning for predictive analytics and real-time interactive visualization
Project URL: http://www.proteus-bigdata.com/
EU Project Officer: Martina EYDNER
Deliverable Number: D2.6
Deliverable Title: Scenario development and KPI definition for the PROTEUS solution
Work Package: WP2 – Industrial case: requirements, challenges, validation and demonstration
Date of Delivery: Contractual M12, Actual M12
Status: Version 1.0, final
Nature: Report
Dissemination level: Public
Authors (Partner): Tao Cao (TRI), Rachel Finn (TRI)
Responsible Author: Tao Cao ([email protected]), Trilateral Research, Phone: +44 (0)207 559 3550

Abstract (for dissemination)

This deliverable sets the scene for a full-scale evaluation and impact assessment of the PROTEUS solution. The evaluation of the solution will consist of three scenarios aligned with the project milestones, KPIs and benchmarking activities. Based on the scenarios and KPIs, benchmarking of the PROTEUS solution will be undertaken using specific metrics for the ArcelorMittal user case and generic indicators for scalable machine learning, hybrid computation and real-time interactive visual analytics. The report also outlines potential applications in other domains leveraging PROTEUS prototypes and incremental improvements gained through each iteration of the PROTEUS solution in the ArcelorMittal user case. This includes the use of generic indicators which demonstrate advancements beyond the state-of-the-art in online distributed analytics and flexibility in the solution.

Keywords Scenario; Benchmarking; Key Performance Indicators (KPIs); Performance metrics

Version Log

Issue Date   Rev. No.   Author          Change
30/06/2016   0.1        Tao Cao (TRI)   Initial ToC
13/10/2016   0.2        Tao Cao (TRI)   Draft
09/11/2016   1.0        Tao Cao (TRI)   TRI internal review by Rachel Finn
11/11/2016   1.0        Tao Cao (TRI)   Circulated to project partners for review
24/11/2016   1.0 final  Tao Cao (TRI)   Changes from reviewers incorporated


Table of Contents

Executive Summary
Document Information
Table of Contents
List of Figures and List of Tables
Abbreviations
1 Introduction
  1.1 Big Data and Industrial Applications
  1.2 Industrial Application Benchmarking
2 PROTEUS Environment and Requirements
  2.1 Hot Strip Mill Problem Statement and Requirements
    2.1.1 The Hot Strip Mill Problem Statement
    2.1.2 Identify the Requirements
  2.2 Data property: Volume, Velocity, Variety
    2.2.1 Volume
    2.2.2 Velocity
    2.2.3 Variety
  2.3 Datasets and Data Schema for PROTEUS
3 Scenario Development
  3.1 Scenario Development
  3.2 Hybrid Computation Engine
  3.3 Interactive Visualization
  3.4 Real-time Online Machine Learning
4 KPIs, Evaluation and Benchmark Plan
  4.1 KPIs and Benchmarks
  4.2 Planning for Running the Benchmark and Analyzing Results
5 Conclusions
References


List of Figures and List of Tables

Figure 1: The five-stage benchmarking methodology for big data systems [Han, R 2015]
Figure 2: The development of TPC benchmark standards in line with industry trends [Raghunath, N., 2013]
Figure 3: Hot Strip Mill process
Figure 4: Dataset schema used in the PROTEUS project
Figure 5: PROTEUS machine learning model development and evaluation workflow
Figure 6: Overall structure of the evaluation and benchmarking work plan

Table 1: User and system requirements
Table 2: PROTEUS scenarios
Table 3: PROTEUS KPI metrics


Abbreviations

HSM: Hot Strip Mill
KPI: Key Performance Indicator
DW: Data Warehouse
ETL: Extract, Transform, Load
RDBMS: Relational Database Management System
HDFS: Hadoop Distributed File System
SQL: Structured Query Language
TPC: Transaction Processing Performance Council
DS: Decision Support
HS: Hadoop Systems
OLTP: On-line Transaction Processing
OLAP: On-line Analytical Processing
RMSE: Root Mean Square Error
FAR: False Alarm Rate
FDR: Fault Detection Rate
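Several of these abbreviations name evaluation metrics used in the KPI definitions later in the report. As a minimal illustrative sketch (not the project's actual evaluation code; function names are ours), RMSE, FAR and FDR can be computed as follows:

```python
import math

def rmse(predicted, actual):
    """Root Mean Square Error between two equal-length numeric sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def fault_rates(flags, truth):
    """False Alarm Rate and Fault Detection Rate for binary fault labels.

    FAR = false alarms / actual non-faults
    FDR = detected faults / actual faults
    """
    false_alarms = sum(1 for f, t in zip(flags, truth) if f and not t)
    detected = sum(1 for f, t in zip(flags, truth) if f and t)
    non_faults = sum(1 for t in truth if not t)
    faults = sum(1 for t in truth if t)
    far = false_alarms / non_faults if non_faults else 0.0
    fdr = detected / faults if faults else 0.0
    return far, fdr
```

For example, `fault_rates([1, 1, 0, 0], [1, 0, 0, 0])` yields a FAR of 1/3 (one false alarm out of three non-faults) and an FDR of 1.0 (the single real fault was detected).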


1 Introduction

PROTEUS is focused on developing data analytics prototypes for the leading steel company ArcelorMittal (industrial user) in solving challenging issues in its steelmaking process. ArcelorMittal defines requirements which are to be able to better control and track defects in the steel coils they produce using big data technologies and, thus, realize continuous improvements in the steelmaking process.

Following an industrial user defined approach, PROTEUS aims to maximise the alignment between technological development and end-user industrial requirements. Furthermore, PROTEUS is intended to deliver solutions through innovative data analytics which can be applied in many other domains.

Specific scenarios and concrete Key Performance Indicators (KPIs) will be used to demonstrate the outputs and ensure the exploitation of PROTEUS into the steelmaking industry and other domains with equivalent requirements, followed by an ongoing validation process in the actual ArcelorMittal facilities, to obtain valuable feedback and provide guidance for next development steps.

Overall, PROTEUS’ mission is to evolve massive online machine learning strategies for predictive analytics and real-time interactive visualization methods into ready-to-use solutions, and to integrate them into an enhanced version of Apache Flink, the EU Big Data platform. PROTEUS will contribute to the EU big data area by addressing fundamental challenges related to the scalability and responsiveness of analytics capabilities, as well as difficulties in processing, analyzing, and visualizing big data.

This deliverable proceeds by outlining the ways in which big data and benchmarks are relevant for industrial applications. Second, it examines the requirements for the PROTEUS solution from the perspective of benchmarking processes and the industrial use case. Third, it defines scenarios and KPIs of the solutions/prototypes based on the requirements from ArcelorMittal. The deliverable closes with a plan for evaluation and demonstration of the solution's improvements.

1.1 Big Data and Industrial Applications

Industrial operations and physical systems generate increasing volumes of data in various forms via interconnected sensors, instrumentation, and smart machines. At the same time, advances in data storage, communication, and big data technologies are making it possible to collect, store, and process enormous volumes of data at scale and speed. As a result, industrial systems produce very large volumes of continuous streams of sensor, event and contextual data. Such unprecedented amounts of data need to be stored, managed, analyzed and acted upon for the sustainable operation of these systems.

Big data technologies, driven by innovative analytics, are critical to creating novel solutions for these industrial operations systems to achieve better outcomes at lower costs, including substantial savings in time and energy, and better performance, which means higher quality products and longer-lasting physical assets.

Efficiency and performance improvement requires leveraging analytics solutions to translate data-driven insights from data sources into actionable insights delivered at the speed of a particular industrial business domain. Big data analytics innovations will be required: 1) to collect, access, process, retrieve, analyse and manage vast volumes, variety, and velocity of data; 2) to create increasing value by moving from descriptive or historical analytics (what has happened and why) to predictive analytics (what is likely to happen and when) and finally to prescriptive analytics (what are the best actions to take next).

Big data analytics drives transformative change across industries and holds the promise of making industrial production more efficient and cost effective, using innovative data analytics to gain insights from these rich types of data and better manage industrial systems, operations, resources and products.

The characteristics of industrial big data systems can be described in terms of: 1) the variety of data; 2) the types of operations that occur in these systems; and 3) the latencies and kinds of decisions. These characteristics will be illustrated using the PROTEUS industrial application in Section 2.

Big data applications can be represented through a life cycle. Historical data stored in a Data Warehouse (DW) is processed, normalized and structured using extract, transform and load (ETL) tools. The operations available in these systems are limited to historical data analysis. This traditional approach of building applications on a DW has a major drawback: DWs only store part of the data, while optimal decision making at any time scale requires all the available data.

On the other hand, to meet the increasing demand for high production and product quality as well as economical operation, today's industrial processes have become more complex and automated. Data generation, capture and storage technologies have evolved from RDBMS (Relational Database Management Systems) to HDFS (Hadoop Distributed File System). High-volume datasets are ubiquitous in many domains and in different forms. Some data is generated automatically by high-frequency sensors and needs to be processed in a timely fashion. There is also an increase in the variety of data (historical/streaming data, structured/semi-structured/unstructured data) and in the complexity of operations (ranging from queries to time series analysis to event correlation and prediction) in certain industrial systems, such as ArcelorMittal's hot strip mill (HSM). These features require novel real-time big data architectures to collect all available data, moving away from traditional business intelligence towards real-time operational intelligence, where a single system is able to decrease the latency of decision making.

In the PROTEUS project, the ArcelorMittal steel production phase is a complex process divided into different component processes. A key process is coil production, because any defects introduced at that early stage have a significant economic impact. Detecting defects in the early stages of steel production is a key point, as early detection will have a great impact on reducing the cost of pushing a defective product, which will eventually be rejected, through the production process. Thus, the sooner the defects are detected, the sooner the process can be modified or stopped in order to avoid these expenses.

ArcelorMittal provides data collected in one of its primary facilities, the Hot Strip Mill. Examples of available parameters include temperature, vibration intensity, tension in the rollers and the speed of the plate when entering the coiler. The variety of data being collected is shifting from historical to streaming data (data from multiple sensors installed in a network of HSMs), and from structured to unstructured data. The complexity of the data operations also varies, from queries on historical data to more data-driven coil flatness computation.
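The shift from historical queries to computations over streaming sensor data can be sketched with a toy online statistic: a one-pass running mean and variance (Welford's algorithm) over incoming readings, which could feed a simple threshold-based anomaly flag. This is an illustrative sketch only; the names and the thresholding rule are ours and do not reflect ArcelorMittal's actual processing.

```python
class RunningStats:
    """Welford's online algorithm: mean and variance over a stream in one pass."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Sample variance; 0.0 until at least two readings have arrived.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

def flag_outlier(stats, x, k=3.0):
    """Flag a reading more than k standard deviations from the running mean."""
    sd = stats.variance() ** 0.5
    return stats.n > 1 and sd > 0 and abs(x - stats.mean) > k * sd
```

Because the statistics are updated incrementally, this style of computation needs neither a data warehouse nor a second pass over historical data, which is the essential property of a streaming operation.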

1.2 Industrial Application Benchmarking

Benchmarks are developed to evaluate and compare the performance of systems and architectures. Conceptually, a benchmark aims to generate application-specific workloads and tests capable of processing big data sets to produce meaningful evaluation results [Tay, Y.C., 2011]. Benchmarks have proven critical to both system users and vendors: users use benchmark results when evaluating new systems in terms of performance, price/performance and energy efficiency, while vendors use benchmarks to demonstrate the competitiveness of their products and to monitor release-to-release progress of their products under development. Listed below are five key aspects of a good benchmark articulated by Huppler [Huppler, K., 2009].

a) Relevant – A reader of the result believes the benchmark reflects something important
b) Repeatable – There is confidence that the benchmark can be run a second time with the same result
c) Fair – All systems and/or software being compared can participate equally
d) Verifiable – There is confidence that the documented result is real
e) Economical – The test sponsors can afford to run the benchmark

The complexity, diversity, and rapid evolution of big data systems give rise to new challenges in how to compare their performance, energy efficiency, and cost effectiveness. Figure 1 shows a typical benchmarking methodology for big data systems and it consists of five stages [Han, R 2014] [Han, R 2015].

1) Stage 1 plans the application domain and the benchmarking object.
2) Stage 2 surveys the representative applications in this domain, and identifies data models from real data, data operations and workload patterns from real workloads, and evaluation metrics from performance and cost indicators.
3) Stage 3, based on the identification results, implements data generation tools to produce data sets and implements workloads to support application-specific benchmarking tests.


4) Stage 4 determines the target system and prepares the input data and the benchmarking prescription used to test this system. A prescription includes all the information needed to produce a benchmarking test: input data, workloads, a method to generate the test, and the evaluation metrics.
5) Stage 5 is the benchmark testing; results are analyzed and evaluated.
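Stage 4's "prescription" is essentially a bundle of everything needed to reproduce one test. A minimal sketch of such a structure is shown below; the field and function names are illustrative, not taken from the methodology itself:

```python
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class Prescription:
    """All information needed to produce one benchmarking test (Stage 4)."""
    input_data: Any                     # dataset, or a path to generated data
    workloads: List[str]                # workload pattern identifiers
    test_generator: Callable[..., Any]  # method to generate the test
    metrics: List[str]                  # evaluation metrics, e.g. latency, throughput

def make_test(p: Prescription):
    """Stage 5 entry point: materialize a runnable test from a prescription."""
    return p.test_generator(p.input_data, p.workloads)
```

Keeping all of this in one object is what makes the test repeatable (criterion b above): rerunning the benchmark means re-materializing the test from the same prescription.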

Figure 1 The five-stage benchmarking methodology for big data systems [Han, R 2015]

To date, most state-of-the-art big data benchmarks are designed for specific types of database and data analytics systems. A benchmark is a workload designed to evaluate a system by providing realistic and accurate measures. The usual goal of benchmarking is to learn about the system's behaviour and performance while coping with the workload. Benchmarking is important: it can effectively study what will happen when systems deal with workloads. It can observe the system's behaviour under workload, determine the system's capacity, learn which changes are important, or see how the application performs with different data. In addition, benchmarking can test a variety of circumstances based on the observed conditions.
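The idea of observing a system's behaviour under a controlled workload can be illustrated with a toy harness that times repeated runs of a workload function and reports simple latency statistics. This is a sketch for illustration, not a production benchmarking tool:

```python
import time
import statistics

def run_benchmark(workload, runs=50):
    """Execute `workload` repeatedly and summarize per-run latencies (seconds)."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        latencies.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "mean_s": statistics.mean(latencies),
        "p95_s": sorted(latencies)[int(0.95 * (runs - 1))],  # crude 95th percentile
        "max_s": max(latencies),
    }
```

Even a harness this small shows why repeatability matters: the mean, tail (p95) and worst-case latencies of the same workload can differ noticeably between runs, so a benchmark must be run enough times for the summary statistics to stabilize.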

One of the challenges with benchmarking is to identify and implement a realistic and meaningful workload that reflects the real-life application domain. The workload used to stress the system is usually very simple in comparison with real-life workloads, because real-life workloads are non-deterministic, varying, and too complex to understand. When benchmarking systems with real workloads, it is harder to draw accurate conclusions from the results.

Hence, in industrial application benchmarking, identifying the typical workload for an application domain is the prerequisite of implementing workloads to evaluate the systems. Han identified the workload implementation requirements from the functional perspective and the system perspective [Han, R. 2015].

Functional perspective: abstract the behaviours of different workloads into a general approach, and identify representative workload behaviours in the application domain.

• Operations: represent the abstracted processing actions (operators) on data sets.
• Workload patterns: combine operations to form complex processing tasks. One identified workload pattern can contain one or multiple abstract operations as well as their workflow. For example, a workload pattern representing a SQL query can contain select and put operations, in which the select operation executes first.

System perspective: the identified operations and patterns are designed to capture the system-independent behaviours of a workload, i.e. the data processing operations and their sequences. Based on the abstracted operations and patterns, an abstract workload can be constructed that is independent of the underlying systems. From the system perspective, this abstract workload can be implemented on different software stacks (Flink, Hadoop and Spark), thereby allowing the comparison of systems of different types. For example, an abstract workload consisting of a sequence of read, write and update operations can be used to compare a DBMS and the Hadoop MapReduce system.
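The separation between abstract patterns and concrete systems can be sketched as follows: a pattern is an ordered sequence of operation names, and each backend supplies its own implementation of each operation, so the same pattern can be executed against (and used to compare) different systems. The backend registry and operation names here are hypothetical, not from Han's framework:

```python
# One backend: operation names mapped to concrete list-based implementations.
# A Flink or Hadoop backend would map the same names to its own operators.
sql_like_backend = {
    "select": lambda data, pred: [row for row in data if pred(row)],
    "put":    lambda data, row: data + [row],
}

def run_pattern(backend, pattern, data):
    """Execute a workload pattern (ordered (operation, argument) steps) on a backend."""
    for op, arg in pattern:
        data = backend[op](data, arg)
    return data

# Pattern representing a SQL-style query: the select operation executes first.
pattern = [("select", lambda r: r["v"] > 1), ("put", {"v": 99})]
```

Because `run_pattern` only refers to operation names, swapping in a different backend dictionary changes the system under test without changing the workload definition, which is exactly what makes cross-system comparison possible.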

The trade-offs:

• It is possible to do more realistic load testing (as distinct from benchmarking), but it requires a lot of care in creating the dataset and workload, and in the end it is not really a benchmark.

• Benchmarks are simpler, more directly comparable to each other, and cheaper and easier to run. And despite their limitations, benchmarks are useful.

With the emergence of new systems driven by the exploration of big data value, covering a diversity of workloads is a prerequisite for performing efficient benchmarking tests. Considering the diversity of big data systems, BigDataBench (current version 3.2) models five typical and important mainstream big data application domains, namely, search engines, social networks, e-commerce, multimedia analytics, and bioinformatics.

There are many dimensions to a benchmark, such as the data size and the distribution of data and queries, but an important one is that a benchmark usually runs as fast as it possibly can, loading the system so heavily that it behaves badly. In many cases, benchmark tools should instead run as fast as possible within certain tolerances in order to evaluate good performance; this would be especially helpful for determining the system's maximum usable capacity. However, most benchmarking tools do not support such complexity, which can limit the meaningfulness and usefulness of the benchmarking results.

Over the past quarter-century, industry standards bodies like the Transaction Processing Performance Council (TPC) have developed several industry standards for performance benchmarking [Raghunath, N., 2013]. The TPC has a reputation for providing the industry with complete application-level performance benchmarks. The TPC benchmark standards were originally developed for transaction processing, and later extended to benchmarking decision support systems, virtualization and data integration in line with industry demands. The TPC benchmarking model has been the most successful in modeling and benchmarking a complete end-to-end business computing environment rather than subsystem evaluation. TPC benchmarks have led the way in developing a benchmark model that most fully incorporates robust software testing [Nambiar, R.O., et al., 2009]. Figure 2 shows the development of TPC benchmark standards in line with industry trends.


Figure 2 The development of TPC benchmark standards in line with industry trends [Raghunath, N., 2013]

1) The first TPC benchmark, TPC-A, evolved into TPC-B and was replaced by TPC-C, a 3-tier Online Transaction Processing (OLTP) benchmark. TPC-C has been a standard since 1994.

2) The TPC later added a new OLTP benchmark, TPC-E, representing more complex transactions and system availability requirements, which currently coexists with TPC-C.

3) TPC-D was the first Decision Support benchmark from the TPC. TPC-D evolved into TPC-H and TPC-R.

4) In 2010 the TPC added a new DSS benchmark, TPC-DS, representing modern decision support systems with multiple business channels and a large number of complex queries [Poess, M., et al., 2007]. TPC-DS has been a very popular workload in academia and industry as the basis for several other workloads.

5) The TPC has also developed benchmarks for database workloads in virtualized environments and data integration to address industry demands.

6) The next revolution in the data management platform space is Big Data; as the leading benchmarking council for transaction processing and database benchmarks, the TPC is well positioned to develop standards for it. The TPC recently released the TPC Express Benchmark for Hadoop Systems (TPCx-HS) [TPCx-HS 2015].

TPC-H and TPC-DS, two decision support benchmarks, are employed to compare SQL-based query processing performance in SQL-on-Hadoop and relational systems. Although these benchmarks have often been used for comparing the query performance of big data systems, they are fundamentally SQL-based and thus lack the new characteristics of industrial big data systems. The BigDS benchmark [Zhao, J.M., et al., 2013] extended TPC-DS for applications in the social marketing and advertisement domains; however, this proposal did not define a query set and data model for the benchmark.

Benchmarking is a complex and evolving task, especially for industrial big data applications. Existing benchmark proposals either focus on OLTP/OLAP (On-line Analytical Processing) workloads for database systems or on enterprise big data systems, but benchmarks for industrial big data applications must address a range of data and analytics processing needs. Common to all of the above benchmarks is that they are designed to measure limited features of big data systems in specific application domains (e.g., decision support, streaming data, event processing, distributed processing). None of the existing benchmarks cover the range of data and analytics processing characteristic of industrial big data applications, nor have they addressed the requirements of applications such as PROTEUS: the typical data and processing requirements, and representative queries and analytics operations over streaming and stored, structured and unstructured data.


2 PROTEUS Environment and Requirements

The ArcelorMittal production line produces Big Data sets containing both structured and unstructured data in historical and real-time domains. These data sets are available from diverse sources at widely varying rates, which presents opportunities for data processing, interactive visualization, analytics and machine learning on big data.

In order to clearly summarise the industrial application problem and requirements, some of the content below is borrowed from the PROTEUS proposal and Deliverable 2.1.

2.1 Hot Strip Mill Problem Statement and Requirements

2.1.1 The Hot Strip Mill Problem Statement

The facility where the coils are produced is the HSM, where the slabs coming from the continuous casting facility are transformed into coils with a defined width and thickness. The HSM process reduces the original slab thickness by a factor of more than 10.

The different variables of the process define the properties of the final product; for that reason, there are physical factors that determine the quality of the product. Figure 3 shows the HSM process divided into phases: preheating furnace, breaking-down zone, finishing zone, and coiling zone.

Figure 3 Hot Strip Mill process

Precision of the sensor measurements over time

In the ArcelorMittal facility, depending on their positions, some sensors have a longer lifetime and collect more accurate data, whereas others deteriorate easily and their data (variables) are more volatile. Moreover, depending on the wear of the rolls, different defects are more likely to arise. Thus, the continuous changes that occur in the facility make the available data dynamic, and this is a key point in the definition of the prediction models.

Continuous changes

Another important goal is that the model being formalized has to be dynamic: it has to adapt to and overcome the possible inclusion of incorrect data, the presence of partial data, and the fact that some delay may occur before certain data are obtained.

An additional consideration for online data is that there are certain stops/cuts in the stream. The reasons are diverse but consist mainly of the following: there is a planned stop in the process, there is an unplanned stop in the facility, and/or there is a server crash. Whereas the first is controlled and could therefore be treated differently (omitting this data or handling it separately due to its nature), in the remaining two cases there is no prior notification. Therefore, the model should be ready to face the erroneous data that will arrive from the sensors in these situations.

On the other hand, the symmetry and asymmetry indices could be modelled as target variables of an online machine learning problem: the time-series variables are used to compute symmetry and asymmetry indices containing a flatness evaluation of the coil every metre. From a machine learning point of view, the historical process data of past coils can be related to their corresponding symmetry and asymmetry indices and used as predictor and target values, respectively, to train machine learning models capable of predicting future "grades of flatness" of new coils that have not yet been inspected. It is also desirable that these methods outline the variables most relevant to the grade of flatness of the coils.

Finally, the data supplied by the inspection system are not immediate; there is a short delay before the system delivers the indices. This fact needs to be taken into account, as the model has to be able to cope with this time margin.

2.1.2 Identifying the Requirements

To address the above issues, the users require new techniques capable of adapting to changing, real-time, online, massive data and of extracting relevant information, such as early detection of flatness defects (even while the coils are still being manufactured in the HSM); the identification of relevant variables that might have an impact on the appearance of defects; or the discovery of additional hidden information through visualization tools. An additional requirement, apart from predicting the flatness of coils, is to find anomalies in the time series representing the symmetry and asymmetry flatness. A plausible option is to develop a model that checks in advance whether new incoming data make sense. For all these reasons, the model being formalized cannot be fixed; it has to be dynamic in order to address incoming changes.

User requirements:


• The main goal for ArcelorMittal is to detect defects in coils in the early stages of the production process, in order to cut the economic loss that results when a faulty coil continues through the steel manufacturing process and is eventually rejected. The current ArcelorMittal inspection system can provide quantitative measurements of coil flatness in the form of symmetry and asymmetry indices of the strips. However, the inspection system cannot assess a coil as defective or non-defective (refer to D2.1 for more details). Therefore, a deeper analysis is required to overcome this problem.

• The main requirement is the detection of anomalies in the time series representing the symmetry and asymmetry indices, identifying the variables that influence coil flatness and the range of variation in which defects arise, and displaying a warning when those variables exceed the limits of the non-defective band. This is an approach of main interest for ArcelorMittal.

System requirement: to achieve the above user requirements, the solution requires:

• An integrated processing engine for managing batch data (data-at-rest) and data streams (data-in-motion) in a hybrid-merge mode, serving as the basis for the real-time machine learning and visual analytics modules.

Table 1 shows the identified requirements along with the corresponding PROTEUS technological prototypes.

Table 1 User and system requirements

Number | Type | Requirement | PROTEUS technological prototype
R1 | User | Coil flatness prediction | Online Machine Learning
R2 | User | Relevant variable visualization | Interactive visualization
R3 | System | Processing historical data | Integrated processing engine
R4 | System | Processing real-time data | Integrated processing engine

Both the system and user requirements will be addressed in the scenarios presented in Section 3.

2.2 Data property: Volume, Velocity, Variety

In order to obtain these data, ArcelorMittal has a sensor system capable of measuring certain parameters of the coils. The data produced by this system are real-time, continuous, massive data, which cannot be analysed with classical analytical methods.


The data generated by ArcelorMittal exhibit the representative 3V properties of big data. In order to clearly summarise the data properties, some of the content below is borrowed from Deliverable 2.1:

2.2.1 Volume

The data collected and intended to be analysed is presented in two different datasets:

Process dataset: constructed from the HSM process database. The process data consist of sensor measurements; the sensors are distributed along the facility and measure several process variables. In this dataset, each coil has a single associated value for each variable, which in many cases represents a summary (an average) of the measurements collected over time. The HSM process database includes both quantitative and qualitative data, stored in 42 different tables. All the tables share the same key variable, which allows us to join and relate all this information to specific coils. The tables in the HSM database store a total of 7475 variables related to the coil production process since 2010, with ~840000 records per variable. The database contains mostly numerical and categorical values, and its size increases as new coils are produced. Each of the 42 tables is around 300-700 MB in size.

In addition, in order to measure flatness, different sensors are installed in the HSM. Coil flatness maps are generated once the entire coil has been processed and measured by the flatness measurement system. These maps are unstructured data (images) that contain useful information about the flatness of the coil. One flatness map is available per coil, and the historical flatness-map data amount to around 300 GB.

2.2.2 Velocity

The tables of the HSM database are updated continuously as new coils are processed. As a general guideline, the generation rate is usually between 32 and 500 milliseconds. This variation makes it difficult for the system to reconcile data captured at different time instants, since there is no common generation rate; the system mixes all this data together in order to have it in a manageable form for analysis. Data containing information about detected defects are not updated at the same rate as the HSM database, since processed coils are not evaluated for defects until later, when the coil is processed by the flatness measurement system.

2.2.3 Variety

The process data generated by the sensors are structured data, both quantitative and qualitative, stored in 34 different schemas with a total of 7870 variables. Data generated by an internal model that evaluates the flatness of the coil are unstructured data, stored in the SIG image format, and contain data in two formats: time series and maps (similar to heat maps) that represent the flatness of the coil. Each SIG file stores 42 time series/maps associated with a coil, including its flatness as a target variable, among others. One flatness map is available per coil. The interval at which each variable is measured varies depending on the system's capability to acquire and store the data and on the speed of the coil being processed.


2.3 Datasets and Data Schema for PROTEUS

This section describes the main characteristics of the available process data, which contains historical information about processed coils in the HSM, and flatness measurements from an inspection system.

The dataset schema used in the PROTEUS project is shown in Figure 4; it comprises two different data types: historical data (HSM) and "real-time" data (COILTIMESERIES).

The HSM table, containing historical process data from HSM, has 7475 columns representing sensor data variables belonging to a particular coil. A row in HSM represents the historical process data belonging to one specific coil, and the variable V_0001 is the coil identifier.

COILTIMESERIES has the data generated in real time for coil processing. Each COILTIMESERIES row has 5 columns:

• ID: coil identifier

• Variable: Variable name

• PositionX: sensor position on axis x.

• PositionY: sensor position on axis y (it can be NULL; some variables have no y-axis points).

• Value: value given by sensors in points specified by positionX and positionY.

Figure 4 Dataset schema used in the PROTEUS project
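For illustration only, a COILTIMESERIES record of the shape described above could be represented as follows (a hypothetical sketch; the class and function names are not part of the PROTEUS codebase, and the actual ingestion code is outside the scope of this deliverable):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CoilTimeSeriesRow:
    """One real-time measurement from the COILTIMESERIES stream."""
    id: str                      # coil identifier
    variable: str                # variable name
    position_x: float            # sensor position on the x axis
    position_y: Optional[float]  # sensor position on the y axis (may be NULL)
    value: float                 # reading at (position_x, position_y)


def parse_row(fields: List[str]) -> CoilTimeSeriesRow:
    """Parse a raw 5-field record; an empty y position maps to None (NULL)."""
    id_, variable, x, y, value = fields
    return CoilTimeSeriesRow(
        id=id_,
        variable=variable,
        position_x=float(x),
        position_y=float(y) if y not in ("", None) else None,
        value=float(value),
    )
```

A record for a variable with no y-axis points would simply carry an empty fourth field, which the parser maps to None.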


3 Scenario Development

This section analyses the functional scenarios from the point of view of big data analytics, based on the objectives of the PROTEUS project. The objectives are to identify the functional and technical gaps that will be addressed within the scope of the project. This section also sets out the scenarios based on the requirements and solutions of the PROTEUS project.

3.1 Scenario Development

The alignment with the end-user industrial scenario needs during the development of the project, and the definition of clear functional, scientific, technical and verifiable requirements are key issues. This will demonstrate the outputs and ensure the exploitation of PROTEUS into the steelmaking industry and other domains with equivalent requirements.

The overall goal of PROTEUS is to provide a scalable online machine learning solution not limited to a specific scenario. However, certain demonstration scenarios are chosen for implementation during the project runtime. The scenarios aim to use the ArcelorMittal case to evaluate the PROTEUS innovations of scalable big data online machine learning and analytics.

Scenario details and objectives defined in D2.1 have been expanded to cover the technology development and prototype evaluation during the project. Three scenarios (aligned with the milestones of the PROTEUS project) will be evaluated during the project. Specific scenarios will be used to obtain feedback about the project advances and to provide guidance for the next steps.

• First scenario (due by M18): will evaluate the integrated processing engine for analysing both data-at-rest and data-in-motion in a hybrid-merged way within the Apache Flink platform. It will also include a first version of the stream library on top of Apache Flink capable of querying the data streams of the industrial scenario at any time.

• Second scenario (due by M24): will include a first version of the incremental visual analytics infrastructure (data collector, incremental analytics engine, visualization layer), enabling novel ways of visualizing, analysing and interacting in real time with data streams and historical datasets.

• Third scenario (due by M30): will incorporate i) the final version of the real-time interactive visual analytics infrastructure; ii) a first complete version of the online learning algorithms for Big Data streams; and iii) the declarative language on top of Apache Flink for easily analysing data streams and batch datasets using the integrated processing engine.

The methodology of scenario development has been as follows:

a) Partners involved in all scenarios elaborated drafts of possible scenarios according to the data collected in the user requirement studies (D2.1).

b) These drafts were presented at the PROTEUS Consortium Meeting held in Bournemouth on 20 and 21 October 2016.


c) The whole consortium discussed the proposed scenarios, mainly according to the user requirements and their importance for the development of the PROTEUS prototypes, but also taking into consideration whether the included actions add innovative challenges. The discussion was moderated by TRI. Preferred actions and situations were selected by the consortium at the Consortium Meeting.

d) Partners and users elaborated technical requirements and procedures according to the scenario specifications, in order to elaborate use cases.

e) Finally, the scenario descriptions were elaborated by TRI, the partner directly involved in Task 2.4. These descriptions are included in Table 2.

According to the Consortium Meeting, the subsequent scenario descriptions represent the goals for PROTEUS development. The scenarios defined in D2.1 will be revised according to the following criteria:

Table 2 PROTEUS scenarios

Scenario | User requirements explicitly accomplished
First scenario: integrated processing engine to analyse both batch data (data-at-rest) and data streams (data-in-motion) | Processing historical data / real-time data
Second scenario: Real-time Interactive Visualization | Relevant variable visualization
Third scenario: Scalable Online Machine Learning | Coil flatness prediction

3.2 Hybrid Computation Engine

First scenario: integrated processing engine to analyse both batch data (data-at-rest) and data streams (data-in-motion). In this scenario, based on an integrated processing engine that manages batch data (data-at-rest) and data streams (data-in-motion) in a hybrid-merge mode and serves as the basis for the real-time machine learning and visual analytics modules, end-users such as plant managers or plant operators can:

• query the extremely large and diverse (structured and unstructured) historical manufacturing data and the real-time data streams from the HSM while it runs as a continuous process (steel is transformed from slabs into coils by heating and rolling the material through rolls at high pressure and temperature, keeping a controlled tension over the material, and finally cooling it with water showers);

• manipulate the historical and stream data, and use them to compute statistics and to find correlations between different types of steel.
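As a minimal, framework-independent sketch of this hybrid-merge idea (the actual engine is built on Apache Flink; the class and variable names here are hypothetical), a batch aggregate computed over data-at-rest can be merged with a running aggregate maintained over data-in-motion:

```python
class RunningMean:
    """Incrementally maintained mean, mergeable with a batch aggregate."""

    def __init__(self, count=0, total=0.0):
        self.count = count
        self.total = total

    def update(self, value):
        """Data-in-motion: fold in one record at a time."""
        self.count += 1
        self.total += value

    def merge(self, other):
        """Combine this aggregate with another (e.g. a data-at-rest one)."""
        return RunningMean(self.count + other.count, self.total + other.total)

    @property
    def mean(self):
        return self.total / self.count if self.count else 0.0


# Batch aggregate computed once over historical HSM data (example numbers)
batch = RunningMean(count=4, total=40.0)      # historical mean = 10.0

# Streaming aggregate updated as new coil measurements arrive
stream = RunningMean()
for v in [12.0, 14.0]:
    stream.update(v)

combined = batch.merge(stream)                # hybrid view over both sources
```

The key property is that the streaming side never re-reads the historical data: the two aggregates are computed independently and merged on demand, which is the essence of the hybrid batch/stream computation described above.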

3.3 Interactive Visualization


Second scenario: Real-time Interactive Visualization. In this scenario, there are large sets of historical data, and the data streams are dynamic in real time. Benefitting from the integrated hybrid processing engine, the end-users can:

• query these data,

• visualize, analyse and interact with data streams and historical datasets in real time,

• analyse and visualize the properties of the products,

• compute statistics and find correlations between different types of steel. This correlation analysis is an offline/online analytics task that can be handled by the online learning system.
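Finding correlations over an unbounded stream calls for an incremental formulation. The following one-pass (Welford-style) Pearson correlation is a hypothetical sketch of how such a statistic could be maintained without storing the full history; it is not the PROTEUS implementation:

```python
import math


class OnlineCorrelation:
    """One-pass Pearson correlation between two streamed variables."""

    def __init__(self):
        self.n = 0
        self.mean_x = self.mean_y = 0.0
        self.m2_x = self.m2_y = self.cov = 0.0

    def update(self, x, y):
        """Fold one (x, y) pair into the running moments."""
        self.n += 1
        dx = x - self.mean_x
        self.mean_x += dx / self.n
        dy = y - self.mean_y
        self.mean_y += dy / self.n
        self.m2_x += dx * (x - self.mean_x)
        self.m2_y += dy * (y - self.mean_y)
        self.cov += dx * (y - self.mean_y)

    @property
    def correlation(self):
        denom = math.sqrt(self.m2_x * self.m2_y)
        return self.cov / denom if denom else 0.0


corr = OnlineCorrelation()
# Perfectly linear example data (hypothetical process values)
for temp, flatness in [(800.0, 1.2), (820.0, 1.5), (840.0, 1.8)]:
    corr.update(temp, flatness)
```

Each update costs constant time and memory, so the statistic can be refreshed on every arriving record and queried at any time by the visualization layer.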

3.4 Real-time Online Machine Learning

Third scenario: Scalable Online Machine Learning. In this scenario, the end users can use the PROTEUS novel scalable online machine learning library and data mining algorithms:

• to cope with high-speed online data streams;

• to process data covering all steps of real-time stream processing, from preprocessing and basic sketches to more elaborate predictive and tracking analytics;

• to provide, in addition to the large historical data, a very fast and asynchronous model representation for the real-time analytical/prediction tasks.

In contrast to the stale historical data, the model for online machine learning has to react to immediate changes. This imposes some requirements on the representation of the prediction model. For instance, the model should be able to update asynchronously as data arrives from the sensors over time without having to lock, an operation that is quite expensive when it comes to model training using tens or hundreds of machines in a cluster.
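A minimal single-threaded sketch of such a per-sample model update is shown below, assuming a simple linear model trained by stochastic gradient descent; the PROTEUS online learning library is not limited to this model, and the names are illustrative only:

```python
class OnlineLinearModel:
    """Linear model updated one sample at a time (no full retraining pass)."""

    def __init__(self, n_features, lr=0.01):
        self.w = [0.0] * n_features   # feature weights
        self.b = 0.0                  # bias term
        self.lr = lr                  # learning rate

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def update(self, x, y):
        """Single gradient step on the squared error of one new sample."""
        err = self.predict(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err


model = OnlineLinearModel(n_features=2, lr=0.1)
# Each incoming sensor sample immediately adjusts the model
for x, y in [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0)] * 200:
    model.update(x, y)
```

Because every update touches only the current sample, the model adapts continuously as the stream evolves; a distributed version would apply such updates asynchronously on each worker rather than locking a shared model.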

There are multiple stages in developing such a machine learning model for use in PROTEUS application:

• The first phase involves prototyping, where different models are developed to find the best one (model selection).

• Once a prototype model is satisfactory, it is deployed into production, where it goes through further testing on live data.

Figure 5 illustrates this machine learning model development and evaluation workflow.


Figure 5 PROTEUS Machine learning model development and evaluation workflow

Such machine learning models will be developed to predict the degree of flatness of the coils, to enhance ArcelorMittal operators' expert knowledge, and to help them take action accordingly, allowing them to identify the defective coils.


4 KPIs, Evaluation and Benchmark Plan

It is best to benchmark what is important to users. In D2.1, the PROTEUS project initially gathered some requirements about acceptable response times (sub-second) and the kind of prediction performance that is expected. The KPIs and benchmarks will be designed to satisfy all of these requirements.

4.1 KPIs and Benchmarks

In general, a benchmark evaluates a system in terms of its performance on a certain workload. Workloads are defined by certain operations, which can be represented with a few numbers called load parameters. Performance describes a system's ability to cope with a workload. Before defining performance, we need to explicitly describe the workload on the system; only then can we discuss performance growth questions, such as what happens if the workload doubles (scalability) and how easily the system handles the workload, which is generally divided into response time (the time between a user request and the system's response) and throughput (how much work is accomplished over a period of time). For a specific application such as PROTEUS, the selected workloads, operations and parameters depend on the application requirements, tasks and architecture of the system. Thus, the workload and parameters are connected to the PROTEUS scenarios, requirements and system components. Based on the requirements shown in D2.1, partners selected a set of operations, performance metrics and KPIs that adequately represent the requirements of the ArcelorMittal use case. Partners also specified the typical operations and workload patterns in order to understand the different stages in the manufacturing process and the analytic processes leading to the desired evaluation and benchmarking requirements.
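For illustration, the two headline performance measures, throughput and percentile response time, can be computed from a log of request timings as in the following sketch (the helper names and example numbers are hypothetical; the PROTEUS benchmark harness defines its own tooling):

```python
def throughput(num_requests, elapsed_seconds):
    """Requests completed per second over a measurement window."""
    return num_requests / elapsed_seconds


def percentile(latencies_ms, p):
    """p-th percentile response time, nearest-rank method."""
    ordered = sorted(latencies_ms)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Example latency log in milliseconds (made-up numbers)
latencies = [120, 95, 300, 110, 105, 98, 250, 102, 99, 101]

p90 = percentile(latencies, 90)            # 90th-percentile response time
tput = throughput(len(latencies), 2.0)     # 10 requests in 2 s -> 5 req/s
```

Reporting a high percentile rather than the mean matters for interactive workloads: a single slow query (here 300 ms) barely moves the average but dominates the tail that users actually experience.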


In Table 3, column 1 lists the three manufacturing/analysis processes that represent the three PROTEUS scenarios defined in Section 3; column 2 lists the workloads and typical operations, which depend on the PROTEUS scenario requirements, tasks and system architecture; column 3 shows the parameters of the workloads in column 2; and the last column lists the KPIs used to measure the performance of the PROTEUS solution. Whereas the scenarios and workloads define the operations and tasks for the system developed by the PROTEUS project, the KPIs are the variables used to measure the performance of the PROTEUS solution. Benchmarks based on these KPIs will evaluate and demonstrate the advancements in large-scale and predictive analytics achieved by the PROTEUS solution. Based on the PROTEUS scenarios, we select the following KPIs: requests per second to the data processing engine, the ratio of requests/responses from the interactive visualization tool, the percentage of active data for visualization, the response latency of interactive visualization requests, and the accuracy and recall of the online prediction model.

Table 3 PROTEUS KPI metrics

Manufacturing/analytic process | Workload / typical operations | Workload parameters | KPIs
Data processing | Query/access historical and online data | Requests per second to the data processing engine; ratio of reads/writes in the database | Amount of data being processed; number of data queries at any given time; speed at which the data is normally processed (workload-dependent)
Visualization | Visualization queries / interactive queries | Real-time/low-latency response time (based on statistics, e.g. the 90th percentile); response latency of interactive visualization requests | Identification of defects/relevant variables; response time of interactive queries such as zoom; amount/percentage of data critical to detecting the defects
Online learning | Drift handling, novelty detection, active learning | Root-mean-square error (RMSE); prediction accuracy; classification recall of the online prediction model | Improvement of defect detection; reduction of defects; for real time, the two most common evaluation indices are the false alarm rate (FAR) and the false detection rate (FDR)

At this stage, the PROTEUS prototypes are still under development. In the next step, the benchmarks will be performed based on the selected KPIs. In the following development stage, the technical team will continue to work with ArcelorMittal to specify the requirements related to their particular systems. This will include:

• issues related to the type and variety of the sensor data selected,

• the volume of data necessary to conduct a useful analysis (i.e., how many of the 7870 variables need to be queried),

• the reliability of the data,

• the speed at which the data should be analysed (in milliseconds), and

• the effectiveness of the visualisation techniques.
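For illustration, the classification-oriented KPIs listed in Table 3 (accuracy, recall, FAR, FDR) can be computed from confusion-matrix counts as in the sketch below. The FAR/FDR formulas shown are one common convention and are an assumption here, as this deliverable does not fix them formally:

```python
def classification_kpis(tp, fp, tn, fn):
    """KPIs from confusion-matrix counts of a defect-detection model.

    tp/fn: defective coils correctly/incorrectly classified;
    tn/fp: non-defective coils correctly/incorrectly classified.
    FAR and FDR use one common convention (an assumption here).
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0   # defects actually caught
    far = fp / (fp + tn) if (fp + tn) else 0.0      # false alarm rate
    fdr = fp / (fp + tp) if (fp + tp) else 0.0      # false detection rate
    return {"accuracy": accuracy, "recall": recall, "FAR": far, "FDR": fdr}


# Hypothetical evaluation counts for a defect-detection run
kpis = classification_kpis(tp=80, fp=10, tn=100, fn=20)
```

Tracking FAR and recall together is what matters operationally: a model can trivially maximise recall by flagging every coil, but only at the cost of an unusable false alarm rate.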

At the end, the project will undertake an in-depth examination of the different big data benchmarks currently in use, to assess the extent to which these are relevant given that PROTEUS focuses on the ArcelorMittal use case and the specific associated requirements identified.

The first scenario, which develops the integrated processing engine to analyse both batch data (data-at-rest) and data streams (data-in-motion), will focus on the low-latency aspects related to the scalability, velocity and variety of data streams. PROTEUS will meet the low-latency streaming requirements; Flink is the selected platform because it supports true streaming with fault tolerance as well as iterative algorithms, a feature that is vital for implementing machine learning algorithms.

The visualization requirements of PROTEUS involve interacting with large amounts of historical data in real time through visualization tool interfaces. The requirements of low-latency response will address:

• data volume: query complexity and query rate

• data velocity: data generation rate

• processing speed: data processing (historical/real time)

• visualization: low-latency response

The online machine learning will focus on the algorithmic aspects related to the scalability, velocity and variety of data streams. These features of Apache Flink make it the most suitable core technology for implementing the PROTEUS advances in online scalable machine learning algorithms and the incremental approach proposed to achieve real-time interactive visual analytics. Both machine learning and visualization will address:

• real-time data

• large amounts of data

Online evaluation measures live metrics of the deployed model on online data; offline evaluation measures offline metrics of the prototyped model on historical data (and sometimes on live data as well, to detect concept drift from new types of steel). An online learning model continuously adapts to incoming data, and it has a different training and evaluation workflow. Online and offline evaluations may measure different metrics. Offline evaluation might use metrics such as accuracy or precision/recall. Online evaluation, on the other hand, might measure metrics such as the real-time flatness measurements of streaming data. Such real-time flatness measurements are not available in the current historical data, but they are closer to what the steel manufacturer's business really cares about.

Note that there are two sources of data: historical and stream. Many statistical models assume that the distribution of data stays the same over time, but in practice the distribution changes over time; this is called distribution drift. In the PROTEUS scenario, the types of steel can in theory change every month, sometimes every day; what was produced yesterday may no longer be relevant today. Hence it is important to be able to detect distribution drift and adapt the model accordingly. One way to detect distribution drift is to continue to evaluate the model's performance on the validation metric on live data. If the performance is comparable to the validation results obtained when the model was built, the model still fits the data. When performance starts to degrade, it is probable that the distribution of live data has drifted sufficiently from the historical data, and the model needs to be retrained. Monitoring for distribution drift is often done "offline" from the production environment, and hence it is part of the offline evaluation. Online evaluation measures the metrics that really matter, such as the flatness measurements of steel coils, which can be predicted from the relevant streaming data but are not directly available in historical and live data.
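Such a drift-monitoring loop can be sketched as follows; the window size and tolerance are hypothetical placeholders, and the production criteria would be tuned with ArcelorMittal:

```python
def detect_drift(baseline_error, live_errors, window=5, tolerance=0.2):
    """Flag drift when the rolling mean validation error on live data
    exceeds the offline baseline by more than `tolerance` (relative)."""
    if len(live_errors) < window:
        return False                       # not enough live data yet
    recent = live_errors[-window:]
    rolling = sum(recent) / window
    return rolling > baseline_error * (1 + tolerance)


baseline = 0.10                            # RMSE at model-building time
history = [0.11, 0.10, 0.12, 0.18, 0.22, 0.25, 0.24, 0.26]
drifted = detect_drift(baseline, history)  # trigger retraining when True
```

When the flag fires, the offline pipeline would retrain the model on recent data and redeploy it, closing the loop between offline and online evaluation described above.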

Considering the aforementioned aspects, the proposed solution should be able to provide near-real-time analysis of the HSM process and discover unexpected situations, anomalies in the process, and existing patterns. The proposed machine learning pipeline should be parallelizable and scalable, and provide incremental processing of the data.

The identification of PROTEUS improvements will occur through an evaluation process. It will be based on data supplied by ArcelorMittal with known parameters, error rates and a timeline for the identification of defective products, which can be used to measure the improvements achieved by each PROTEUS software prototype. The current datasets are divided into train/test sets to measure performance.

4.2 Planning for Running the Benchmark and Analyzing Results

Based on a quantitative and qualitative evaluation of the methods and tools resulting from the PROTEUS investigations, the aim of the benchmarking is to measure the improvements gained by each PROTEUS prototype and by the PROTEUS solution as a whole. The benchmarking process is as follows:

• The first step in planning a benchmark is to identify the problem and the goal.

• The second step is to decide which standard benchmark to use.

• Benchmarking is a complicated and iterative process. The starting point is to prepare the ArcelorMittal production dataset and ensure it can be used for the subsequent benchmarking.

• Next, run queries against the data.

• Finally, design a method of documenting parameters and results, and document each run.

• The documentation method will be a custom-designed database, supported by scripts that help run the benchmarks and analyse the results easily.
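The last two steps above, documenting each run in a custom database, can be sketched as follows. The schema, table and function names are hypothetical (SQLite stands in for whatever store the project adopts); the point is simply that every run records its parameters alongside its results so that prototype iterations can be compared:

```python
import json
import sqlite3
import time

def record_run(db, prototype, params, metrics):
    """Persist one benchmark run (parameters + results) so that runs can
    be compared across prototype iterations. Illustrative schema."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS runs "
        "(ts REAL, prototype TEXT, params TEXT, metrics TEXT)"
    )
    db.execute(
        "INSERT INTO runs VALUES (?, ?, ?, ?)",
        (time.time(), prototype, json.dumps(params), json.dumps(metrics)),
    )
    db.commit()

db = sqlite3.connect(":memory:")  # a file path would be used in practice
record_run(db, "prototype-1", {"workers": 4}, {"throughput_rps": 1200})
print(db.execute("SELECT COUNT(*) FROM runs").fetchone()[0])  # -> 1
```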


Figure 6 shows the overall structure of evaluation and benchmarking work plan.

Figure 6 Overall structure of evaluation and benchmarking work plan

It is not yet clear whether these benchmarks adequately reflect the specific data processing needs of the ArcelorMittal context. As such, the purposes of the benchmarking and validation tasks within PROTEUS are threefold:

• To evaluate the extent to which existing big data benchmarks are relevant to the PROTEUS project, given its specific context within the ArcelorMittal case (evaluate and identify relevant benchmarks)

• To assist the consortium in identifying and measuring the improvement brought by PROTEUS to the ArcelorMittal manufacturing system (identify/establish the baseline)

• To demonstrate that the PROTEUS solution represents an advancement beyond the state of the art in online distributed machine learning in general, beyond the specific use case scenarios.

PROTEUS partners will use this information to develop the requirements for the software aspects of the solution, providing improvements in the amount of data analysed and the timescales associated with this analysis. This will be linked with improved visualisation techniques that enable ArcelorMittal employees to identify potential problems.

The benchmarking and evaluation exercise is also important to enable the consortium to demonstrate the achievements of the PROTEUS solution beyond the ArcelorMittal use case. As introduced, each PROTEUS prototype will be evaluated in two ways: i) under WP2 for the experimental use case, and ii) extended in Task 6.2 through more generic indicators to demonstrate flexibility and impact beyond the specific scenario. Both will feed back into the system design and development process to identify areas for improvement.

[Figure 6 depicts WP2 (Evaluation and benchmarking) producing Scenarios & KPIs and Benchmarking in a continuous evaluation and benchmarking loop with WP3 (Integrated hybrid data processing engine), WP4 (Scalable online Machine Learning) and WP5 (Real time interactive visualization).]


5 Conclusions

This deliverable presents scenario development and KPI definition for the PROTEUS solution. It introduces the common characteristics of big data applications and industrial application benchmarking, and describes the ArcelorMittal problems in context, covering data properties, availability and the required solutions.

Following the project objectives, this report analyses the functional scenarios using big data analytics to identify the functional and technical gaps that will be addressed within the project scope. The scenarios are developed based on the requirements and solutions of the PROTEUS project.

This deliverable selects a set of performance metrics and KPIs that adequately represent the requirements of the ArcelorMittal use case. It also specifies the typical operations and workload patterns, in order to understand the different stages of the steelmaking process and the analytic process, leading to the desired evaluation and benchmarking requirements.

The report also outlines plans for evaluation and benchmarking, in order to analyse results and demonstrate the benefits to the ArcelorMittal system and the incremental improvements gained through each iteration of the PROTEUS solution.

