Data processing

Intelligent Data Analysis

Budapest University of Technology and Economics
Department of Measurement and Information Systems
Fault Tolerant Systems Research Group

http://www.mit.bme.hu/node/8036

Upload: truongdung

Post on 07-Mar-2018



Outline

Data format/representation

Data processing

ETL, workflow support

Outlook: OLAP

Case studies

Data science „process”

https://en.wikipedia.org/wiki/Data_science

DATA FORMAT

Tidy data

3 Simple rules to facilitate statistics and visualization

One variable – one column

One observation – one row

Each type of observational unit – one table

… seems to be trivial

… not true in most practical cases

… not even for statistical tools (e.g. the output of R packages)

Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1-23. https://github.com/hadley/tidy-data

Data originally: long/wide

https://en.wikipedia.org/wiki/Wide_and_narrow_data
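The difference between the two layouts can be sketched in a few lines of plain Python. The helper name `wide_to_long` and the sample values are illustrative assumptions, not taken from the slides; the sketch only shows how one wide row per subject becomes one narrow row per (subject, variable) pair.

```python
def wide_to_long(rows, id_col, var_name, value_name):
    """Turn one wide row per id into one narrow row per (id, variable) pair."""
    narrow_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_col:
                continue  # the identifier is repeated on every narrow row
            narrow_rows.append(
                {id_col: row[id_col], var_name: column, value_name: value}
            )
    return narrow_rows


# Made-up sample data in wide form: one column per variable.
wide = [
    {"person": "Bob", "age": 32, "weight": 168},
    {"person": "Alice", "age": 24, "weight": 150},
]

narrow = wide_to_long(wide, "person", "variable", "value")
for row in narrow:
    print(row)
```

The wide form is convenient for reading and for many modelling tools; the narrow form is easier to extend with new variables because no column has to be added.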

How to use these formats?

Sparse Screening for Exact Data Reduction. Jieping Ye, Arizona State University

Examples for tidy data

http://garrettgman.github.io/tidying/

R dataframe representation:

„tidying”

R: spread(data,key,value)

http://garrettgman.github.io/tidying/
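As a rough illustration of what `spread(data, key, value)` does, here is a minimal Python sketch, not the R implementation. It assumes every column other than `key` and `value` identifies the observation; the function name mirrors the R call, while the sample figures are made up.

```python
def spread(rows, key, value):
    """Move the entries of the `key` column into separate columns.

    All columns other than `key` and `value` are treated as identifiers
    of one observation (an assumption of this sketch).
    """
    id_cols = [c for c in rows[0] if c not in (key, value)]
    wide = {}
    for row in rows:
        ident = tuple(row[c] for c in id_cols)
        # One output row per distinct identifier combination.
        target = wide.setdefault(ident, dict(zip(id_cols, ident)))
        target[row[key]] = row[value]
    return list(wide.values())


# Made-up data: two variables stored in a single key/value column pair.
messy = [
    {"country": "A", "year": 2000, "key": "cases", "value": 10},
    {"country": "A", "year": 2000, "key": "population", "value": 1000},
    {"country": "B", "year": 2000, "key": "cases", "value": 20},
    {"country": "B", "year": 2000, "key": "population", "value": 2000},
]

tidy = spread(messy, "key", "value")
for row in tidy:
    print(row)
```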

„tidying”

R: spread(data,key,value)

Generalization?

http://garrettgman.github.io/tidying/

Data restructuring examples (in R)

https://www.r-statistics.com/2012/01/aggregation-and-restructuring-data-from-r-in-action/

DATA STORAGE

Common data storage techniques

.CSV

o Majority of inputs

o Length? Header? Encoding?

DB with a schema (in memory?)

Graph databases, ontologies, RDF…

Key-value stores (redis)

Time series databases (openTSDB, influxDB)

o Time series + metadata

„Data in motion”

o Streams as input for processing/analysis
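The three CSV questions in the list above (length, header, encoding) can be checked before any analysis starts. The following sketch uses only the Python standard library. The function name `inspect_csv`, the candidate encodings, and the header heuristic (a first row without numbers followed by rows with numbers) are assumptions made for illustration.

```python
import csv
import io


def _is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


def inspect_csv(raw, encodings=("utf-8", "latin-1")):
    """Report encoding, row count, field counts and a crude header guess."""
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
            break
        except UnicodeDecodeError:
            continue
    else:
        raise ValueError("none of the candidate encodings worked")

    rows = list(csv.reader(io.StringIO(text)))
    field_counts = sorted({len(r) for r in rows})  # more than one value: ragged file
    first_is_text = bool(rows) and not any(_is_number(c) for c in rows[0])
    later_has_numbers = any(_is_number(c) for r in rows[1:] for c in r)
    return {
        "encoding": encoding,
        "rows": len(rows),
        "field_counts": field_counts,
        "looks_like_header": first_is_text and later_has_numbers,
    }


# Made-up sample: Latin-1 bytes that are not valid UTF-8.
sample = "time,host,cpu\n1,alpha,0.5\n2,béta,0.7\n".encode("latin-1")
report = inspect_csv(sample)
print(report)
```

Note that `latin-1` accepts any byte sequence, so it only works as a last-resort fallback; a real pipeline would need a better informed choice.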

Time series example: influxDB

Data: measurement

o Fields, tags, timestamp
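In InfluxDB's line protocol a point is written as `measurement,tag=value field=value timestamp`. The sketch below builds such a line in plain Python. It is simplified on purpose: only float fields are handled, and the helper names, the measurement `cpu`, the tags, and the timestamp are hypothetical examples rather than anything from the slides.

```python
def escape_key(text):
    """Escape commas, equals signs and spaces in tag keys, tag values, field keys."""
    return text.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def to_line(measurement, tags, fields, timestamp_ns):
    """Build one line-protocol record (float fields only in this sketch)."""
    measurement_part = measurement.replace(",", "\\,").replace(" ", "\\ ")
    # Tags are indexed metadata; sorting them keeps the output deterministic.
    tag_part = "".join(
        f",{escape_key(k)}={escape_key(v)}" for k, v in sorted(tags.items())
    )
    # Fields carry the measured values.
    field_part = ",".join(
        f"{escape_key(k)}={float(v)}" for k, v in fields.items()
    )
    return f"{measurement_part}{tag_part} {field_part} {timestamp_ns}"


line = to_line(
    "cpu",
    {"region": "eu", "host": "server01"},
    {"usage": 0.64},
    1520380800000000000,  # nanoseconds since the Unix epoch
)
print(line)
```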

Dashboards… (e.g. Grafana)

https://grafana.com/dashboards/1443

DATA PROCESSING WORKFLOW & TOOLS

ETL

„Extract-Transform-Load”

Originally: to fill a snowflake/star schema

In data science: create dataframes

Cleaning tasks

o Standardization

o Normalization

o Deduplication

o Enrichment

o Clear/fill NAs
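Several of the cleaning tasks listed above can be sketched on a small list of records. The interpretation of each term is an assumption of this example: standardization is read as unifying text formats, normalization as min-max scaling, and missing values are filled with the mean. Enrichment is left out because it needs an external data source. The field names `host`, `time` and `load` are made up.

```python
from statistics import mean


def clean(records):
    # Standardization (read here as unifying formats): trim and lower-case names.
    standardized = [{**r, "host": r["host"].strip().lower()} for r in records]

    # Deduplication: keep the first record for each (host, time) pair.
    seen, unique = set(), []
    for r in standardized:
        key = (r["host"], r["time"])
        if key not in seen:
            seen.add(key)
            unique.append(r)

    # Fill NAs: replace missing loads with the mean of the known ones.
    fill = mean(r["load"] for r in unique if r["load"] is not None)
    filled = [
        {**r, "load": fill if r["load"] is None else r["load"]} for r in unique
    ]

    # Normalization (read here as min-max scaling to the range 0..1).
    lo = min(r["load"] for r in filled)
    hi = max(r["load"] for r in filled)
    span = (hi - lo) or 1.0  # avoid division by zero for constant columns
    return [{**r, "load_norm": (r["load"] - lo) / span} for r in filled]


records = [
    {"host": " Alpha ", "time": 1, "load": 2.0},
    {"host": "alpha", "time": 1, "load": 2.0},  # duplicate once standardized
    {"host": "beta", "time": 1, "load": None},  # missing value
    {"host": "gamma", "time": 1, "load": 6.0},
]

cleaned = clean(records)
for row in cleaned:
    print(row)
```

The order of the steps matters: deduplicating before standardization would miss the first two records, which differ only in whitespace and case.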

Example data processing workflow (KNIME)

Steps: reading, filtering/aggregation, transformation, plotting, …

Status of the concrete execution

KNIME

Measurement processing: RapidMiner

Delete unnecessary attribute

Calculating averages (interval)

Read CSV

Format conversion

Identifying source node

Filter to cpu.usage.average

Add machine information
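Two of the labelled steps, filtering to one metric and calculating interval averages, can be sketched in plain Python. This is not the RapidMiner process itself; the record layout (`node`, `metric`, `time`, `value`), the 60-second interval, and the sample values are assumptions for illustration.

```python
from collections import defaultdict


def interval_averages(samples, metric, interval_s):
    """Average the values of one metric per node and per time interval."""
    buckets = defaultdict(list)
    for s in samples:
        if s["metric"] != metric:
            continue  # filter to the metric of interest
        start = s["time"] - s["time"] % interval_s  # start of the interval
        buckets[(s["node"], start)].append(s["value"])
    return {key: sum(v) / len(v) for key, v in sorted(buckets.items())}


# Made-up measurements; time is in seconds.
samples = [
    {"node": "vm1", "metric": "cpu.usage.average", "time": 0, "value": 10.0},
    {"node": "vm1", "metric": "cpu.usage.average", "time": 30, "value": 20.0},
    {"node": "vm1", "metric": "mem.usage.average", "time": 30, "value": 99.0},
    {"node": "vm1", "metric": "cpu.usage.average", "time": 60, "value": 40.0},
]

averages = interval_averages(samples, "cpu.usage.average", 60)
print(averages)
```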

CASE STUDY

Processing of telco data

SOME BACKGROUND… OLAP

On-Line Analytical Processing (img: snowplowanalytics.com)

Business intelligence approach

Extensively used since the early 2000s

o Still! (although not as popular as it was, at least in academic research)

Features

o Multi-dimensional analysis

o Fast query execution

o Exploratory analysis of data

• Support ad-hoc queries

o Report generation

o (Visualization)

On-Line Analytical Processing (img: snowplowanalytics.com)

Central concept: OLAP cube

o Multi-dimensional array:

• set of separate data

– Dimensionality > 3

– technically a hypercube

– ~ a multi-dimensional spreadsheet

o Slicer: dimension held constant

• For a given query (e.g. sales in a particular year)

OLAP process (img: Pranav Joshi)

OLAP operations

Operations

o Slicing & dicing

o Drill up & down

o Pivoting

Easy to visualize by the cube itself
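The same operations can also be expressed on a flat list of facts. The sketch below is a minimal illustration, not an OLAP engine: the dimensions (`year`, `region`, `product`), the `sales` measure, the figures, and the function names are all made up. Drilling down is simply rolling up while keeping more dimensions.

```python
from collections import defaultdict

# Made-up fact table: three dimensions and one measure.
FACTS = [
    {"year": 2016, "region": "north", "product": "A", "sales": 10},
    {"year": 2016, "region": "south", "product": "A", "sales": 20},
    {"year": 2017, "region": "north", "product": "A", "sales": 30},
    {"year": 2017, "region": "north", "product": "B", "sales": 40},
    {"year": 2017, "region": "south", "product": "B", "sales": 50},
]


def dice(facts, **allowed):
    """Dicing: keep cells whose dimension values fall in the allowed sets."""
    return [f for f in facts if all(f[d] in vals for d, vals in allowed.items())]


def slice_(facts, dimension, value):
    """Slicing: hold one dimension constant at a single value."""
    return dice(facts, **{dimension: {value}})


def roll_up(facts, keep, measure="sales"):
    """Drill up: aggregate the measure over all dimensions not in `keep`."""
    totals = defaultdict(int)
    for f in facts:
        totals[tuple(f[d] for d in keep)] += f[measure]
    return dict(totals)


def pivot(facts, row_dim, col_dim, measure="sales"):
    """Pivoting: lay two dimensions out as rows and columns of a table."""
    table = defaultdict(dict)
    for (r, c), total in roll_up(facts, [row_dim, col_dim], measure).items():
        table[r][c] = total
    return dict(table)


print(slice_(FACTS, "year", 2017))
print(dice(FACTS, region={"north"}, product={"A"}))
print(roll_up(FACTS, ["year"]))
print(pivot(FACTS, "region", "year"))
```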

Slicing (img: Wikipedia)

Dicing (img: Wikipedia)

Drill up & down (img: Wikipedia)

Pivoting (img: Wikipedia)

OLAP vs. “regular/modern” data analysis

OLAP cube: like a set of spreadsheets

o multi-dimensional

o interlinked

Modern data analysis: “flat” data frames

o Modern machine learning algorithms:

• require (?) single dataframes

Operations: basically the same (slicing, dicing, drill up & down, pivoting)

CASE STUDY 3

„Deep insights from observations with the help of modern data analysis tools” – CECRIS IAPP project

Railway accidents: casualties by type of accident, Department for Transport Statistics, Rail Statistics, Table TSGB0805 (RAI0501)

(https://www.gov.uk/government/organisations/department-for-transport/series/rail-statistics)

Analysis: next class. For now, let us process the data.

PowerBI data import

Load data to PowerQuery

Remove unnecessary top rows

Remove unnecessary bottom rows

Remove blank rows

Remove columns

Promote first row to header

Filter “total” and “all” rows

Split first column

Replace empty values with null in first column

Replace empty values with null in second column

Remove colon character from first column
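Most of the PowerQuery steps above have a direct counterpart in a few lines of plain Python. The sketch below is not the PowerQuery script and does not use the real table: the sheet layout, the numbers, and the function name `clean_sheet` are made up, and "total"/"all" rows are assumed to be those whose first cell starts with one of these words. The column split is omitted because the slides do not state the splitting rule.

```python
def clean_sheet(rows, top=0, bottom=0, drop_columns=()):
    """Apply the row and column clean-up steps to a sheet given as lists of strings."""
    body = rows[top:len(rows) - bottom]                    # remove top and bottom rows
    body = [r for r in body if any(c.strip() for c in r)]  # remove blank rows
    body = [                                               # remove unwanted columns
        [c for i, c in enumerate(r) if i not in drop_columns] for r in body
    ]
    header, data = body[0], body[1:]                       # promote first row to header
    data = [                                               # filter out derived rows
        r for r in data
        if not r[0].strip().lower().startswith(("total", "all"))
    ]
    cleaned = []
    for r in data:
        first = r[0].replace(":", "").strip()              # remove colon character
        cells = [first] + [c.strip() for c in r[1:]]
        # Replace empty values with None (null).
        cleaned.append({h: (c if c != "" else None) for h, c in zip(header, cells)})
    return cleaned


# Made-up sheet with a title row, a footnote, a blank row and a total row.
sheet = [
    ["Table X (hypothetical title)", "", ""],
    ["", "", ""],
    ["Type", "2015", "2016"],
    ["Train accidents:", "3", ""],
    ["", "", ""],
    ["Movement accidents:", "5", "4"],
    ["Total", "8", "4"],
    ["Source: hypothetical footnote", "", ""],
]

table = clean_sheet(sheet, top=2, bottom=1)
for row in table:
    print(row)
```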

Automation: RapidMiner process

https://my.rapidminer.com/nexus/account/index.html#downloads

Read Excel

Read measurements

Filter rows

Split

Rename attributes

Loop attributes

Replace spaces

Replace colon character
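The attribute loop with its two replace steps amounts to renaming every column. A minimal Python sketch follows; the slides do not say what spaces are replaced with, so the underscore is an assumption, as are the function name and the sample attribute names.

```python
def tidy_attribute_names(names, space_replacement="_"):
    """Loop over attribute names, dropping colons and replacing spaces."""
    return [
        name.strip().replace(":", "").replace(" ", space_replacement)
        for name in names
    ]


renamed = tidy_attribute_names(["Train accidents:", "Year", " Movement accidents "])
print(renamed)
```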

Header row problem

To be kept

To be removed

(derived)