TRANSCRIPT
Genoveva Vargas-Solar
French Council of Scientific Research, LIG
Designing Experiments
Overview of data management & exploitation solutions
http://vargas-solar.com/data-centric-smart-everything/
Volume
Velocity
Variety
Veracity
Value
1000 Yottabytes = 1 Brontobyte
1000 Brontobytes = 1 Geopbyte
Computational Science
Digital humanities
Social Data Science
Network Science
Computation (Algorithm: mathematical model)
Experiment setting (Architecture: computing environment)
Big data variety: the right model according to the data
DATABASE ENVIRONMENTS LANDSCAPE
22/09/2019
Data collections rawness degree: Raw → Curated
← Increased scalability & speed | Increased versatility & complexity →
Key-value stores · Document stores · Extensible record stores · Graph databases · NewSQL / Relational databases
Typical operations: Look up (R/W) · Aggregation · Processing · Navigation · Querying · Analytics
DATA CURATION AT SCALE
Tuple: Row in a relational table, where attributes are pre-defined in a schema and the values are scalar
Document: Allows values to be nested documents or lists, as well as scalar values. Attributes are not defined in a global schema
Extensible record: Hybrid between tuple and document, where families of attributes are defined in a schema, but new attributes can be added on a per-record basis
DATA MODELS
Key-value: Systems that store values and an index to find them, based on a key
Document: Systems that store documents, providing index and simple query mechanisms
Extensible record: Systems that store extensible records that can be partitioned vertically and horizontally across nodes
Graph: Systems that store data modelled as graphs, where nodes can represent content modelled as document or key-value structures and arcs represent a relation between the data modelled by the nodes
Relational: Systems that store, index and query tuples
DATA STORES
"Simplest data stores": use a data model similar to the memcached distributed in-memory cache
- Single key-value index for all data
- Provide a persistence mechanism
- Replication, versioning, locking, transactions, sorting
- API: inserts, deletes, index lookups
- No secondary indices or keys
SYSTEM ADDRESS
Redis code.google.com/p/redis
Scalaris code.google.com/p/scalaris
Tokyo Cabinet tokyocabinet.sourceforge.net
Voldemort project-voldemort.com
Riak riak.basho.com
Membrain schoonerinfotech.com/products
Membase membase.com
KEY-VALUE STORES
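The API above (a single key index with inserts, deletes and lookups) can be sketched in a few lines of Python; the class and method names are illustrative and not taken from any of the listed systems:

```python
# Minimal sketch of a key-value store: one key index for all data,
# insert/delete/lookup, no secondary indices (as described above).

class KeyValueStore:
    def __init__(self):
        self._index = {}          # single key-value index

    def insert(self, key, value):
        self._index[key] = value

    def delete(self, key):
        self._index.pop(key, None)

    def lookup(self, key):
        return self._index.get(key)

store = KeyValueStore()
store.insert("user:42", {"name": "Ada"})
print(store.lookup("user:42"))   # -> {'name': 'Ada'}
store.delete("user:42")
print(store.lookup("user:42"))   # -> None
```

A real system adds the persistence, replication and versioning mechanisms listed above on top of this same tiny API.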
SELECT name FROM group WHERE gid IN (SELECT gid FROM group_member WHERE uid = me())
SELECT name, pic, profile_url FROM user WHERE uid = me()
SELECT name, pic FROM user WHERE online_presence = "active" AND uid IN (SELECT uid2 FROM friend WHERE uid1 = me())
SELECT name FROM friendlist WHERE owner = me()
SELECT message, attachment FROM stream WHERE source_id = me() AND type = 80
https://developers.facebook.com/docs/reference/fql/
- Support more complex data: pointerless objects, i.e., documents
- Secondary indexes (e.g., B-trees), multiple types of documents (objects) per database, nested documents and lists
- Automatic sharding (to scale writes), no explicit locks, weaker concurrency (eventual consistency, to scale reads) and atomicity properties
- API: select, delete, getAttributes, putAttributes on documents
- Queries can be distributed in parallel over multiple nodes using a map-reduce mechanism
SYSTEM ADDRESS
SimpleDB amazon.com/simpledb
CouchDB couchdb.apache.org
MongoDB mongodb.org
Terrastore code.google.com/terrastore
DOCUMENT STORES
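A hedged sketch of the ideas above: schema-less documents, one secondary index, and the simple putAttributes/getAttributes/select/delete API. All names are illustrative, not the actual SimpleDB or MongoDB API:

```python
# Toy document store: documents are nested dicts/lists, with a secondary
# index on one attribute so select() avoids a full scan.

class DocumentStore:
    def __init__(self, indexed_attr):
        self._docs = {}                 # _id -> document
        self._indexed_attr = indexed_attr
        self._secondary = {}            # attribute value -> set of _ids

    def putAttributes(self, _id, doc):
        self._docs[_id] = doc
        value = doc.get(self._indexed_attr)
        self._secondary.setdefault(value, set()).add(_id)

    def getAttributes(self, _id):
        return self._docs.get(_id)

    def select(self, value):
        # secondary-index lookup instead of scanning every document
        return [self._docs[i] for i in self._secondary.get(value, ())]

    def delete(self, _id):
        doc = self._docs.pop(_id, None)
        if doc is not None:
            self._secondary.get(doc.get(self._indexed_attr), set()).discard(_id)

store = DocumentStore(indexed_attr="type")
store.putAttributes("e1", {"type": "VehicleObstruction", "status": "active"})
store.putAttributes("e2", {"type": "Roadworks", "lanes": ["left"]})
print(store.select("Roadworks"))   # -> [{'type': 'Roadworks', 'lanes': ['left']}]
```

Note how, unlike the key-value sketch, documents have no global schema and values may themselves be lists or nested documents.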
- Basic data model is rows and columns
- Basic scalability model is splitting rows and columns over multiple nodes
- Rows are split across nodes through sharding on the primary key, split by range rather than by a hash function, so queries on ranges of values do not go to every node
- Rows are analogous to documents: variable number of attributes, attribute names must be unique; rows are grouped into collections (tables)
- Columns are distributed over multiple nodes using "column groups" (which columns are best stored together); column groups must be pre-defined with extensible record stores
SYSTEM ADDRESS
HBase hbase.apache.org
HyperTable hypertable.org
Cassandra incubator.apache.org/cassandra
EXTENSIBLE RECORD STORES
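The range-sharding idea above can be sketched as follows. The "nodes" here are just in-process dictionaries and the boundaries are invented, but the point survives: a range query visits only the shards whose key range overlaps it:

```python
# Range-based sharding sketch: rows are assigned to shards by ranges of
# the primary key (not by hash), as extensible record stores do.

import bisect

class RangeShardedTable:
    def __init__(self, boundaries):
        # boundaries [b0, b1, ...] split the key space into len+1 shards
        self._boundaries = boundaries
        self._shards = [dict() for _ in range(len(boundaries) + 1)]

    def put(self, key, row):
        self._shards[bisect.bisect_right(self._boundaries, key)][key] = row

    def range_query(self, lo, hi):
        # only shards whose range overlaps [lo, hi] are visited
        first = bisect.bisect_right(self._boundaries, lo)
        last = bisect.bisect_right(self._boundaries, hi)
        rows = []
        for shard in self._shards[first:last + 1]:
            rows += [r for k, r in sorted(shard.items()) if lo <= k <= hi]
        return rows

table = RangeShardedTable(boundaries=[100, 200])   # three "nodes"
for k in (5, 120, 150, 250):
    table.put(k, {"key": k})
print(table.range_query(100, 200))   # -> [{'key': 120}, {'key': 150}]
```

With hash sharding the same range query would have to contact every node, which is exactly the contrast the slide draws.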
- SQL: rich declarative query language
- Databases enforce referential integrity
- ACID semantics
- Well understood operations: configuration, care and feeding, backups, tuning, failure and recovery, performance characteristics
- Use small-scope operations (challenge: joins do not scale with sharding)
- Use small-scope transactions (ACID transactions are inefficient with communication and 2PC overhead)
- Shared-nothing architecture for scalability
- Avoid cross-node operations
SYSTEM ADDRESS
MySQL Cluster mysql.com/cluster
VoltDB voltdb.com
Clustrix clustrix.com
ScaleDB scaledb.com
ScaleBase scalebase.com
NimbusDB nimbusdb.com
SCALABLE RELATIONAL SYSTEMS
Theoretical & practical aspects (DBMS)
- Domains & relations: R ⊆ D1 × D2 × ... × Dn; relational algebra
- 1st-order predicate logic
- Languages: SQL (wins), QUEL, QBE
- DBMS prototypes (1975), products (1980)
- A major improvement in DB: provide data independence & a simple, tabular view of data
- Normal forms & dependencies (DB design, consistency)
- Controversial: missing values, duplicates
- More than 30 years: maturity!
Algebra operators: R × S (product), R ∪ S (union), R − S (difference), R[a] (projection), R : φ (selection), R * S (join)
1970 - 2000 RELATIONAL DB
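As an illustration only (not a DBMS), the algebra operators listed above can be mimicked with Python sets of tuples:

```python
# Relational algebra over sets of tuples. Attribute positions stand in
# for attribute names; R and S are tiny invented relations.

R = {(1, "a"), (2, "b")}
S = {(2, "b"), (3, "c")}

product    = {r + s for r in R for s in S}        # R × S
union      = R | S                                # R ∪ S
difference = R - S                                # R − S
projection = {(t[0],) for t in R}                 # R[a]: keep first attribute
selection  = {t for t in R if t[0] > 1}           # R : φ with φ = (a > 1)
# R * S: here both relations share the full schema, so the natural join
# degenerates to set intersection
join       = {r for r in R if r in S}

print(union)        # {(1, 'a'), (2, 'b'), (3, 'c')}
print(difference)   # {(1, 'a')}
print(join)         # {(2, 'b')}
```

The set semantics (no duplicates, no order) matches the pure relational model; SQL's bag semantics relaxes exactly this, which is one of the "controversial" points above.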
https://www.w3schools.com/sql/trysql.asp?filename=trysql_select_join_inner2
HANDS ON EXAMPLE
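The hands-on join can also be reproduced offline with Python's built-in sqlite3 module; the two tables and their contents below are invented stand-ins for the demo database:

```python
# INNER JOIN example, like the w3schools hands-on, run locally against
# an in-memory SQLite database with invented rows.

import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE Customers (CustomerID INTEGER, CustomerName TEXT)")
cur.execute("CREATE TABLE Orders (OrderID INTEGER, CustomerID INTEGER)")
cur.executemany("INSERT INTO Customers VALUES (?, ?)",
                [(1, "Alfreds"), (2, "Ana")])
cur.executemany("INSERT INTO Orders VALUES (?, ?)",
                [(10308, 2), (10309, 1)])

# only customer/order pairs with matching CustomerID survive the join
cur.execute("""
    SELECT Orders.OrderID, Customers.CustomerName
    FROM Orders
    INNER JOIN Customers ON Orders.CustomerID = Customers.CustomerID
    ORDER BY Orders.OrderID
""")
print(cur.fetchall())   # [(10308, 'Ana'), (10309, 'Alfreds')]
```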
{
  "geometry": {"type": "Point", "coordinates": [4.821773, 45.7513]},
  "full_location": "Autoroute du Soleil, 69005 Lyon",
  "_id": "Criter11185353",
  "properties": {
    "confidentiality": "noRestriction",
    "probability": "certain",
    "mobility": "",
    "creationtime": "2016-06-07 19:40:00",
    "publiceventtype": "",
    "networkmanagementtype": "",
    "observationtime": "2016-06-07 19:40:00",
    "last_update": "2016-06-07 19:43:30",
    "numberoflanesrestricted": "0",
    "effectonroadlayout": "",
    "creator": "CRITER",
    "id": "Criter11185353",
    "firstsupplierversiontime": "2016-06-07 19:40:00",
    "version": "1",
    "linkname": "",
    "type": "VehicleObstruction",
    "status": "active",
    "direction": "bothWays",
    "locationtype": "nonLinkedPoint",
    "disturbanceactivitytype": "",
    "last_update_fme": "2016-06-07 19:44:29",
    "endtime": "",
    "creationreference": "",
    "informationstatus": "real",
    "townname": "Voie Rapide Urbaine de Lyon",
    "publiccomment": "Bouchon, km 455|Voie Rapide Urbaine de Lyon",
    "roadmaintenancetype": "",
    "versiontime": "2016-06-07 19:43:26",
    "starttime": "2016-06-07 19:40:00",
    "gid": "39258",
    "abnormaltraffictype": ""
  }
}
Models: relational, key-value, column-oriented / tabular, document-oriented, graph
Raw data collections: tabular (CSV, Excel), media (XML, JSON, BLOB), graph
- How to transform data collections?
- Which is the best adapted model?
→ Polyglot persistence
Approaches deal with transformation rules inspired by the relational case
MODELING DATA COLLECTIONS
- External medium (tape or cards)
- Data dedicated to one application
- Data seen as flowing through a stationary processor
- Not much different from Hollerith's tabulator: "batch processing"
(Figure: Orders and a Master Tape flow through the Computer, producing an Invoice and an Updated Master)
SEQUENTIAL DATA PROCESSING
- The key property of disks: random access to stored data
- Records can contain pointers to other records
- Records can be indexed by their values and accessed directly using B-trees
(Pictured: Prof. Rudolf Bayer, Tech. Univ. Munich; Don Chamberlin)
DISKS MADE MODERN DATABASES POSSIBLE
Jim Gray
IDENTIFICATION: document
ENVIRONMENT: OS
DATA: Files/Records
PROCEDURE: code
COBOL divisions: AUTHOR, PROGRAM-ID, INSTALLATION, SOURCE-COMPUTER, OBJECT-COMPUTER, SPECIAL-NAMES, FILE-CONTROL, I-O-CONTROL, DATE-WRITTEN, DATE-COMPILED, SECURITY.
CONFIGURATION SECTION. INPUT-OUTPUT SECTION.
FILE SECTION. WORKING-STORAGE SECTION. LINKAGE SECTION. REPORT SECTION. SCREEN SECTION.
- CODASYL – DBTG (1967): COnference on DAta SYstems Languages, Data Base Task Group
- Defined a DDL for a network data model: set-relationship semantics, cursor verbs
- Isolated from procedures, no encapsulation; the DATA division is the schema's ancestor
(Figure: "us", the code, versus "them", the data)
CODE & DATA: SEPARATED AT BIRTH
- Many applications share data in common
- Data is managed by a centralized system
- Costs are shared
- Redundancy and inconsistency are minimized
- Control is improved
- The access language can be standardized
- Data becomes an enterprise resource
- Shared utilities: backup, recovery, replication, ...
(Figure: a Database Management System mediating access to schema & data)
INTEGRATED DATABASES (MID 60’S)
- Data processing: data analytics [Feldman et al. 2013], data aggregation [Jayram et al. 2007]
- Data collections storage: NoSQL, NewSQL (Hive), data sharding (MongoDB, CouchDB)
Stages: Harvesting & Cleaning → Preserving → Processing & Analysis → Exploiting
DATA COLLECTIONS LIFE CYCLE
Centralized repository containing virtually inexhaustible amounts of raw data to be analysed, fed by heterogeneous sources (e.g., CRM, sensor logs)
DATA LAKE
Repositories grow ever bigger and more complex, to the point that the lake becomes a swamp
DATA SWAMP
Step 1: Harvesting: collecting, vetting, selecting
Step 2: Extracting meta-data: describing, preserving
Step 3: Exploitation: grooming, provisioning
Data wrangling [Terrizzano et al. 2015]
DATA EXPLORATION WORKFLOW
- Quantitative meta-data (e.g., Compass for MongoDB)
- Structural meta-data (e.g., schemata like DTD, XML Schema)
- Semantic meta-data (linked data approaches): linked data collections by topic, geographic location, temporal releases
(Figure: an ontology linking similar, analysed collections such as Eco-Bike and Traffic)
EXTRACTING META-DATA
[Constance, SIGMOD ’16 ]
Harvesting → Extracting meta-data → Exploring; Describing; Preserving
Level 1: no meta-data (sensing & collecting raw data)
Level 2: partial meta-data (size, frequency, freshness, type)
Level 3: complete meta-data (quantitative & qualitative information + semantics)
[Stonebraker et al. 2015]
EXTRACTING META-DATA
exploreCollections( ) over an ontology
Constance: semantic meta-data discovery, structural meta-data discovery
- Which are the available green transport modes in the city?
- Which attributes are used for modelling transport modes?
- Are there data collections dealing with green transport?
Ontology fragment: Transport → Green transport → Cycling road, Electric car
FISHING DATA IN THE LAKE
[Constance, SIGMOD ’16 ]
Ontology fragment: Green transport (Eco bikes, Electric buses, Buses), Transport for rental (Car rental), Downtown Lyon
Example annotation: data regarding traffic downtown, by date; traffic data collected in January 2018
EXTRACTING META-DATA
[Constance, SIGMOD ’16 ]
Harvesting → Extracting meta-data → Exploring; Describing; Preserving
- ETL on parallel data processing platforms: parallel data collection (Flink, Storm, Flume); Spark (RDDs: tables/graphs); Hadoop ecosystem tools (e.g., Pig)
- Structured data provision: NoSQL & NewSQL (parallel)
- Parallel data querying & analytics: Spark (descriptive statistics functions); Hadoop ecosystem tools (e.g., Hive); parallel RDBMS; big data analytics stacks (AsterixDB, BDAS); parallel analytics (Matlab, R)
PREPARING DATA COLLECTIONS
DATA SCIENCE TOOLBOXES
Classical data management functions:
- Data loading
- In-memory/cache/disk indexing
- Data persistence
- Query optimization
- Concurrent access
- Consistency and access control
These functions must be revisited under weaker hypotheses to support the enactment of data science pipelines
DATA CONSUMPTION PHILOSOPHIES
ENACTMENT ENVIRONMENTS
Big Data platforms & stacks, WIDE environments, machine learning services
WHICH TOOLS FOR DATA SCIENCE
Programming language:
- Python is one of the most flexible programming languages because it can be seen as a multi-paradigm language
- Alternatives are MATLAB and R
Fundamental libraries for data scientists in Python: NumPy, SciPy, Pandas and Scikit-learn
- NumPy: provides support for multidimensional arrays, with basic operations on them and useful linear algebra functions
- SciPy: provides a collection of numerical algorithms and domain-specific toolboxes, including signal processing, optimization and statistics
- Matplotlib: tools for data visualization
- Scikit-learn: machine learning library with tools such as classification, regression, clustering, dimensionality reduction, model selection and preprocessing
- Pandas: provides high-performance data structures and data analysis tools for data manipulation with integrated indexing
DATA SCIENCE ECOSYSTEM & INTEGRATED DEVELOPMENT ENVIRONMENT
To get started on solving data-oriented problems, we need to set up our programming environment:
- Decide the programming language version, and whether to install the data science ecosystem as individual toolboxes or as a bundle installation
- For example, Anaconda Python provides integration of all the Python toolboxes and applications for data scientists in a single directory
The integrated development environment (IDE) is an essential tool designed to maximize programmer productivity:
- The basic pieces of any IDE are three: the editor, the compiler (or interpreter) and the debugger
- Examples: PyCharm, WingIDE, Spyder (Scientific Python Development EnviRonment)
WEB INTEGRATED DEVELOPMENT ENVIRONMENT
Web-based IDEs were developed so that not only your code but also your whole environment and executions can be stored on a server:
- The server can be set up in a centre, such as a university or school
- Students can work on their homework either in the classroom or at home
- Students can execute all the previous steps over and over again, change some particular code cell (a segment of the document that may contain source code that can be executed) and execute the operation again
For example, IPython has been issued as a browser version of its interactive console: Jupyter
- Markdown (a wiki text language) cells can be added to introduce algorithms
- It is also possible to insert Matplotlib graphics to illustrate examples, or even web pages
- Experiments can become completely replicable
DATA SCIENCE VIRTUAL MACHINE
COMPUTING CAPACITY
Genoveva Vargas-Solar
French Council of Scientific Research, LIG
Exploring data collections
Descriptive statistics
http://vargas-solar.com/data-centric-smart-everything/
Thanks to Prof. J.L. Zechinelli Martini, UDLAP-LAFMIA, Mexico for our collaborative construction of these slides
* This presentation was created using the content of https://www.python.org/
Descriptive statistics helps to simplify large amounts of data in a sensible way: a simple way to describe the data, presenting quantitative descriptions in a manageable form
Main steps:
- Data preparation: generate statistically valid descriptions
- Descriptive statistics: generate different statistics to describe and summarize the data concisely, and evaluate different ways to visualize them
DESCRIPTIVE STATISTICS
PREPARING DATA
- Obtaining the data: read from a file or obtained by scraping the web
- Parsing the data: format the data, which can be in plain text, fixed columns, CSV, XML, HTML, etc.
- Cleaning the data: a simple strategy is to remove or ignore incomplete records
- Building data structures: a data structure that lends itself to the analysis we are interested in; databases provide a mapping from keys to values, so they serve as dictionaries
Financial parameters related to the US population*
- Features: age, sex, marital status, country, income, education, occupation, capital gain, etc.
- Question: are men more likely to become high-income professionals than women, i.e., to receive an income of over $50,000 per year?
- Preparing data collections:
  - Read and check the data
  - Represent the data, for instance using a tabular data structure with features (columns) and records (rows)
  - Group the data
* UCI's Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Adult
ANALYSING INCOME ACCORDING TO GENDER
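A minimal sketch of the grouping step, on a tiny invented sample shaped like the UCI Adult records (the real file has tens of thousands of rows and many more columns):

```python
# Group records by sex and compute the share of high-income records per
# group. The five records below are invented stand-ins for the dataset.

from collections import Counter

records = [
    {"sex": "Male",   "income": ">50K"},
    {"sex": "Male",   "income": "<=50K"},
    {"sex": "Female", "income": "<=50K"},
    {"sex": "Female", "income": ">50K"},
    {"sex": "Male",   "income": ">50K"},
]

totals = Counter(r["sex"] for r in records)                        # group sizes
high = Counter(r["sex"] for r in records if r["income"] == ">50K")  # >50K per group

for sex in totals:
    print(sex, round(high[sex] / totals[sex], 2))
```

On the real data the same two-line grouping (a Pandas groupby in practice) answers the slide's question directly.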
Measurements and categories represent a sample distribution of a variable, which approximately represents the population distribution of that variable; this lets us make tentative assumptions about the population distribution
Different techniques:
Summarizing the data
Data distributions
Outlier treatment
Kernel density
EXPLORATORY DATA ANALYSIS
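A standard-library sketch of the summarizing and outlier-treatment techniques listed above; the sample values are invented, and the 1.5 × IQR rule is one common convention, not the only one:

```python
# Summary statistics plus a simple IQR-based outlier rule.

import statistics

values = [12, 13, 13, 14, 15, 15, 16, 16, 17, 95]   # invented sample

mean = statistics.mean(values)
median = statistics.median(values)
stdev = statistics.stdev(values)

q1, _, q3 = statistics.quantiles(values, n=4)        # quartiles
iqr = q3 - q1
# flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

print(f"mean={mean:.1f} median={median} stdev={stdev:.1f}")
print("outliers:", outliers)   # [95]
```

Note how the mean is dragged upward by the outlier while the median is not, which is why both are reported.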
https://www.kaggle.com/robikscube/hourly-energy-consumption
DATA ANALYSIS TECHNIQUES
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Given lots of data, discover patterns and models that are:
- Valid: hold on new data with some certainty
- Useful: it should be possible to act on the item
- Unexpected: non-obvious to the system
- Understandable: humans should be able to interpret the pattern
PRINCIPLE
Based on three guiding principles:
- Decision backwards
- Step by step
- Test and learn
Predictive & optimization models; organizational transformation; Big Data
CAPTURING VALUE FROM ADVANCED ANALYTICS
STATISTICS
Concepts:
- Population: collection of objects or items ("units") about which information is sought
- Sample: a part of the population that is observed
Descriptive statistics: simplifies large amounts of data in a sensible way, presenting quantitative descriptions in a manageable form; it is a way to describe data
- Applies the concepts, measures, and terms that are used to describe the basic features of the samples in a study
- These procedures are essential to provide summaries about the samples as an approximation of the population
- Together with simple graphics, they form the basis of every quantitative analysis of data
Inferential statistics: draws conclusions beyond the analysed data; reaches conclusions regarding the hypotheses made; aims at inferring characteristics of the "population" of the data
DATA FOR STATISTICS
To describe the sample data and to infer any conclusion, data preparation is required to generate statistically valid descriptions & conclusions:
1. Harvesting: from a file, or obtained from sources (archives, data stores, sensors)
2. Parsing: depends on what format the data are in, for example plain text, fixed columns, CSV, XML, HTML, etc.
3. Cleaning: survey responses and other data files are almost always incomplete; sometimes there are multiple codes for things such as "not asked", "did not know" and "declined to answer", and there are almost always errors
4. Building data structures: store data in a data structure that lends itself to the analysis we are interested in. If the data fit into memory, building a data structure is usually the way to go; if not, a database is usually built, which is an out-of-memory data structure. Most databases provide a mapping from keys to values, so they serve as dictionaries
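The four steps can be sketched with the standard library on an invented CSV snippet; the final dictionary mirrors the "database as a mapping from keys to values" remark:

```python
# Harvest + parse + clean + build, end to end on a toy CSV.

import csv
import io

raw = """id,age,income
1,39,<=50K
2,,>50K
3,50,>50K
"""

# 1-2. Harvest and parse (here the "file" is an in-memory string)
rows = list(csv.DictReader(io.StringIO(raw)))

# 3. Clean: the simple strategy above, drop records with missing fields
clean = [r for r in rows if all(v.strip() for v in r.values())]

# 4. Build a data structure keyed for the analysis
by_id = {r["id"]: r for r in clean}

print(sorted(by_id))   # ['1', '3']  (record 2 dropped: missing age)
```

For data that do not fit in memory, the same `by_id` mapping would live in an out-of-memory store instead of a dict, as the slide notes.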
(Figure: Venn diagram placing Data Mining at the intersection of Computer Science and Artificial Intelligence, alongside curated knowledge, machine learning, and reverse-engineering the brain)
AI & DATA MINING
Data mining overlaps with:
- Databases: large-scale data, simple queries
- Machine learning: small data, complex models
- CS theory: (randomized) algorithms
Different cultures:
- To a DB person, data mining is an extreme form of analytic processing: queries that examine large amounts of data; the result is the query answer
- To an ML person, data mining is the inference of models; the result is the parameters of the model
In this class we will do both!
(Figure: Data Mining at the intersection of Machine Learning, CS Theory and Database systems)
DATA MINING CULTURES
Description:
- Clustering (Bayesian, hierarchical, k-means, CLARA, PAM)
- Association
- Modelling (LU, QR, PCA=SVD, PARAFAC)
Prediction:
- Classification (neural network, PLSDA, KNN, decision trees)
- Trend prediction
- Regression (PLSR, PCR)
Example questions:
- How many accidents are reported per day?
- What percentage of available bicycles are used downtown?
- Which are the traffic bottleneck regions in the city?
- Is the number of car accidents related to seasons?
- How will the use of bicycles evolve downtown during the summers of the next 5 years?
- Which types of cars have the most accidents on high-speed roads?
- Will increasing the parking cost reduce car traffic in the city and increase the number of people using public transport?
PROCESSING DATA COLLECTIONS
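A minimal 1-D k-means (Lloyd's algorithm) sketch for the clustering techniques named above; the data, initial centroids and k are invented for illustration:

```python
# Lloyd's algorithm in one dimension: alternate assignment (each point
# joins its nearest centroid) and update (centroids move to cluster means).

def kmeans_1d(points, centroids, iterations=10):
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.2, 0.8, 9.8, 10.1, 10.4]       # two obvious groups
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids)   # roughly [1.0, 10.1]
```

The production variants on the slide (CLARA, PAM) refine exactly this loop: they change how representatives are chosen, not the assign/update alternation.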
Spatial analysis of dynamic movements of Vélo'v, Lyon's shared bicycle program [ECCS 2009]
Contribution: PCA and k-means for predicting the usage trend of Vélo'v in Lyon (description, clustering)
PCA: an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some projection of the data comes to lie on the first coordinate (the first principal component), the second greatest variance on the second coordinate, and so on
PROCESSING DATA COLLECTIONS
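The PCA description above, sketched for 2-D data with the standard library only: centre the data, build the covariance matrix, and find the first principal component (the direction of greatest variance) by power iteration. The data points are invented:

```python
# First principal component of 2-D data via power iteration on the
# sample covariance matrix.

import math

data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2),
        (3.1, 3.0), (2.3, 2.7), (2.0, 1.6), (1.0, 1.1)]

n = len(data)
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n
centred = [(x - mx, y - my) for x, y in data]

# sample covariance matrix [[cxx, cxy], [cxy, cyy]]
cxx = sum(x * x for x, _ in centred) / (n - 1)
cyy = sum(y * y for _, y in centred) / (n - 1)
cxy = sum(x * y for x, y in centred) / (n - 1)

# power iteration converges to the eigenvector with the largest
# eigenvalue, i.e. the direction of greatest variance
v = (1.0, 0.0)
for _ in range(100):
    w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
    norm = math.hypot(*w)
    v = (w[0] / norm, w[1] / norm)

print(f"first principal component ~ ({v[0]:.2f}, {v[1]:.2f})")
```

Projecting the centred points onto this direction gives the "first coordinate" of the slide's new coordinate system; repeating on the residual gives the second component, and so on.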
Inferring the Root Cause in Road Traffic Anomalies [IEEE Data Mining 2012]
Contribution: PCA for identifying traffic anomalies and thereby detecting problems in roads (description, modelling with Principal Component Analysis)
PROCESSING DATA COLLECTIONS