TRANSCRIPT
Genoveva Vargas-Solar
French Council of Scientific Research, LIG
Designing Experiments
Overview of data management & exploitation solutions
http://vargas-solar.com/data-centric-smart-everything/
Volume
Velocity
Variety
Veracity
Value
1000 Yottabytes = 1 Brontobyte
1000 Brontobytes = 1 Geopbyte
Computational Science
Digital humanities
Social Data Science
Network Science
Computation (Algorithm: mathematical model)
Experiment setting (Architecture: computing environment)
Big data variety: the right model according to the data
DATABASE ENVIRONMENTS LANDSCAPE
22/09/2019
Data collections rawness degree: Raw → Curated
← Increased scalability & speed | Increased versatility & complexity →
Key-value stores · Document stores · Extensible record stores · Graph databases · NewSQL / Relational databases
Typical operations: Look up (R/W) · Aggregation · Processing · Navigation · Querying · Analytics
DATA CURATION AT SCALE
Tuple: Row in a relational table, where attributes are pre-defined in a schema and the values are scalar
Document: Allows values to be nested documents or lists, as well as scalar values. Attributes are not defined in a global schema
Extensible record: Hybrid between tuple and document, where families of attributes are defined in a schema, but new attributes can be added on a per-record basis
DATA MODELS
Key-value: Systems that store values and an index to find them, based on a key
Document: Systems that store documents, providing index and simple query mechanisms
Extensible record: Systems that store extensible records that can be partitioned vertically and horizontally across nodes
Graph: Systems that store data modelled as graphs, where nodes can represent content modelled as document or key-value structures and arcs represent a relation between the data modelled by the nodes
Relational: Systems that store, index and query tuples
DATA STORES
"Simplest data stores": use a data model similar to the memcached distributed in-memory cache
- Single key-value index for all data
- Provide a persistence mechanism
- Replication, versioning, locking, transactions, sorting
- API: inserts, deletes, index lookups
- No secondary indices or keys
SYSTEM ADDRESS
Redis code.google.com/p/redis
Scalaris code.google.com/p/scalaris
Tokyo Cabinet tokyocabinet.sourceforge.net
Voldemort project-voldemort.com
Riak riak.basho.com
Membrain schoonerinfotech.com/products
Membase membase.com
KEY-VALUE STORES
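The API above (a single key index with inserts, deletes and lookups) can be sketched in a few lines of Python; the class and method names are illustrative and not taken from any of the listed systems:

```python
# Minimal sketch of a key-value store: one key index for all data,
# insert/delete/lookup, no secondary indices (as described above).

class KeyValueStore:
    def __init__(self):
        self._index = {}          # single key-value index

    def insert(self, key, value):
        self._index[key] = value

    def delete(self, key):
        self._index.pop(key, None)

    def lookup(self, key):
        return self._index.get(key)

store = KeyValueStore()
store.insert("user:42", {"name": "Ada"})
print(store.lookup("user:42"))   # -> {'name': 'Ada'}
store.delete("user:42")
print(store.lookup("user:42"))   # -> None
```

A real system adds the persistence, replication and versioning mechanisms listed above on top of this same tiny API.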
SELECT name FROM group WHERE gid IN (SELECT gid FROM group_member WHERE uid = me())
SELECT name, pic, profile_url FROM user WHERE uid = me()
SELECT name, pic FROM user WHERE online_presence = "active" AND uid IN (SELECT uid2 FROM friend WHERE uid1 = me())
SELECT name FROM friendlist WHERE owner = me()
SELECT message, attachment FROM stream WHERE source_id = me() AND type = 80
https://developers.facebook.com/docs/reference/fql/
- Support more complex data: pointerless objects, i.e., documents
- Secondary indexes (e.g., B-trees), multiple types of documents (objects) per database, nested documents and lists
- Automatic sharding (to scale writes), no explicit locks, weaker concurrency (eventual consistency, to scale reads) and atomicity properties
- API: select, delete, getAttributes, putAttributes on documents
- Queries can be distributed in parallel over multiple nodes using a map-reduce mechanism
SYSTEM ADDRESS
SimpleDB amazon.com/simpledb
CouchDB couchdb.apache.org
MongoDB mongodb.org
Terrastore code.google.com/terrastore
DOCUMENT STORES
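A hedged sketch of the ideas above: schema-less documents, one secondary index, and the simple putAttributes/getAttributes/select/delete API. All names are illustrative, not the actual SimpleDB or MongoDB API:

```python
# Toy document store: documents are nested dicts/lists, with a secondary
# index on one attribute so select() avoids a full scan.

class DocumentStore:
    def __init__(self, indexed_attr):
        self._docs = {}                 # _id -> document
        self._indexed_attr = indexed_attr
        self._secondary = {}            # attribute value -> set of _ids

    def putAttributes(self, _id, doc):
        self._docs[_id] = doc
        value = doc.get(self._indexed_attr)
        self._secondary.setdefault(value, set()).add(_id)

    def getAttributes(self, _id):
        return self._docs.get(_id)

    def select(self, value):
        # secondary-index lookup instead of scanning every document
        return [self._docs[i] for i in self._secondary.get(value, ())]

    def delete(self, _id):
        doc = self._docs.pop(_id, None)
        if doc is not None:
            self._secondary.get(doc.get(self._indexed_attr), set()).discard(_id)

store = DocumentStore(indexed_attr="type")
store.putAttributes("e1", {"type": "VehicleObstruction", "status": "active"})
store.putAttributes("e2", {"type": "Roadworks", "lanes": ["left"]})
print(store.select("Roadworks"))   # -> [{'type': 'Roadworks', 'lanes': ['left']}]
```

Note how, unlike the key-value sketch, documents have no global schema and values may themselves be lists or nested documents.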
- Basic data model is rows and columns
- Basic scalability model is splitting rows and columns over multiple nodes
- Rows are split across nodes through sharding on the primary key, split by range rather than by a hash function, so queries on ranges of values do not go to every node
- Rows are analogous to documents: variable number of attributes, attribute names must be unique; rows are grouped into collections (tables)
- Columns are distributed over multiple nodes using "column groups" (which columns are best stored together); column groups must be pre-defined with extensible record stores
SYSTEM ADDRESS
HBase hbase.apache.org
HyperTable hypertable.org
Cassandra incubator.apache.org/cassandra
EXTENSIBLE RECORD STORES
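The range-sharding idea above can be sketched as follows. The "nodes" here are just in-process dictionaries and the boundaries are invented, but the point survives: a range query visits only the shards whose key range overlaps it:

```python
# Range-based sharding sketch: rows are assigned to shards by ranges of
# the primary key (not by hash), as extensible record stores do.

import bisect

class RangeShardedTable:
    def __init__(self, boundaries):
        # boundaries [b0, b1, ...] split the key space into len+1 shards
        self._boundaries = boundaries
        self._shards = [dict() for _ in range(len(boundaries) + 1)]

    def put(self, key, row):
        self._shards[bisect.bisect_right(self._boundaries, key)][key] = row

    def range_query(self, lo, hi):
        # only shards whose range overlaps [lo, hi] are visited
        first = bisect.bisect_right(self._boundaries, lo)
        last = bisect.bisect_right(self._boundaries, hi)
        rows = []
        for shard in self._shards[first:last + 1]:
            rows += [r for k, r in sorted(shard.items()) if lo <= k <= hi]
        return rows

table = RangeShardedTable(boundaries=[100, 200])   # three "nodes"
for k in (5, 120, 150, 250):
    table.put(k, {"key": k})
print(table.range_query(100, 200))   # -> [{'key': 120}, {'key': 150}]
```

With hash sharding the same range query would have to contact every node, which is exactly the contrast the slide draws.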
- SQL: rich declarative query language
- Databases enforce referential integrity
- ACID semantics
- Well understood operations: configuration, care and feeding, backups, tuning, failure and recovery, performance characteristics
- Use small-scope operations (challenge: joins do not scale with sharding)
- Use small-scope transactions (ACID transactions are inefficient with communication and 2PC overhead)
- Shared-nothing architecture for scalability
- Avoid cross-node operations
SYSTEM ADDRESS
MySQL Cluster mysql.com/cluster
VoltDB voltdb.com
Clustrix clustrix.com
ScaleDB scaledb.com
ScaleBase scalebase.com
NimbusDB nimbusdb.com
SCALABLE RELATIONAL SYSTEMS
Theoretical & practical aspects (DBMS)
- Domains & relations: R ⊆ D1 × D2 × ... × Dn; relational algebra
- 1st-order predicate logic
- Languages: SQL (wins), QUEL, QBE
- DBMS prototypes (1975), products (1980)
- A major improvement in DB: provide data independence & a simple, tabular view of data
- Normal forms & dependencies (DB design, consistency)
- Controversial: missing values, duplicates
- More than 30 years: maturity!
Algebra operators: R × S (product), R ∪ S (union), R − S (difference), R[a] (projection), R : φ (selection), R * S (join)
1970 - 2000 RELATIONAL DB
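As an illustration only (not a DBMS), the algebra operators listed above can be mimicked with Python sets of tuples:

```python
# Relational algebra over sets of tuples. Attribute positions stand in
# for attribute names; R and S are tiny invented relations.

R = {(1, "a"), (2, "b")}
S = {(2, "b"), (3, "c")}

product    = {r + s for r in R for s in S}        # R × S
union      = R | S                                # R ∪ S
difference = R - S                                # R − S
projection = {(t[0],) for t in R}                 # R[a]: keep first attribute
selection  = {t for t in R if t[0] > 1}           # R : φ with φ = (a > 1)
# R * S: here both relations share the full schema, so the natural join
# degenerates to set intersection
join       = {r for r in R if r in S}

print(union)        # {(1, 'a'), (2, 'b'), (3, 'c')}
print(difference)   # {(1, 'a')}
print(join)         # {(2, 'b')}
```

The set semantics (no duplicates, no order) matches the pure relational model; SQL's bag semantics relaxes exactly this, which is one of the "controversial" points above.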
https://www.w3schools.com/sql/trysql.asp?filename=trysql_select_join_inner2
HANDS ON EXAMPLE
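The hands-on join can also be reproduced offline with Python's built-in sqlite3 module; the two tables and their contents below are invented stand-ins for the demo database:

```python
# INNER JOIN example, like the w3schools hands-on, run locally against
# an in-memory SQLite database with invented rows.

import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE Customers (CustomerID INTEGER, CustomerName TEXT)")
cur.execute("CREATE TABLE Orders (OrderID INTEGER, CustomerID INTEGER)")
cur.executemany("INSERT INTO Customers VALUES (?, ?)",
                [(1, "Alfreds"), (2, "Ana")])
cur.executemany("INSERT INTO Orders VALUES (?, ?)",
                [(10308, 2), (10309, 1)])

# only customer/order pairs with matching CustomerID survive the join
cur.execute("""
    SELECT Orders.OrderID, Customers.CustomerName
    FROM Orders
    INNER JOIN Customers ON Orders.CustomerID = Customers.CustomerID
    ORDER BY Orders.OrderID
""")
print(cur.fetchall())   # [(10308, 'Ana'), (10309, 'Alfreds')]
```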
{
  "geometry": {"type": "Point", "coordinates": [4.821773, 45.7513]},
  "full_location": "Autoroute du Soleil, 69005 Lyon",
  "_id": "Criter11185353",
  "properties": {
    "confidentiality": "noRestriction",
    "probability": "certain",
    "mobility": "",
    "creationtime": "2016-06-07 19:40:00",
    "publiceventtype": "",
    "networkmanagementtype": "",
    "observationtime": "2016-06-07 19:40:00",
    "last_update": "2016-06-07 19:43:30",
    "numberoflanesrestricted": "0",
    "effectonroadlayout": "",
    "creator": "CRITER",
    "id": "Criter11185353",
    "firstsupplierversiontime": "2016-06-07 19:40:00",
    "version": "1",
    "linkname": "",
    "type": "VehicleObstruction",
    "status": "active",
    "direction": "bothWays",
    "locationtype": "nonLinkedPoint",
    "disturbanceactivitytype": "",
    "last_update_fme": "2016-06-07 19:44:29",
    "endtime": "",
    "creationreference": "",
    "informationstatus": "real",
    "townname": "Voie Rapide Urbaine de Lyon",
    "publiccomment": "Bouchon, km 455|Voie Rapide Urbaine de Lyon",
    "roadmaintenancetype": "",
    "versiontime": "2016-06-07 19:43:26",
    "starttime": "2016-06-07 19:40:00",
    "gid": "39258",
    "abnormaltraffictype": ""
  }
}
Models: relational, key-value, column-oriented / tabular, document-oriented, graph
Raw data collections: tabular (CSV, Excel), media (XML, JSON, BLOB), graph
- How to transform data collections?
- Which is the best adapted model?
→ Polyglot persistence
Approaches deal with transformation rules inspired by the relational case
MODELING DATA COLLECTIONS
- External medium (tape or cards)
- Data dedicated to one application
- Data seen as flowing through a stationary processor
- Not much different from Hollerith's tabulator: "batch processing"
(Figure: Orders and a Master Tape flow through the Computer, producing an Invoice and an Updated Master)
SEQUENTIAL DATA PROCESSING
- The key property of disks: random access to stored data
- Records can contain pointers to other records
- Records can be indexed by their values and accessed directly using B-trees
(Pictured: Prof. Rudolf Bayer, Tech. Univ. Munich; Don Chamberlin)
DISKS MADE MODERN DATABASES POSSIBLE
Jim Gray
IDENTIFICATION: document
ENVIRONMENT: OS
DATA: Files/Records
PROCEDURE: code
COBOL divisions: AUTHOR, PROGRAM-ID, INSTALLATION, SOURCE-COMPUTER, OBJECT-COMPUTER, SPECIAL-NAMES, FILE-CONTROL, I-O-CONTROL, DATE-WRITTEN, DATE-COMPILED, SECURITY.
CONFIGURATION SECTION. INPUT-OUTPUT SECTION.
FILE SECTION. WORKING-STORAGE SECTION. LINKAGE SECTION. REPORT SECTION. SCREEN SECTION.
- CODASYL – DBTG (1967): COnference on DAta SYstems Languages, Data Base Task Group
- Defined a DDL for a network data model: set-relationship semantics, cursor verbs
- Isolated from procedures, no encapsulation; the DATA division is the schema's ancestor
(Figure: "us", the code, versus "them", the data)
CODE & DATA: SEPARATED AT BIRTH
- Many applications share data in common
- Data is managed by a centralized system
- Costs are shared
- Redundancy and inconsistency are minimized
- Control is improved
- The access language can be standardized
- Data becomes an enterprise resource
- Shared utilities: backup, recovery, replication, ...
(Figure: a Database Management System mediating access to schema & data)
INTEGRATED DATABASES (MID 60’S)
- Data processing: data analytics [Feldman et al. 2013], data aggregation [Jayram et al. 2007]
- Data collections storage: NoSQL, NewSQL (Hive), data sharding (MongoDB, CouchDB)
Stages: Harvesting & Cleaning → Preserving → Processing & Analysis → Exploiting
DATA COLLECTIONS LIFE CYCLE
Centralized repository containing virtually inexhaustible amounts of raw data to be analysed, fed by heterogeneous sources (e.g., CRM, sensor logs)
DATA LAKE
Repositories grow ever bigger and more complex, to the point that the lake becomes a swamp
DATA SWAMP
Step 1: Harvesting: collecting, vetting, selecting
Step 2: Extracting meta-data: describing, preserving
Step 3: Exploitation: grooming, provisioning
Data wrangling [Terrizzano et al. 2015]
DATA EXPLORATION WORKFLOW
- Quantitative meta-data (e.g., Compass for MongoDB)
- Structural meta-data (e.g., schemata like DTD, XML Schema)
- Semantic meta-data (linked data approaches): linked data collections by topic, geographic location, temporal releases
(Figure: an ontology linking similar, analysed collections such as Eco-Bike and Traffic)
EXTRACTING META-DATA
[Constance, SIGMOD ’16 ]
Harvesting → Extracting meta-data → Exploring; Describing; Preserving
Level 1: no meta-data (sensing & collecting raw data)
Level 2: partial meta-data (size, frequency, freshness, type)
Level 3: complete meta-data (quantitative & qualitative information + semantics)
[Stonebraker et al. 2015]
EXTRACTING META-DATA
exploreCollections( ) over an ontology
Constance: semantic meta-data discovery, structural meta-data discovery
- Which are the available green transport modes in the city?
- Which attributes are used for modelling transport modes?
- Are there data collections dealing with green transport?
Ontology fragment: Transport → Green transport → Cycling road, Electric car
FISHING DATA IN THE LAKE
[Constance, SIGMOD ’16 ]
Ontology fragment: Green transport (Eco bikes, Electric buses, Buses), Transport for rental (Car rental), Downtown Lyon
Example annotation: data regarding traffic downtown, by date; traffic data collected in January 2018
EXTRACTING META-DATA
[Constance, SIGMOD ’16 ]
Harvesting → Extracting meta-data → Exploring; Describing; Preserving
- ETL on parallel data processing platforms: parallel data collection (Flink, Storm, Flume); Spark (RDDs: tables/graphs); Hadoop ecosystem tools (e.g., Pig)
- Structured data provision: NoSQL & NewSQL (parallel)
- Parallel data querying & analytics: Spark (descriptive statistics functions); Hadoop ecosystem tools (e.g., Hive); parallel RDBMS; big data analytics stacks (AsterixDB, BDAS); parallel analytics (Matlab, R)
PREPARING DATA COLLECTIONS
DATA SCIENCE TOOLBOXES
Classical data management functions:
- Data loading
- In-memory/cache/disk indexing
- Data persistence
- Query optimization
- Concurrent access
- Consistency and access control
These functions must be revisited under weaker hypotheses to support the enactment of data science pipelines
DATA CONSUMPTION PHILOSOPHIES
ENACTMENT ENVIRONMENTS
Big Data platforms & stacks, WIDE environments, machine learning services
WHICH TOOLS FOR DATA SCIENCE
Programming language:
- Python is one of the most flexible programming languages because it can be seen as a multi-paradigm language
- Alternatives are MATLAB and R
Fundamental libraries for data scientists in Python: NumPy, SciPy, Pandas and Scikit-learn
- NumPy: provides support for multidimensional arrays, with basic operations on them and useful linear algebra functions
- SciPy: provides a collection of numerical algorithms and domain-specific toolboxes, including signal processing, optimization and statistics
- Matplotlib: tools for data visualization
- Scikit-learn: machine learning library with tools such as classification, regression, clustering, dimensionality reduction, model selection and preprocessing
- Pandas: provides high-performance data structures and data analysis tools for data manipulation with integrated indexing
DATA SCIENCE ECOSYSTEM & INTEGRATED DEVELOPMENT ENVIRONMENT
To get started on solving data-oriented problems, we need to set up our programming environment:
- Decide the programming language version, and whether to install the data science ecosystem as individual toolboxes or as a bundle installation
- For example, Anaconda Python provides integration of all the Python toolboxes and applications for data scientists in a single directory
The integrated development environment (IDE) is an essential tool designed to maximize programmer productivity:
- The basic pieces of any IDE are three: the editor, the compiler (or interpreter) and the debugger
- Examples: PyCharm, WingIDE, Spyder (Scientific Python Development EnviRonment)
WEB INTEGRATED DEVELOPMENT ENVIRONMENT
Web-based IDEs were developed so that not only your code but also your whole environment and executions can be stored on a server:
- The server can be set up in a centre, such as a university or school
- Students can work on their homework either in the classroom or at home
- Students can execute all the previous steps over and over again, change some particular code cell (a segment of the document that may contain source code that can be executed) and execute the operation again
For example, IPython has been issued as a browser version of its interactive console: Jupyter
- Markdown (a wiki text language) cells can be added to introduce algorithms
- It is also possible to insert Matplotlib graphics to illustrate examples, or even web pages
- Experiments can become completely replicable
DATA SCIENCE VIRTUAL MACHINE
COMPUTING CAPACITY
Genoveva Vargas-Solar
French Council of Scientific Research, LIG
Exploring data collections
Descriptive statistics
http://vargas-solar.com/data-centric-smart-everything/
Thanks to Prof. J.L. Zechinelli Martini, UDLAP-LAFMIA, Mexico for our collaborative construction of these slides
* This presentation was created using the content of https://www.python.org/
Descriptive statistics helps to simplify large amounts of data in a sensible way: a simple way to describe the data, presenting quantitative descriptions in a manageable form
Main steps:
- Data preparation: generate statistically valid descriptions
- Descriptive statistics: generate different statistics to describe and summarize the data concisely, and evaluate different ways to visualize them
DESCRIPTIVE STATISTICS
PREPARING DATA
- Obtaining the data: read from a file or obtained by scraping the web
- Parsing the data: format the data, which can be in plain text, fixed columns, CSV, XML, HTML, etc.
- Cleaning the data: a simple strategy is to remove or ignore incomplete records
- Building data structures: a data structure that lends itself to the analysis we are interested in; databases provide a mapping from keys to values, so they serve as dictionaries
Financial parameters related to the US population*
- Features: age, sex, marital status, country, income, education, occupation, capital gain, etc.
- Question: are men more likely to become high-income professionals than women, i.e., to receive an income of over $50,000 per year?
- Preparing data collections:
  - Read and check the data
  - Represent the data, for instance using a tabular data structure with features (columns) and records (rows)
  - Group the data
* UCI's Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Adult
ANALYSING INCOME ACCORDING TO GENDER
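A minimal sketch of the grouping step, on a tiny invented sample shaped like the UCI Adult records (the real file has tens of thousands of rows and many more columns):

```python
# Group records by sex and compute the share of high-income records per
# group. The five records below are invented stand-ins for the dataset.

from collections import Counter

records = [
    {"sex": "Male",   "income": ">50K"},
    {"sex": "Male",   "income": "<=50K"},
    {"sex": "Female", "income": "<=50K"},
    {"sex": "Female", "income": ">50K"},
    {"sex": "Male",   "income": ">50K"},
]

totals = Counter(r["sex"] for r in records)                        # group sizes
high = Counter(r["sex"] for r in records if r["income"] == ">50K")  # >50K per group

for sex in totals:
    print(sex, round(high[sex] / totals[sex], 2))
```

On the real data the same two-line grouping (a Pandas groupby in practice) answers the slide's question directly.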
Measurements and categories represent a sample distribution of a variable, which approximately represents the population distribution of that variable; this lets us make tentative assumptions about the population distribution
Different techniques:
Summarizing the data
Data distributions
Outlier treatment
Kernel density
EXPLORATORY DATA ANALYSIS
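A standard-library sketch of the summarizing and outlier-treatment techniques listed above; the sample values are invented, and the 1.5 × IQR rule is one common convention, not the only one:

```python
# Summary statistics plus a simple IQR-based outlier rule.

import statistics

values = [12, 13, 13, 14, 15, 15, 16, 16, 17, 95]   # invented sample

mean = statistics.mean(values)
median = statistics.median(values)
stdev = statistics.stdev(values)

q1, _, q3 = statistics.quantiles(values, n=4)        # quartiles
iqr = q3 - q1
# flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

print(f"mean={mean:.1f} median={median} stdev={stdev:.1f}")
print("outliers:", outliers)   # [95]
```

Note how the mean is dragged upward by the outlier while the median is not, which is why both are reported.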
https://www.kaggle.com/robikscube/hourly-energy-consumption
DATA ANALYSIS TECHNIQUES
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Given lots of data, discover patterns and models that are:
- Valid: hold on new data with some certainty
- Useful: it should be possible to act on the item
- Unexpected: non-obvious to the system
- Understandable: humans should be able to interpret the pattern
PRINCIPLE
Based on three guiding principles:
- Decision backwards
- Step by step
- Test and learn
Predictive & optimization models; organizational transformation; Big Data
CAPTURING VALUE FROM ADVANCED ANALYTICS
STATISTICS
Concepts:
- Population: collection of objects or items ("units") about which information is sought
- Sample: a part of the population that is observed
Descriptive statistics: simplifies large amounts of data in a sensible way, presenting quantitative descriptions in a manageable form; it is a way to describe data
- Applies the concepts, measures, and terms that are used to describe the basic features of the samples in a study
- These procedures are essential to provide summaries about the samples as an approximation of the population
- Together with simple graphics, they form the basis of every quantitative analysis of data
Inferential statistics: draws conclusions beyond the analysed data; reaches conclusions regarding the hypotheses made; aims at inferring characteristics of the "population" of the data
DATA FOR STATISTICS
To describe the sample data and to infer any conclusion, data preparation is required to generate statistically valid descriptions & conclusions:
1. Harvesting: from a file, or obtained from sources (archives, data stores, sensors)
2. Parsing: depends on what format the data are in, for example plain text, fixed columns, CSV, XML, HTML, etc.
3. Cleaning: survey responses and other data files are almost always incomplete; sometimes there are multiple codes for things such as "not asked", "did not know" and "declined to answer", and there are almost always errors
4. Building data structures: store data in a data structure that lends itself to the analysis we are interested in. If the data fit into memory, building a data structure is usually the way to go; if not, a database is usually built, which is an out-of-memory data structure. Most databases provide a mapping from keys to values, so they serve as dictionaries
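The four steps can be sketched with the standard library on an invented CSV snippet; the final dictionary mirrors the "database as a mapping from keys to values" remark:

```python
# Harvest + parse + clean + build, end to end on a toy CSV.

import csv
import io

raw = """id,age,income
1,39,<=50K
2,,>50K
3,50,>50K
"""

# 1-2. Harvest and parse (here the "file" is an in-memory string)
rows = list(csv.DictReader(io.StringIO(raw)))

# 3. Clean: the simple strategy above, drop records with missing fields
clean = [r for r in rows if all(v.strip() for v in r.values())]

# 4. Build a data structure keyed for the analysis
by_id = {r["id"]: r for r in clean}

print(sorted(by_id))   # ['1', '3']  (record 2 dropped: missing age)
```

For data that do not fit in memory, the same `by_id` mapping would live in an out-of-memory store instead of a dict, as the slide notes.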
(Figure: Venn diagram placing Data Mining at the intersection of Computer Science and Artificial Intelligence, alongside curated knowledge, machine learning, and reverse-engineering the brain)
AI & DATA MINING
Data mining overlaps with:
- Databases: large-scale data, simple queries
- Machine learning: small data, complex models
- CS theory: (randomized) algorithms
Different cultures:
- To a DB person, data mining is an extreme form of analytic processing: queries that examine large amounts of data; the result is the query answer
- To an ML person, data mining is the inference of models; the result is the parameters of the model
In this class we will do both!
(Figure: Data Mining at the intersection of Machine Learning, CS Theory and Database systems)
DATA MINING CULTURES
Description:
- Clustering (Bayesian, hierarchical, k-means, CLARA, PAM)
- Association
- Modelling (LU, QR, PCA=SVD, PARAFAC)
Prediction:
- Classification (neural network, PLSDA, KNN, decision trees)
- Trend prediction
- Regression (PLSR, PCR)
Example questions:
- How many accidents are reported per day?
- What percentage of available bicycles are used downtown?
- Which are the traffic bottleneck regions in the city?
- Is the number of car accidents related to seasons?
- How will the use of bicycles evolve downtown during the summers of the next 5 years?
- Which types of cars have the most accidents on high-speed roads?
- Will increasing the parking cost reduce car traffic in the city and increase the number of people using public transport?
PROCESSING DATA COLLECTIONS
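A minimal 1-D k-means (Lloyd's algorithm) sketch for the clustering techniques named above; the data, initial centroids and k are invented for illustration:

```python
# Lloyd's algorithm in one dimension: alternate assignment (each point
# joins its nearest centroid) and update (centroids move to cluster means).

def kmeans_1d(points, centroids, iterations=10):
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.2, 0.8, 9.8, 10.1, 10.4]       # two obvious groups
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids)   # roughly [1.0, 10.1]
```

The production variants on the slide (CLARA, PAM) refine exactly this loop: they change how representatives are chosen, not the assign/update alternation.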
Spatial analysis of dynamic movements of Vélo'v, Lyon's shared bicycle program [ECCS 2009]
Contribution: PCA and k-means for predicting the usage trend of Vélo'v in Lyon (description, clustering)
PCA: an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some projection of the data comes to lie on the first coordinate (the first principal component), the second greatest variance on the second coordinate, and so on
PROCESSING DATA COLLECTIONS
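The PCA description above, sketched for 2-D data with the standard library only: centre the data, build the covariance matrix, and find the first principal component (the direction of greatest variance) by power iteration. The data points are invented:

```python
# First principal component of 2-D data via power iteration on the
# sample covariance matrix.

import math

data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2),
        (3.1, 3.0), (2.3, 2.7), (2.0, 1.6), (1.0, 1.1)]

n = len(data)
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n
centred = [(x - mx, y - my) for x, y in data]

# sample covariance matrix [[cxx, cxy], [cxy, cyy]]
cxx = sum(x * x for x, _ in centred) / (n - 1)
cyy = sum(y * y for _, y in centred) / (n - 1)
cxy = sum(x * y for x, y in centred) / (n - 1)

# power iteration converges to the eigenvector with the largest
# eigenvalue, i.e. the direction of greatest variance
v = (1.0, 0.0)
for _ in range(100):
    w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
    norm = math.hypot(*w)
    v = (w[0] / norm, w[1] / norm)

print(f"first principal component ~ ({v[0]:.2f}, {v[1]:.2f})")
```

Projecting the centred points onto this direction gives the "first coordinate" of the slide's new coordinate system; repeating on the residual gives the second component, and so on.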
Inferring the Root Cause in Road Traffic Anomalies [IEEE Data Mining 2012]
Contribution: PCA for identifying traffic anomalies and thereby detecting problems in roads (description, modelling with Principal Component Analysis)
PROCESSING DATA COLLECTIONS