experimental methods and techniques in ...home.deib.polimi.it/schiaffo/cs/experimental methods...

EXPERIMENTAL METHODS AND TECHNIQUES IN ENGINEERING

The Data Management Perspective

Fabio A. Schreiber

Politecnico di Milano

Dipartimento di Elettronica, Informazione e Bioingegneria

THE DATA MANAGEMENT PERSPECTIVE

Experiments on Databases and DBMSs

Data organization and management as a service to the experiments of the scientific community

Experimenting with the Database content itself

F. A. Schreiber Experimental Methods ... Data Perspective


EXPERIMENTS & DATA MANAGEMENT

Experiments for optimizing data structures and management
  Database Management System (DBMS)
  Data Structures
    Conceptual/Logical Schema optimization and evolution
    Physical structures design

Data organization and management for collecting experimental results (e-Science)

Exploring Database content (data mining)

Assessing Data Quality


EXPERIMENTS & DATA MANAGEMENT


Goals
  Systems Performance Evaluation and Tuning
    How well does a system perform?
    How can its performance be improved?
  Benchmarking
    Comparison among different systems under a similar workload
  System Effectiveness
    How well a system conforms to the user’s needs w.r.t. a defined metric

WORKLOAD AND FACTORS


Synthetic vs. Real
  A synthetic workload allows for controlled experiment repeatability. Useful in systems comparison
  A real workload can be highly variable and can be used in assessing the overall performance of a single system in its real environment

Single-user (to test specific algorithms) vs. Multi-user (to test system procedures)
  Multiprogramming level
  Query mix
  Degree of data sharing (buffer and cache sizes)

FACTORS IN DBMS PERFORMANCE EVALUATION (Boral & DeWitt 84)[2]


Multiprogramming level (MPL)
  Number of concurrent queries in any phase of execution
  Use precompiled queries and minimize the data volume of the results in order to expose, as far as possible, the true «execution» time

FACTORS IN DBMS PERFORMANCE EVALUATION


Degree of Data Sharing (DDS)
  Concurrent access affects both data pages (rare) and index pages (frequent)
  Expressed as a percentage of the multiprogramming level:
    0%: each query references only its own partition
    100%: all queries reference the same partition
    0% < DDS < 100%:
      Queries randomly distributed among partitions
      Application programs uniformly distributed among partitions
  The DDS affects the buffer page replacement algorithm
    MRU best for relational operators
    LRU best for replacement of shared data pages

FACTORS IN DBMS PERFORMANCE EVALUATION


Query Mix Selection (multiuser) (Boral & DeWitt 84)[2]

Consumed resources
  CPU cycles: actual query execution, access path selection, buffer pool management, OS disk operations
  Disk bandwidth: get/store data, page swapping

Query type | CPU            | Disk        | Query example
I          | Low (0.18 s)   | Low (2-3)   | Select one tuple from 10000 using a clustered index
II         | Low (0.90 s)   | High (91)   | Select 100 tuples from 10000 using a non-clustered index
III        | High (18.96 s) | Low (206)   | Join 10000 tuples with 1000 tuples using a clustered index on the join attribute of the first relation
IV         | High (35.62 s) | High (1008) | Aggregate function on a 10000-tuple relation (100 partitions)

METRIC (Schwartz 11) [15]


Measured entities
  Observation interval (OI)
  Number of queries in the observation interval (NQ)
  Busy time (BT): total time during which queries are in the system
  Weighted time (WT): total execution time of the queries

Derived variables
  Throughput: NQ/OI
  Execution time: WT/NQ
  Concurrency: WT/OI
  Utilization: BT/OI
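As a minimal sketch, the four derived variables can be computed directly from the measured entities. The function name and the sample measurements below are illustrative assumptions, not values from the cited source.

```python
def derived_metrics(oi, nq, bt, wt):
    """Derive the four metrics from the measured entities.

    oi: observation interval (s); nq: number of queries in the interval;
    bt: busy time (s); wt: weighted time, i.e. total execution time (s).
    """
    if oi <= 0 or nq <= 0:
        raise ValueError("oi and nq must be positive")
    return {
        "throughput": nq / oi,      # queries per second
        "execution_time": wt / nq,  # average seconds per query
        "concurrency": wt / oi,     # average number of queries in execution
        "utilization": bt / oi,     # fraction of the interval the system was busy
    }


# Hypothetical measurements: 10 s interval, 50 queries, 8 s busy, 20 s of execution.
metrics = derived_metrics(oi=10.0, nq=50, bt=8.0, wt=20.0)
print(metrics)
```

Note that concurrency can exceed 1 (queries overlap), whereas utilization cannot.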

METRIC


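$S_{i,j}$, $E_{i,j}$: starting and ending times of the $j$-th query of the $i$-th concurrent program, $1 \le i \le MPL$, $1 \le j \le N$

$T_{last\text{-}to\text{-}start} = \max\{S_{i,1},\ 1 \le i \le MPL\}$; $T_{first\text{-}to\text{-}finish} = \min\{E_{i,N},\ 1 \le i \le MPL\}$

NQ: number of totally executed queries

System throughput: $NQ / (T_{first\text{-}to\text{-}finish} - T_{last\text{-}to\text{-}start})$
Average response time: $\left(\sum_{q=1}^{NQ} exec\_time_q\right) / NQ$

[Figure: the MPL concurrent programs along the time axis t, with T_last-to-start, T_first-to-finish and the number of totally executed queries NQ marked]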
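The measurement window can be sketched as follows. This is an illustration under one stated assumption: a query counts as "totally executed" when it both starts and ends inside the window between T_last-to-start and T_first-to-finish. The function name and the sample timings are hypothetical.

```python
def window_metrics(start, end):
    """start[i][j], end[i][j]: start/end time of the j-th query of program i."""
    t_last_to_start = max(s[0] for s in start)   # last program to issue its first query
    t_first_to_finish = min(e[-1] for e in end)  # first program to complete its last query
    if t_first_to_finish <= t_last_to_start:
        raise ValueError("empty measurement window")
    # Assumption: only queries lying entirely inside the window are counted.
    exec_times = [
        e - s
        for prog_start, prog_end in zip(start, end)
        for s, e in zip(prog_start, prog_end)
        if s >= t_last_to_start and e <= t_first_to_finish
    ]
    nq = len(exec_times)
    throughput = nq / (t_first_to_finish - t_last_to_start)
    avg_response = sum(exec_times) / nq if nq else float("nan")
    return nq, throughput, avg_response


# Hypothetical run with MPL = 2 programs and N = 3 queries each.
start = [[0, 2, 4], [1, 3, 5]]
end = [[2, 4, 6], [3, 5, 8]]
nq, throughput, avg_response = window_metrics(start, end)
print(nq, throughput, avg_response)
```

Restricting the measurement to this window keeps only the interval in which all MPL programs are active, so start-up and wind-down phases do not distort the throughput.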

EXPERIMENTAL MODALITY


Simulation vs. real-life testbeds
  Simulation often provides imprecise results, owing to the many system parameters which are not accounted for by the simulation programs
  Testbeds with a very large number of components are very difficult, if not impossible, to organize
  Use testbeds to tune and calibrate simulation programs?

Repeatability is essential for the credibility of the experiment (Manolescu & Al. 08) [8]

A DBMS QUEUING MODEL


[Figure: DBMS queuing model. Blocks shown: users; transaction request; priority assignment; concurrency control (request/release of a data object, with block/wait); DB operation; buffer access (hit, or miss followed by disk access); computation (CPU); commit or abort; restart, resubmit, terminate]

SCALABILITY


Systems COMPLEXITY
  The memory and time behaviour of algorithms cannot be inferred by testing systems composed of only a handful of nodes: constants matter!

[Figure: growth of O(log n), O(n), O(n^2) and O(2^n) as a function of n]
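A small worked example of why constants matter: with a hypothetical cost model (the constants below are assumptions chosen only for illustration), an O(n^2) algorithm looks better than an O(n) one on a small testbed, and the ranking flips only at scale.

```python
def cost_linear(n, c=1000.0):
    """Hypothetical O(n) algorithm with a large constant factor."""
    return c * n


def cost_quadratic(n, c=1.0):
    """Hypothetical O(n^2) algorithm with a small constant factor."""
    return c * n * n


def cheaper(n):
    """Name of the algorithm with the lower modelled cost for n nodes."""
    return "quadratic" if cost_quadratic(n) < cost_linear(n) else "linear"


for n in (10, 100, 1_000, 100_000):
    print(n, cost_linear(n), cost_quadratic(n), cheaper(n))
```

With these constants the crossover sits at n = 1000, so a test on ten nodes would suggest the wrong conclusion about behaviour at a hundred thousand.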

DATABASE SYSTEMS BENCHMARKS


Useful for comparing different DBMSs
Systems must be fully installed and operational
They rely on the effectiveness of synthetic workloads
Benchmarking is an experimental activity which requires three steps (as usual):
  Design
  Execution
  Analysis

The good thing about standards is that there are so many of them

Unknown

… and each one has so many options that can be chosen at will …



Transaction Processing Performance Council (TPC) [16]
  TPC-C: for OLTP systems. It simulates a multi-user environment making concurrent queries to a central database. Suited for on-line handling of orders and for managing inventories.
  TPC-E: similar to TPC-C, but with transactions designed for brokerage environments such as on-line trading, market research, account inquiries, …
  TPC-H: tuned for Decision Support Systems, complex data mining queries, concurrent data modifications.

Other benchmarks for specific products (Oracle, MySQL, …) or functionalities (security, web servers, …)

DATA ORGANIZATION AND MANAGEMENT FOR COLLECTING EXPERIMENTAL RESULTS (e-Science) (Vanschoren & Blockeel 10) [14]


Experimental data collection
  Create searchable, community-wide repositories to automatically publish experimental results on-line
  A formal experiment description language to import a large number of experiments and make them immediately available to everyone
  Ontologies providing a controlled vocabulary clearly describing the interpretation of each concept

Generate a collaborative approach to experimentation
  Experiments freely shared
  Linked together
  Reused by querying and data mining

e-SCIENCES


Computationally intensive sciences, which use the internet as a global collaborative workspace
  Bioinformatics
    Microarrays (Stoeckert & Al. 02) [12]
    Proteomics (Masseroli 07) [10]
  Astronomy
    Virtual observatories (Szalay & Gray 01) [13]
  Physics
    High energy nuclear physics (Brown & Al. 07) [3]

e-SCIENCES


e-Science applications, like other Web Information Systems, share the collaborative and distributed nature of their development and content management (Curino & Al. 08) [5,6]

Evolution in time
  DB migration
    while preserving the past contents of the DB and the history of its schema
  Applications maintenance
    while allowing legacy applications to access new contents through old schema versions

DATA MODELS AND STRUCTURES MAINTENANCE (Marche 93) [9], (Sjoberg 93) [11], (Curino & Al. 08) [5,6]


Conceptual/Logical level
  Schema evolution
  Restructuring
  Optimization
  …

Observational study (analogous to those of the natural sciences) made on the evolution of the Wikipedia database schema
Goal:
  Create a benchmark for schema evolution (and in general a standard relational DB dataset).
  Extend the analysis to several other Open-Source WIS (Joomla!, TikiWiki, Slashcode, Zen-Cart, Wordpress)
  Extend the analysis towards Public Scientific DBs (Genome, HGVS)

EXPERIMENTS ON SCHEMA EVOLUTION: the Wikipedia case (Curino & Al. 08) [5,6]


• Schema Evolution:
  • 170+ versions in 4.5 years
  • almost 250% increase
• WIS evolve faster than Traditional IS
  • 38% w.r.t. [Sjoberg93]
  • 539% w.r.t. [Marche93]

EXPERIMENTS ON SCHEMA EVOLUTION: the Wikipedia case


[Figure: previous queries run on the new schema versions; a major restructuring is marked]

EXPERIMENTS ON SCHEMA EVOLUTION: the Wikipedia case


[Figure: new queries run on all previous schema versions; a major restructuring is marked]

DATA MODELS AND STRUCTURES MAINTENANCE (Babu & Al. 09)[1], (Davcev & Al. 08) [7]


Physical level
  Tables
    Sorting
    Clustering
    …
  Indexes
    Trees
    Hashing
    …
  Memory
    Shared buffers
    Cache size
    …

EXPLORING DATABASE CONTENT


ALL HUMAN KNOWLEDGE STARTS FROM INTUITIONS, PROCEEDS THROUGH CONCEPTS, AND REACHES ITS CLIMAX WITH IDEAS

I. Kant

KNOWLEDGE HIERARCHY


[Figure: knowledge hierarchy. Levels with examples: DATA (invoices), INFORMATION (sales trend), KNOWLEDGE (market rules), WISDOM (strategic decisions). Axes: elements (volume), variables, value added. Other labels: procedures, statistical processing, knowledge discovery, experience]

KNOWLEDGE DISCOVERY AND DATA MINING


Knowledge Discovery in Databases and Data Warehouses
  To identify the most significant information
  To show it to the user in the most convenient way

Data Mining
  Application of algorithms to raw data in order to extract knowledge (relations, patterns, …)
  Predictive aim (signal analysis, voice recognition, etc.)
  Descriptive aim (decision support systems, natural sciences)

WHAT KIND OF INFORMATION DO WE GET?


Associations
  Set of rules specifying the joint occurrence of two (or more) elements
Sequences
  Possibility of stating temporal sequences of events
Classifications
  Grouping of elements into classes following a given model
Clusters
  Grouping of elements into classes which have not been defined a priori
Trends
  Discovery of peculiar temporal patterns having a forecasting value

KNOWLEDGE DISCOVERY PROCESS (1)


Even if specialized tools are available, it requires
  Competence in the techniques used
  A very good knowledge of the application domain

Sequential steps
  Selection
    Choice of the sample data the analysis shall be focused on
  Preprocessing
    Data sampling in order to reduce their volume
    Data scrubbing for errors and omissions
  Transformation
    Homogenization and/or conversion of data types
  Data mining
    Choice of the method/algorithm
  Interpretation and evaluation
    Filtering of the retrieved information
    Possible refinement by repeating previous steps
    Visual presentation of the search results (graphical or logical)

[Figure: the knowledge discovery process. RAW DATA → (selection) → TARGET DATA → (pre-processing) → PRE-PROCESSED DATA → (transformation) → TRANSFORMED DATA → (data mining) → CORRELATIONS AND PATHS → (interpretation) → KNOWLEDGE. Source: G. Piatetsky-Shapiro 1996]

DATA MINING ALGORITHMS


Model representation
  Formalisms to represent and describe possible patterns

Model evaluation
  Statistical or logical estimate of the correspondence of a pattern to the search criteria

Search method
  Of parameters
    Search for the parameters which optimize the evaluation criteria, given the set of observations and the model representation
  Of model
    The parameters are applied to models belonging to the same family, differentiated by the representation type, for quality evaluation

THE “MARKET BASKET” MODEL


The best-known model to which data mining techniques are applied
Mainly, but not exclusively, used for retail sale problems
The goal is to discover recurrent patterns in data (association rules)

THE “MARKET BASKET” MODEL


$I = \{i_1, \dots, i_k\}$: SET OF $k$ ELEMENTS (ITEMS)
$B = \{b_1, \dots, b_n\}$: SET OF $n$ SUBSETS (BASKETS) OF $I$, $b_i \subseteq I$

Examples
  I: goods in a supermarket; B: a customer’s purchase
  I: words in a dictionary; B: a document in a corpus

ASSOCIATION RULE $i_1 \Rightarrow i_2$
  SUPPORT: $i_1$ AND $i_2$ SHOW TOGETHER IN AT LEAST s% OF THE n BASKETS
  CONFIDENCE: OF ALL THE BASKETS CONTAINING $i_1$, AT LEAST c% ALSO CONTAIN $i_2$
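Support and confidence as defined above can be sketched in a few lines. The basket contents and function names are hypothetical, chosen only to make the two ratios easy to check by hand.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(baskets, lhs, rhs):
    """Among the baskets containing `lhs`, the fraction that also contain `rhs`."""
    lhs, rhs = set(lhs), set(rhs)
    with_lhs = [basket for basket in baskets if lhs <= basket]
    if not with_lhs:
        raise ValueError("no basket contains the left-hand side")
    return sum(rhs <= basket for basket in with_lhs) / len(with_lhs)


# Hypothetical baskets (n = 4) over the item set I = {bread, milk, butter}.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
s = support(baskets, {"bread", "milk"})         # 2 of 4 baskets
c = confidence(baskets, {"bread"}, {"milk"})    # 2 of the 3 baskets with bread
print(s, c)
```

A rule bread ⇒ milk would be accepted only if both values reach the chosen thresholds s% and c%.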

THE “MARKET BASKET” MODEL: ANY PROBLEM?


c: COFFEE IS IN THE BASKET; ¬c: NO COFFEE IN THE BASKET
t: TEA IS IN THE BASKET; ¬t: NO TEA IN THE BASKET

          | c  | ¬c | Σ rows
t         | 20 | 5  | 25
¬t        | 70 | 5  | 75
Σ columns | 90 | 10 | 100

t ⇒ c IS TRUE???
  s = 20%
  c = P[t ∧ c] / P[t] = 20/25 = 80%

PERHAPS, BUT ... THOSE WHO BUY COFFEE ANYHOW REACH 90% !!!

WARNING! A CORRELATION EXISTS BETWEEN TEA AND COFFEE
  r = P[t ∧ c] / (P[t] × P[c]) = 0.20 / (0.25 × 0.90) = 0.89 (r < 1: negative correlation)
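The arithmetic of the tea and coffee example can be reproduced from the four cell counts of the contingency table. The variable names are illustrative; the counts are those of the slide.

```python
# Cell counts of the contingency table (100 baskets in total).
tea_coffee, tea_no_coffee = 20, 5
no_tea_coffee, no_tea_no_coffee = 70, 5
n = tea_coffee + tea_no_coffee + no_tea_coffee + no_tea_no_coffee

p_t = (tea_coffee + tea_no_coffee) / n   # P[t]
p_c = (tea_coffee + no_tea_coffee) / n   # P[c]
p_tc = tea_coffee / n                    # P[t and c]

support = p_tc                 # rule t => c holds in 20% of all baskets
confidence = p_tc / p_t        # 80% of tea buyers also buy coffee
r = p_tc / (p_t * p_c)         # ratio to what independence would predict

print(f"support={support:.2f} confidence={confidence:.2f} r={r:.2f}")
# prints: support=0.20 confidence=0.80 r=0.89
```

Since r is below 1, buying tea makes buying coffee slightly less likely than the 90% baseline, even though the confidence of the rule looks high.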

CLASSIFICATION PROBLEM : AN EXAMPLE


Training set

AGE | CAR TYPE | RISK
17  | sports   | high
43  | family   | low
68  | family   | low
32  | truck    | low
23  | family   | high
18  | family   | high
20  | family   | high
45  | sports   | high
50  | truck    | low
64  | truck    | high
46  | family   | low
40  | family   | low

MINE (rules extracted from the training set)
  1. IF Age ≤ 23 THEN Risk IS High;
  2. IF CarType = sports THEN Risk IS High;
  3. IF CarType IN {family, truck} AND Age > 23 THEN Risk IS Low;
  4. DEFAULT Risk IS Low

TEST / CLASSIFICATION (rules applied to the test set)

AGE | CAR TYPE | CLASS
22  | family   | high
60  | family   | low
35  | sports   | high
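The four mined rules can be applied mechanically, in order, to the test set. The sketch below encodes them as a plain function; the function and variable names are illustrative assumptions, while the records are those of the slide.

```python
def classify(age, car_type):
    """Apply the four mined rules in order; the first matching rule wins."""
    if age <= 23:                                          # rule 1
        return "high"
    if car_type == "sports":                               # rule 2
        return "high"
    if car_type in {"family", "truck"} and age > 23:       # rule 3
        return "low"
    return "low"                                           # rule 4 (default)


training = [
    (17, "sports", "high"), (43, "family", "low"), (68, "family", "low"),
    (32, "truck", "low"), (23, "family", "high"), (18, "family", "high"),
    (20, "family", "high"), (45, "sports", "high"), (50, "truck", "low"),
    (64, "truck", "high"), (46, "family", "low"), (40, "family", "low"),
]
test_set = [(22, "family"), (60, "family"), (35, "sports")]

predicted = [classify(age, car) for age, car in test_set]
correct_on_training = sum(classify(a, c) == risk for a, c, risk in training)
print(predicted, correct_on_training, "of", len(training))
```

On the training set the rules reproduce 11 of the 12 labels; the record (64, truck, high) is the one they do not cover.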

EFFECTIVENESS


No established results on metrics and methodologies
Application dependent
Context
  Physical
  Sociological
User psychology

DATA QUALITY DIMENSIONS (Cappiello & Schreiber 12) [4]

ACCURACY
  the degree of conformity of a measured or computed quantity to its actual (true) value
  $|v_{avg} - v_{ref}| < \varepsilon_{acc}$

PRECISION
  the degree to which repeated measurements show the same or similar results (small variance)
  $\frac{1}{N}\sum_{n=1}^{N} (v_n - \mu)^2 < \varepsilon_{prec}$

TIMELINESS
  CURRENCY: the time interval from the instant the value was sampled to the instant at which it is sent to the base station
  VOLATILITY: the amount of time during which data remain valid
  $\mathrm{Timeliness} = \max\left(1 - \frac{\mathrm{Currency}}{\mathrm{Volatility}},\ 0\right)^{s}$
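The three dimensions translate into short checks. This is a sketch under assumptions: the function names and sample readings are hypothetical, and the trailing "s" of the timeliness formula is read as an exponent and given a default of 1.

```python
from statistics import fmean, pvariance


def is_accurate(values, v_ref, eps_acc):
    """Accuracy: |v_avg - v_ref| < eps_acc."""
    return abs(fmean(values) - v_ref) < eps_acc


def is_precise(values, eps_prec):
    """Precision: population variance (1/N) * sum((v_n - mu)^2) < eps_prec."""
    return pvariance(values) < eps_prec


def timeliness(currency, volatility, s=1.0):
    """Timeliness = max(1 - currency/volatility, 0) ** s (s = 1 is an assumption)."""
    return max(1.0 - currency / volatility, 0.0) ** s


# Hypothetical readings around an expected value of 10.0.
readings = [10.1, 9.9, 10.0, 10.2]
print(is_accurate(readings, v_ref=10.0, eps_acc=0.1))   # mean is 10.05
print(is_precise(readings, eps_prec=0.02))              # variance is 0.0125
print(timeliness(currency=2.0, volatility=10.0))        # still mostly valid
print(timeliness(currency=12.0, volatility=10.0))       # older than its validity
```

Timeliness drops to zero as soon as the currency exceeds the volatility, that is, once the value has outlived its validity.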

BASIC PRINCIPLES OF A PROPOSED AGGREGATION ALGORITHM

Accuracy is represented by the window height
Values falling within the window can be considered similar enough to be fairly represented by their average
Values falling outside the window are outliers
Outliers can be occasional or consecutive: in any case, outlier information must be preserved for further investigation

[Figure: v versus t, with the window between vref − εacc and vref + εacc around vref; outliers are marked with x]

CONSIDERED CASES

[Figure: two plots of v versus t, each showing vref and the band from vref − εacc to vref + εacc. Panels (a) and (b) cover the oscillatory/bursty case and the expected-trend (slow change) case; the window width W, its height H and an outlier are marked]

By considering Z aggregate values and J outliers out of a set of N measures, the algorithm is considered efficient if the output is composed of (Z+J) values instead of N, where (Z+J) << N

ALGORITHM BANDWIDTH

Compressing data amounts to lowering the bandwidth of the measurement system

The window width determines the number of measured values which are aggregated
  1-point window: no compression, maximum bandwidth

The window width also determines the timeliness with which data are delivered to the base station

ALGORITHM INPUT/OUTPUT

INPUT PARAMETERS
  TIME SERIES V = <v1, v2, … vn>
  EXPECTED VALUE vref
  ACCURACY TOLERANCE εacc
  PRECISION TOLERANCE εprec
  WINDOW WIDTH N
  CONTINUITY INTERVAL C

OUTPUT PARAMETERS
  AGGREGATE VALUES T = < a1,t1 >; < a2,t2 >; … < az,tz >
  OUTLIERS O = < o1,t1 >; < o2,t2 >; … < oj,tj >

ALGORITHM COMPLEXITY: O(N)
ALGORITHM FOOTPRINT: 11 KB RAM; 1 KB ROM
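The slides give the interface of the aggregation algorithm but not its full logic, so the following is only a simplified sketch of the window idea, not the authors' implementation. Assumptions: the window is centred on vref, each aggregate is stamped with the time of the last in-window sample, and the precision tolerance εprec and the continuity interval C are not modelled. All names are hypothetical.

```python
def aggregate(series, v_ref, eps_acc, window):
    """Single pass over (value, time) samples, grouped in windows of `window` points.

    Values inside [v_ref - eps_acc, v_ref + eps_acc] are replaced by their
    per-window average; values outside are kept as outliers.
    """
    aggregates, outliers = [], []
    for first in range(0, len(series), window):
        chunk = series[first:first + window]
        inside = [(v, t) for v, t in chunk if abs(v - v_ref) <= eps_acc]
        outliers.extend((v, t) for v, t in chunk if abs(v - v_ref) > eps_acc)
        if inside:
            average = sum(v for v, _ in inside) / len(inside)
            # Assumption: stamp the aggregate with the last in-window sample time.
            aggregates.append((average, inside[-1][1]))
    return aggregates, outliers


# Hypothetical series of N = 8 samples, with one outlier at t = 3.
values = [10.0, 10.2, 9.8, 12.0, 10.1, 9.9, 10.0, 10.0]
series = [(v, t) for t, v in enumerate(values)]
aggregates, outliers = aggregate(series, v_ref=10.0, eps_acc=0.5, window=4)
print(aggregates, outliers)
```

Here the output holds Z = 2 aggregates and J = 1 outlier, so 3 values are transmitted instead of 8, and each sample is visited once, in line with the stated O(N) complexity.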

EXPERIMENTAL SET UP

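[Figure: measurement circuit with a voltage source (+/−), the impedance Z(t), the resistors R, the current i, and the voltages v1 and ΔV = v2]

100 Ω < ZR(t) < 1000 Ω (measured)
R = 1 Ω
0 mV < ΔV < 30 mV
0 mA < i < 30 mA (data sheet)
R + ZR ≈ ZR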

ALGORITHM BEHAVIOUR

[Figure: transmitted values with and without aggregation]
  WITH AGGREGATION: 7 TRANSMITTED VALUES, 30 mJ
  WITHOUT AGGREGATION (BYPASS): 60 TRANSMITTED VALUES, 120 mJ

70% ENERGY SAVINGS

COMPARISON CRITERIA (1/2)

Two real-world data sets have been processed by using the proposed algorithm and two other aggregation algorithms:
  I. Lazaridis, S. Mehrotra, Capturing Sensor-Generated Time Series with Quality Guarantees, in: ICDE, 2003, pp. 429–439.
  T. Schoellhammer, E. Osterweil, B. Greenstein, M. Wimbrow, D. Estrin, Lightweight Temporal Compression of Microclimate Datasets, in: LCN, 2004, pp. 516–524.

COMPARISON CRITERIA (2/2)

The comparison among algorithms has been based on three main criteria:
  Compression rate: the degree to which data have been aggregated.
  Energy savings: the degree to which the aggregation allows sensors to save energy with respect to the case in which all the original values are sent to the base station.
  Correctness: the degree to which the aggregated data allow the base station to retrieve the original trend. Correctness has been evaluated by using the Mean Absolute Error (MAE) and the related Mean Absolute Percentage Error (MA%E).
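The correctness measures can be sketched as follows. MAE and MA%E use their standard definitions; the compression rate formula (one minus the share of values transmitted) is an assumption, since the slides do not spell it out. The sample signal is hypothetical, while 7 and 60 are the transmitted-value counts quoted earlier.

```python
def mae(original, reconstructed):
    """Mean Absolute Error between the original and the reconstructed series."""
    return sum(abs(o - r) for o, r in zip(original, reconstructed)) / len(original)


def mape(original, reconstructed):
    """Mean Absolute Percentage Error, in percent (undefined if an original value is 0)."""
    ratios = [abs((o - r) / o) for o, r in zip(original, reconstructed)]
    return 100.0 * sum(ratios) / len(ratios)


def compression_rate(n_original, n_transmitted):
    """Assumed definition: fraction of the original values that is not transmitted."""
    return 1.0 - n_transmitted / n_original


# Hypothetical original signal and its reconstruction at the base station.
original = [10.0, 20.0, 40.0]
reconstructed = [11.0, 18.0, 40.0]
print(mae(original, reconstructed))
print(mape(original, reconstructed))
print(compression_rate(60, 7))
```

MAE is expressed in the units of the signal, whereas the percentage error allows comparison across data sets with different value ranges.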

DATA SET (A) RESULTS

[Figure: data set (A), C2N2 absorption spectrum. Signal [V] (axis from 0.13 to 0.19) versus [t] (0 to 160), with regions a, b, c marked. Three plots: Cappiello and Schreiber; Lazaridis et al.; Schoellhammer et al.]

[Figure: bar charts of compression rate (axis from 60% to 90%) and energy reduction (axis from 0% to 60%) for [Authors], [Lazaridis et al.] and [Schoellhammer et al.]. Note on the slide: MAE in case of non linear trends]


DATA SET (B) RESULTS

[Figure: data set (B), C2N2 absorption spectrum FM. Input data set and algorithm output versus [t] (0 to 160), value axis from -8 to 6. Three plots: Cappiello and Schreiber; Lazaridis et al.; Schoellhammer et al. Annotation: systematic error due to the processing time shift]

[Figure: bar charts of compression rate (axis from 65% to 95%) and energy savings (axis from 0% to 60%)]

SUMMARY COMPARISON AND COMMENTS

No single algorithm is «the best»
Transmission procedures with packet-based protocols can affect the analysis
  Higher packing factors improve energy efficiency
  Higher transmission delays negatively affect timeliness
Adaptable procedures should be used on the basis of
  The peculiar features of the signals to be processed
  The quality requirements of the applications

Programs and Data


Philosophy without Science is empty, Science without Philosophy is blind
I. Kant

PARAPHRASE
Programs without Data are empty, Data without Programs are blind
F. A. Schreiber

SUMMARY AND CONCLUSIONS


Experiments on Databases and DBMSs for optimizing data structures and management, including Data Quality

Data organization and management as a service to the experiments of the scientific community

Experimenting with the Database content itself (data mining)

Experimentation is both:
  a science, because it requires formal and rigorous methodologies, languages, and instruments
  an art, because it requires intuition, imagination, and … it gives emotions

BIBLIOGRAPHICAL REFERENCES


1. Babu S. et Al. – Automated Experiment-Driven Management of (Database) Systems – Proc. 12th HotOS, pp. 1 – 5, 2009

2. Boral H., DeWitt D. J. – A Methodology for Database Systems Performance Evaluation – SIGMOD Record, Vol. 14, n. 2, pp. 176-185, 1984

3. Brown D. et Al. – High energy nuclear database: a testbed for nuclear data information technology – Int. Conf. On nuclear data for Science and Technology, art. 250, 2007

4. Cappiello C., Schreiber F.A. - Experiments and analysis of quality and Energy-aware data aggregation approaches in WSNs - 10th Int. Workshop on Quality in Databases QDB 2012, Istanbul, Aug. 26, 2012, pp. 1- 8 http://www.purdue.edu/discoverypark/cyber/qdb2012/papers/7data%20aggregation.pdf

5. Curino C. et Al. – Schema Evolution in Wikipedia: Toward a Web Information System Benchmark – Proc. ICEIS, pp. 323 – 332, 2008

6. Curino et Al. – Graceful Database Schema Evolution: the PRISM Workbench – Proc. VLDB’08, pp. 761 – 772, 2008

7. Davcev D. et Al. – Experiments in Data Management for Wireless Sensor Networks – Proc. 2° Int. Conf. on Sensor Technologies and Applications , pp. 198 – 202, 2008

8. Manolescu I. et Al. - The Repeatability Experiment of SIGMOD 2008 - SIGMOD Record, Vol. 37, n. 1, pp. 39 – 45, 2008


9. Marche S. – Measuring the stability of data models – European Journal of Information Systems, Vol.2, n.1, pp. 37 – 47, 1993

10. Masseroli M. - Management and Analysis of Genomic Functional and Phenotypic Controlled Annotations to Support Biomedical Investigation and Practice - IEEE Transactions on Information Technology in Biomedicine, Vol. 11, n. 4, pp. 376-385, 2007

11. Sjoberg D. I. – Quantifying schema evolution – Information and Software Technology, Vol. 35, n. 1, pp. 35 - 44, 1993

12. Stoeckert C. et Al. – Microarray databases: standards and ontologies – Nature genetics, Vol. 32, pp. 469 – 473, 2002

13. Szalay A., Gray J. – The world-wide telescope – Science, Vol. 293, pp. 2037 – 2040, 2001

14. Vanschoren J., Blockeel H. – Experiment Databases - In: Dzeroski S., Goethals B., Panov P. (Eds.), Inductive Databases and Queries: Constraint-based Data Mining, Chapt. 14, Springer, pp. 335 - 360, 2010

15. Schwartz B. – The four fundamental performance metrics – PERCONA, 2011 http://www.mysqlperformanceblog.com/2011/04/27/the-four-fundamental-performance-metrics/

16. http://www.tpc.org/information/benchmarks.asp

https://lirias.kuleuven.be/handle/123456789/273221
