2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms

P. Arocena, B. Glavic, G. Mecca, R. J. Miller, P. Papotti, D. Santoro. University of Toronto, Illinois Institute of Technology, Sep 7th 2016

Upload: boris-glavic

Post on 19-Feb-2017


Page 1: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms

P. Arocena, B. Glavic, G. Mecca, R. J. Miller, P. Papotti, D. Santoro

University of Toronto, Illinois Institute of Technology, Università della Basilicata, Arizona State University

Sep 7th 2016

Page 2: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms

Overview

‣ Motivations and Goals

‣ Main Ideas

‣ Optimizations

‣ Experimental Results

Page 3: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms

Motivation

• Ensuring data quality is a crucial task in data management
• Many automatic and semi-automatic data-cleaning algorithms have been proposed

constraint-based: Beskales et al. VLDB10; Bohannon et al. SIGMOD05; Chu et al. ICDE13; Cong et al. VLDB07; Geerts et al. VLDB14; …

statistics-based: Berti-Equille et al. ICDE11; Dasu et al. VLDB12; Prokoshyna et al. VLDB15; Yakout et al. SIGMOD13; …

Page 4: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms

Motivation

“What is the right tool for my data-cleaning task?”

Page 5: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms

Challenges

• No openly available tools or datasets for benchmarking data-cleaning algorithms
• Approaches are usually evaluated using either
  • manually generated errors: very expensive!
  • automatically introduced errors in clean data: algorithms are highly sensitive to the characteristics of the errors!
• Need for scalable and robust evaluation

Page 6: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms

Contribution

• Bart: Benchmarking Algorithms for data Repairing and Translation
• open-source error-generation system with a high level of control over the errors
• Input: a database that is clean with respect to a set of data-quality rules, and a set of configuration parameters
• Output: a dirty database (obtained through a set of cell changes) and an estimate of how hard it will be to restore the original values

Page 7: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms

Overview

‣ Motivations and Goals
‣ Main Ideas
  ‣ Detectability
  ‣ Repairability
  ‣ Violation-Generation Queries
‣ Optimizations
‣ Experimental Results

Page 8: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms

A Motivating Example

Player

      Name      Season   Team       Stadium           Goals
t1    Giovinco  2013-14  Juventus   Juventus Stadium  3
t2    Giovinco  2014-15  Toronto    BMO Field         23
t3    Pirlo     2014-15  Juventus   Juventus Stadium  5
t4    Pirlo     2015-16  N.Y. City  Yankee St.        0
t5    Vidal     2014-15  Juventus   Juventus Stadium  5
t6    Vidal     2015-16  Bayern     Allianz Arena     3

Quality Rules

Functional dependencies:
Name, Season → Team
Team → Stadium

Represented as Denial Constraints: a very expressive language that captures most data-quality rules used for data repairing: FDs, CFDs, Cleaning EGDs, Editing Rules, Fixing Rules, Ordering Constraints

dc1: ¬( Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), n=n’, s=s’, t ≠ t’ )
dc2: ¬( Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), t=t’, st ≠ st’ )

Violation: an instance I violates ¬(φ(x)) if there is an assignment m s.t. I ⊨ φ(m(x))
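The violation definition above can be made concrete with a small sketch. It assumes a dc body can be modelled as a Python predicate over a pair of tuples; the names `PLAYER`, `dc1`, `dc2` and `violations` are illustrative and are not part of Bart.

```python
from itertools import permutations

# Player instance from the slide: (Name, Season, Team, Stadium, Goals)
PLAYER = {
    "t1": ("Giovinco", "2013-14", "Juventus", "Juventus Stadium", 3),
    "t2": ("Giovinco", "2014-15", "Toronto", "BMO Field", 23),
    "t3": ("Pirlo", "2014-15", "Juventus", "Juventus Stadium", 5),
    "t4": ("Pirlo", "2015-16", "N.Y. City", "Yankee St.", 0),
    "t5": ("Vidal", "2014-15", "Juventus", "Juventus Stadium", 5),
    "t6": ("Vidal", "2015-16", "Bayern", "Allianz Arena", 3),
}


def dc1(a, b):
    """Body of dc1: same name and season, different team."""
    return a[0] == b[0] and a[1] == b[1] and a[2] != b[2]


def dc2(a, b):
    """Body of dc2: same team, different stadium."""
    return a[2] == b[2] and a[3] != b[3]


def violations(instance, body):
    """Ordered pairs of tuple ids (an assignment m) that satisfy the dc body."""
    return [(i, j) for i, j in permutations(instance, 2)
            if body(instance[i], instance[j])]


# The instance is clean: no assignment satisfies either body.
print(violations(PLAYER, dc1), violations(PLAYER, dc2))

# Changing t5.Stadium introduces violations of dc2 involving t5.
dirty = dict(PLAYER)
dirty["t5"] = ("Vidal", "2014-15", "Juventus", "Camp Nou", 5)
print(violations(dirty, dc2))
```

Enumerating all ordered pairs is quadratic and only meant to mirror the definition on a six-tuple example.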

Page 9: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms

A Motivating Example

Player (same instance and constraints dc1, dc2 as on Page 8)

Cell Changes

ch1: t5.Stadium := “Camp Nou”

✔ ch1 is a detectable change: dc2 is violated since t1, t3 and t5 have the same team but different stadiums

We call {t1, t3, t5} the context equivalence class of the change

✔ easy to correct: the original value “Juventus Stadium” appears in t1 and t3

Repairability: the probability of restoring t5.Stadium to its original value by picking, uniformly at random, a Stadium value from its context equivalence class

Rep = 2 / 3 ≈ 0.67

Page 10: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms

A Motivating Example

Player (same instance and constraints dc1, dc2 as on Page 8)

Cell Changes

ch2: t1.Season := “2014-15”

✔ ch2 is a detectable change: dc1 is violated since t1 and t2 have the same name and season but different teams (as well as different stadiums and goals)

✘ hard to correct: the original value “2013-14” disappears from the instance

Repairability: 0 / 2 = 0
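The two repairability values above (2/3 for ch1, 0 for ch2) follow directly from the definition. A minimal sketch, assuming the context equivalence class is given as the list of values it holds after the change; the function name `repairability` is illustrative:

```python
from fractions import Fraction


def repairability(context_values, original):
    """Probability that a value picked uniformly at random from the
    context equivalence class equals the original value."""
    hits = sum(1 for value in context_values if value == original)
    return Fraction(hits, len(context_values))


# ch1: Stadium values of {t1, t3, t5} after t5.Stadium := "Camp Nou"
rep_ch1 = repairability(
    ["Juventus Stadium", "Juventus Stadium", "Camp Nou"], "Juventus Stadium")

# ch2: Season values of {t1, t2} after t1.Season := "2014-15"
rep_ch2 = repairability(["2014-15", "2014-15"], "2013-14")

print(rep_ch1, rep_ch2)
```

Using `Fraction` keeps the exact ratio 2/3 instead of a rounded decimal.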

Page 11: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms

A Motivating Example

Player (same instance and constraints dc1, dc2 as on Page 8)

Cell Changes

ch3: t5.Name := “Pirlo” ✘ is an undetectable change (t5 then agrees with t3 on name, season and team, so dc1 is not violated)

INTERACTION

ch2: t1.Season := “2014-15” ✔

ch4: t3.Name := “Pirlo” ✔

We need to keep track of the context of each change

Page 12: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms

Violation-Generation Queries

• Each comparison in a dc suggests a different strategy for finding cells to modify in order to generate detectable errors
• Starting from a dc, we generate a set of vio-gen queries

      Name      Season   Team
t1    Giovinco  2013-14  Juventus
t2    Giovinco  2013-14  Juventus
t3    Pirlo     2013-14  N.Y. City

dc1: ¬( Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), n=n’, s=s’, t ≠ t’ )

vio-gen query: Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), n=n’, s=s’, t = t’
Result of the query: t1, t2
We’ll have a detectable change by making t1.Team and t2.Team different: t1.Team := “Juve” ✔

vio-gen query: Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), n ≠ n’, s=s’, t ≠ t’
Result of the query: t2, t3
We’ll have a detectable change by making t2.Name and t3.Name equal: t3.Name := “Giovinco”
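The two vio-gen queries above each invert one comparison of dc1. The sketch below assumes this is the general rule (one query per inverted comparison) and that every comparison relates the same attribute in the two tuples. The names `DC1`, `vio_gen_queries` and `run` are illustrative, not Bart's API.

```python
from itertools import combinations

# The three-tuple instance from this slide.
ROWS = {
    "t1": {"Name": "Giovinco", "Season": "2013-14", "Team": "Juventus"},
    "t2": {"Name": "Giovinco", "Season": "2013-14", "Team": "Juventus"},
    "t3": {"Name": "Pirlo", "Season": "2013-14", "Team": "N.Y. City"},
}

# dc1 as a list of (attribute, operator) comparisons between the two tuples.
DC1 = [("Name", "="), ("Season", "="), ("Team", "≠")]
INVERSE = {"=": "≠", "≠": "="}


def vio_gen_queries(comparisons):
    """One query per comparison, obtained by inverting that comparison only."""
    queries = []
    for k, (attr, op) in enumerate(comparisons):
        flipped = list(comparisons)
        flipped[k] = (attr, INVERSE[op])
        queries.append(flipped)
    return queries


def run(query, rows):
    """Unordered pairs of tuple ids that satisfy every comparison of the query."""
    def holds(a, b, attr, op):
        return (a[attr] == b[attr]) if op == "=" else (a[attr] != b[attr])
    return [(i, j) for i, j in combinations(rows, 2)
            if all(holds(rows[i], rows[j], attr, op) for attr, op in query)]


for query in vio_gen_queries(DC1):
    print(query, run(query, ROWS))
```

Each returned pair is one comparison away from violating dc1, so changing a single cell in it (for example a Team value for the pair t1, t2) yields a detectable error.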

Page 13: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms

Error-Generation Task

EG-Task E = {S, Σ, I, CONF}

• S: relational schema
• Σ: a set of denial constraints over S
• I: an instance over schema S, clean wrt Σ
• CONF: configuration parameters
  • % of detectable errors, % of random errors

• Theorem 1: Generating the requested number of detectable errors is NP-complete (data complexity)

Page 14: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms

Overview

‣ Motivations and Goals

‣ Main Ideas

‣ Optimizations

‣ Experimental Results

Page 15: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms

Optimizations

• Greedy PTIME algorithm
  • two cell changes cannot share a context
  • sound but not complete
  • in practice, for low error ratios (~10-20%), the probability of success is very high
• Main cost factor
  • executing vio-gen queries on the DBMS
  • optimizations for symmetric constraints and cross-products
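The greedy idea in the bullets above can be illustrated as follows. This is a sketch of the "no shared context" rule only, not Bart's actual algorithm: contexts are assumed to be sets of tuple ids, and the function name `greedy_select` and the candidate lists are hypothetical.

```python
def greedy_select(candidates, requested):
    """Scan candidates in order and accept a change only if its context is
    disjoint from the contexts of all changes accepted so far."""
    used = set()
    chosen = []
    for cell, context in candidates:
        if len(chosen) == requested:
            break
        if used.isdisjoint(context):
            chosen.append(cell)
            used |= context
    return chosen


# ch1 and ch2 from the motivating example share t1 in their contexts;
# the third candidate is hypothetical and added only for illustration.
candidates = [
    ("t5.Stadium", {"t1", "t3", "t5"}),
    ("t1.Season", {"t1", "t2"}),
    ("t6.Team", {"t4", "t6"}),
]
print(greedy_select(candidates, 2))

# Not complete: accepting A first blocks both B and C, although {B, C}
# would have provided the two requested changes.
blocking = [("A", {"x", "y"}), ("B", {"x"}), ("C", {"y"})]
print(greedy_select(blocking, 2))
```

Every accepted change keeps its own context untouched by the others, which is the soundness side; the second call shows why the procedure may return fewer changes than requested.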

Page 16: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms

Symmetric Constraints

• Computing joins may be expensive!
• We identify a class of DCs (which includes FDs and most CFDs) where group-by can be used to reduce the size of join inputs
• Idea: find and execute isomorphic subqueries to avoid redundant work

Player(n, s, t, st), Player(n’, s’, t’, st’), n=n’, s=s’, t ≠ t’

1. Formula Graph

[Figure: two Player nodes with attributes Name, Season, Team, Stadium, connected by the edges n = n’, s = s’, t ≠ t’]

2. Reduced Formula with adornments

Player(n=, s=, t≠, st)

3. Group-By Query

SELECT name, season, team
FROM player
WHERE (name, season) IN
  (SELECT name, season
   FROM player
   GROUP BY name, season
   HAVING count(DISTINCT team) > 1)
ORDER BY name, season
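The group-by query can be tried on the motivating example with an in-memory SQLite database. This is a sketch under two assumptions: the table layout `player(name, season, team, stadium, goals)` is inferred from the slides, and the tuple comparison in the `IN` clause is written with SQLite row-value syntax, which needs SQLite 3.15 or later.

```python
import sqlite3

# Player instance with ch2 applied (t1.Season := "2014-15"),
# so that t1 and t2 violate dc1.
ROWS = [
    ("Giovinco", "2014-15", "Juventus", "Juventus Stadium", 3),
    ("Giovinco", "2014-15", "Toronto", "BMO Field", 23),
    ("Pirlo", "2014-15", "Juventus", "Juventus Stadium", 5),
    ("Pirlo", "2015-16", "N.Y. City", "Yankee St.", 0),
    ("Vidal", "2014-15", "Juventus", "Juventus Stadium", 5),
    ("Vidal", "2015-16", "Bayern", "Allianz Arena", 3),
]

QUERY = """
SELECT name, season, team
FROM player
WHERE (name, season) IN
  (SELECT name, season
   FROM player
   GROUP BY name, season
   HAVING count(DISTINCT team) > 1)
ORDER BY name, season
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE player "
    "(name TEXT, season TEXT, team TEXT, stadium TEXT, goals INTEGER)")
conn.executemany("INSERT INTO player VALUES (?, ?, ?, ?, ?)", ROWS)

# Only the (name, season) groups with more than one distinct team survive.
result = conn.execute(QUERY).fetchall()
print(result)
conn.close()
```

The subquery groups the table once instead of joining it with itself, which is the point of the optimization for symmetric constraints.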

Page 17: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms

Cross Products

A Common Pattern

dc4: ¬( Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), t=t’, st ≠ st’ )

vio-gen query: Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), t ≠ t’, st ≠ st’

The result of the vio-gen query is all possible pairs of players with different team and different stadium: quadratic cost

However: we are typically only interested in a small set of cells

Solution: we materialize a random sample of the tuples in Player in main memory and compute the cross product to identify cells to change and their contexts
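The sampling idea can be sketched in a few lines. The sample size, the fixed seed and the function name `sampled_pairs` are assumptions made for illustration; the slides do not say how Bart chooses the sample.

```python
import random
from itertools import combinations

# Team and Stadium columns of the Player instance from the motivating example.
PLAYER = {
    "t1": {"Team": "Juventus", "Stadium": "Juventus Stadium"},
    "t2": {"Team": "Toronto", "Stadium": "BMO Field"},
    "t3": {"Team": "Juventus", "Stadium": "Juventus Stadium"},
    "t4": {"Team": "N.Y. City", "Stadium": "Yankee St."},
    "t5": {"Team": "Juventus", "Stadium": "Juventus Stadium"},
    "t6": {"Team": "Bayern", "Stadium": "Allianz Arena"},
}


def sampled_pairs(rows, sample_size, seed=0):
    """Evaluate the vio-gen query (t ≠ t’, st ≠ st’) on a random sample of
    tuples held in memory instead of on the full cross product."""
    rng = random.Random(seed)
    ids = rng.sample(sorted(rows), min(sample_size, len(rows)))
    return [(i, j) for i, j in combinations(ids, 2)
            if rows[i]["Team"] != rows[j]["Team"]
            and rows[i]["Stadium"] != rows[j]["Stadium"]]


pairs = sampled_pairs(PLAYER, sample_size=3)
print(pairs)
```

With a sample of k tuples at most k(k-1)/2 pairs are examined, regardless of the size of Player.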

Page 18: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms

Overview

‣ Motivations and Goals

‣ Main Ideas

‣ Optimizations

‣ Experimental Results

Page 19: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms

Evaluation of the Tools

Tools
- Llunatic: Geerts et al. VLDB14
- Holistic: Chu et al. ICDE13
- Greedy: Bohannon et al. SIGMOD05, Cong et al. VLDB07
- Sampling: Beskales et al. VLDB10

Tasks
- Constraint-based with 5% errors and different repairability levels: High (~0.8), Med (~0.5), and Low (~0.25)

Page 20: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms

Scalability Results

Page 21: 2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms

Lessons Learned

• Automated tools are essential for robust and broad empirical evaluations
• Data repairing is not yet mature: there is no definitive automatic data-repairing algorithm yet
• Repairability matters
  • We need to document our dirty data
  • Algorithms are sensitive to error characteristics!
• Generating errors is hard
