VLDB 2016 - Messing Up with BART: Error Generation for Evaluating Data-Cleaning Algorithms
P. Arocena, B. Glavic, G. Mecca, R. J. Miller, P. Papotti, D. Santoro
University of Toronto, Illinois Institute of Technology, Università della Basilicata, Arizona State University
Sep 7th 2016
Overview
‣ Motivations and Goals
‣ Main Ideas
‣ Optimizations
‣ Experimental Results
Motivation
• Data quality is a crucial task in data management
• Many automatic and semi-automatic data-cleaning algorithms have been proposed
  • constraint-based: Beskales et al. VLDB10, Bohannon et al. SIGMOD05, Chu et al. ICDE13, Cong et al. VLDB07, Geerts et al. VLDB14, …
  • statistics-based: Berti-Equille et al. ICDE11, Dasu et al. VLDB12, Prokoshyna et al. VLDB15, Yakout et al. SIGMOD13, …
“What is the right tool for my
data-cleaning task?”
Challenges
• No openly available tools or datasets for benchmarking data-cleaning algorithms
• Approaches are usually evaluated by using either
  • manually generated errors: very expensive!
  • automatically introduced errors in clean data: algorithms are highly sensitive to the characteristics of the errors!
• Need for scalable and robust evaluation
Contribution
• BART: Benchmarking Algorithms for data Repairing and Translation
  • open-source error-generation system with a high level of control over the errors
• Input: a database that is clean wrt a set of data-quality rules, and a set of configuration parameters
• Output: a dirty database (obtained through a set of cell changes) and an estimate of how hard it will be to restore the original values
Overview
‣ Motivations and Goals
‣ Main Ideas
  ‣ Detectability
  ‣ Repairability
  ‣ Violation-Generation Queries
‣ Optimizations
‣ Experimental Results
A Motivating Example
Player
     Name      Season   Team       Stadium           Goals
t1   Giovinco  2013-14  Juventus   Juventus Stadium  3
t2   Giovinco  2014-15  Toronto    BMO Field         23
t3   Pirlo     2014-15  Juventus   Juventus Stadium  5
t4   Pirlo     2015-16  N.Y. City  Yankee St.        0
t5   Vidal     2014-15  Juventus   Juventus Stadium  5
t6   Vidal     2015-16  Bayern     Allianz Arena     3

Functional dependencies
Name, Season → Team
Team → Stadium

Quality Rules
Represented as Denial Constraints: a very expressive language that captures most data-quality rules used for data repairing (FDs, CFDs, Cleaning EGDs, Editing Rules, Fixing Rules, Ordering Constraints)
dc1: ¬( Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), n=n’, s=s’, t ≠ t’ )
dc2: ¬( Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), t=t’, st ≠ st’ )

Violation
An instance I violates ¬(φ(x)) if there is an assignment m s.t. I ⊨ φ(m(x))
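The violation definition can be made concrete with a small sketch. It assumes each denial constraint is written by hand as a Python predicate over a pair of tuples (the names `dc1`, `dc2`, and `violations` are illustrative, not BART's API), and it checks every ordered pair of tuples as a candidate assignment. The dirty value placed on t5 is just an example of a value that breaks dc2.

```python
from itertools import permutations

# Player instance from the slide: (Name, Season, Team, Stadium, Goals).
PLAYER = {
    "t1": ("Giovinco", "2013-14", "Juventus", "Juventus Stadium", 3),
    "t2": ("Giovinco", "2014-15", "Toronto", "BMO Field", 23),
    "t3": ("Pirlo", "2014-15", "Juventus", "Juventus Stadium", 5),
    "t4": ("Pirlo", "2015-16", "N.Y. City", "Yankee St.", 0),
    "t5": ("Vidal", "2014-15", "Juventus", "Juventus Stadium", 5),
    "t6": ("Vidal", "2015-16", "Bayern", "Allianz Arena", 3),
}


def dc1(a, b):
    """Body of dc1: same name and season, different team."""
    return a[0] == b[0] and a[1] == b[1] and a[2] != b[2]


def dc2(a, b):
    """Body of dc2: same team, different stadium."""
    return a[2] == b[2] and a[3] != b[3]


def violations(instance, body):
    """Return every ordered pair of tuple ids that satisfies the constraint body."""
    return [(i, j) for i, j in permutations(sorted(instance), 2)
            if body(instance[i], instance[j])]


# The original instance satisfies both constraints.
clean_ok = not violations(PLAYER, dc1) and not violations(PLAYER, dc2)

# Dirty copy: t5 keeps its team but gets a different stadium.
dirty = dict(PLAYER)
dirty["t5"] = ("Vidal", "2014-15", "Juventus", "Camp Nou", 5)
dirty_pairs = violations(dirty, dc2)
```

Checking all pairs is quadratic in the number of tuples, which is acceptable for a six-row example but not for a real database.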
A Motivating Example
Cell Changes
ch1: t5.Stadium := “Camp Nou”
✔ ch1 is a detectable change: dc2 is violated, since t1, t3, and t5 have the same team but different stadiums
We call {t1, t3, t5} the context equivalence class
✔ Easy to correct: the original value “Juventus Stadium” still appears in t1 and t3
Repairability: the probability of restoring t5.Stadium to its original value by picking, uniformly at random, a Stadium value from its context equivalence class
Rep = 2/3 ≈ 0.67
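The repairability figure can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes the context equivalence class is given simply as the list of values found in the changed attribute after the change; the function name `repairability` is illustrative.

```python
from fractions import Fraction


def repairability(context_values, original):
    """Probability that a uniform random pick from the class restores `original`."""
    if not context_values:
        return Fraction(0)
    hits = sum(1 for value in context_values if value == original)
    return Fraction(hits, len(context_values))


# ch1: Stadium values of the class {t1, t3, t5} after the change.
ch1_context = ["Juventus Stadium", "Juventus Stadium", "Camp Nou"]
rep_ch1 = repairability(ch1_context, "Juventus Stadium")

# A class in which the original value no longer occurs has repairability 0.
rep_lost = repairability(["2014-15", "2014-15"], "2013-14")
```

Using `Fraction` keeps the result exact (2/3) instead of a rounded decimal.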
A Motivating Example
Cell Changes
ch2: t1.Season := “2014-15”
✔ ch2 is a detectable change: dc1 is violated, since t1 and t2 have the same name and season but different teams, stadiums, and goals
✘ Hard to correct: the original value “2013-14” disappears from the instance
Repairability: Rep = 0/2 = 0
A Motivating Example
Cell Changes
ch3: t5.Name := “Pirlo”
✘ ch3 is an undetectable change
Interaction
ch2: t1.Season := “2014-15” ✔
ch4: t3.Name := “Pirlo” ✔
✘ Changes that are detectable in isolation may stop being detectable once they are applied together
We need to keep track of the context of each change
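The effect of interacting changes can be sketched as follows. The sketch makes several simplifying assumptions: only dc1 is checked, a change counts as detectable when its tuple takes part in some dc1 violation after all changes are applied, and the second interacting change (`extra`, on t2.Name) is a hypothetical one chosen for illustration rather than a change taken from the slide.

```python
from itertools import permutations

ATTRS = ("Name", "Season", "Team", "Stadium")
PLAYER = {
    "t1": ["Giovinco", "2013-14", "Juventus", "Juventus Stadium"],
    "t2": ["Giovinco", "2014-15", "Toronto", "BMO Field"],
    "t3": ["Pirlo", "2014-15", "Juventus", "Juventus Stadium"],
    "t4": ["Pirlo", "2015-16", "N.Y. City", "Yankee St."],
    "t5": ["Vidal", "2014-15", "Juventus", "Juventus Stadium"],
    "t6": ["Vidal", "2015-16", "Bayern", "Allianz Arena"],
}


def dc1(a, b):
    """Body of dc1: same name and season, different team."""
    return a[0] == b[0] and a[1] == b[1] and a[2] != b[2]


def apply_changes(instance, changes):
    """Return a copy of the instance with each (tuple id, attribute, value) applied."""
    dirty = {tid: list(row) for tid, row in instance.items()}
    for tid, attr, value in changes:
        dirty[tid][ATTRS.index(attr)] = value
    return dirty


def detectable(instance, changes, change):
    """True if the tuple touched by `change` is in a dc1 violation after all `changes`."""
    dirty = apply_changes(instance, changes)
    tid = change[0]
    return any(tid in (i, j) for i, j in permutations(dirty, 2)
               if dc1(dirty[i], dirty[j]))


ch2 = ("t1", "Season", "2014-15")
ch3 = ("t5", "Name", "Pirlo")
extra = ("t2", "Name", "Pirlo")  # hypothetical second change, not from the slide

alone_ch2 = detectable(PLAYER, [ch2], ch2)            # t1 now clashes with t2
alone_ch3 = detectable(PLAYER, [ch3], ch3)            # t5 agrees with t3 on Team
ch2_after_both = detectable(PLAYER, [ch2, extra], ch2)
extra_after_both = detectable(PLAYER, [ch2, extra], extra)
```

In this run ch2 is detectable on its own, but once `extra` renames t2 the violation between t1 and t2 disappears, so ch2 silently becomes undetectable while `extra` itself is detected through t3.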
Violation-Generation Queries
• Each comparison of a dc suggests a different strategy for finding cells to modify to generate detectable errors
• Starting from a dc, we generate a set of vio-gen queries

     Name      Season   Team
t1   Giovinco  2013-14  Juventus
t2   Giovinco  2013-14  Juventus
t3   Pirlo     2013-14  N.Y. City

dc1: ¬( Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), n=n’, s=s’, t ≠ t’ )

vio-gen query: Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), n=n’, s=s’, t = t’
Result of the query: t1, t2
We’ll have a detectable change by making t1.Team and t2.Team different
t1.Team := “Juve” ✔

vio-gen query: Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), n ≠ n’, s=s’, t ≠ t’
Result of the query: t2, t3
We’ll have a detectable change by making t2.Name and t3.Name equal
t3.Name := “Giovinco” ✔
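One way to read the two queries above is that each is obtained from dc1 by negating a single comparison and keeping the others. The sketch below follows that reading; the triple-based representation of comparisons and the name `vio_gen_queries` are assumptions made for illustration, not BART's internal format.

```python
# Negation table for the comparison operators a denial constraint may use.
NEGATE = {"=": "!=", "!=": "=", "<": ">=", ">=": "<", ">": "<=", "<=": ">"}


def vio_gen_queries(comparisons):
    """Build one query per comparison by negating it and keeping the rest unchanged."""
    queries = []
    for k, (left, op, right) in enumerate(comparisons):
        flipped = list(comparisons)
        flipped[k] = (left, NEGATE[op], right)
        queries.append(flipped)
    return queries


# Comparisons of dc1: n = n', s = s', t != t'.
dc1 = [("n", "=", "n'"), ("s", "=", "s'"), ("t", "!=", "t'")]
queries = vio_gen_queries(dc1)
```

Here `queries[2]` is the first query shown on the slide (all three comparisons are equalities) and `queries[0]` is the second one (names differ, seasons agree, teams differ). Tuples returned by such a query miss the violation by exactly one comparison, so changing the cells involved in that comparison produces a detectable error.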
Error-Generation Task
EG-Task E = {S, Σ, I, CONF}
• S: a relational schema
• Σ: a set of denial constraints over S
• I: an instance over schema S, clean wrt Σ
• CONF: configuration parameters
  • % of detectable errors, % of random errors
• Theorem 1: Generating the requested number of detectable errors is NP-complete (data complexity)
Overview
‣ Motivations and Goals
‣ Main Ideas
‣ Optimizations
‣ Experimental Results
Optimizations
• Greedy PTIME algorithm
  • two cell changes cannot share a context
  • sound but not complete
  • in practice, for low error ratios (~10-20%), the probability of success is very high
• Main cost factor
  • executing vio-gen queries on the DBMS
  • optimizations for symmetric constraints and cross-products
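The greedy restriction that two cell changes cannot share a context can be sketched as a single pass over candidate changes. The sketch assumes a context is modeled as a plain set of cells and that candidates arrive in some fixed order; `greedy_select` and the sample candidates are hypothetical.

```python
def greedy_select(candidates, requested):
    """Pick up to `requested` changes whose cells and contexts are pairwise disjoint.

    `candidates` is a list of (cell, context) pairs; a cell is a
    (tuple id, attribute) pair and a context is a set of cells.
    """
    used = set()
    chosen = []
    for cell, context in candidates:
        if len(chosen) == requested:
            break
        footprint = set(context) | {cell}
        if footprint & used:
            continue  # overlaps an earlier change: skip to keep contexts disjoint
        chosen.append(cell)
        used |= footprint
    return chosen


candidates = [
    (("t5", "Stadium"), {("t1", "Stadium"), ("t3", "Stadium")}),
    (("t3", "Stadium"), {("t1", "Stadium"), ("t5", "Stadium")}),
    (("t1", "Season"), {("t2", "Season")}),
]
two = greedy_select(candidates, 2)
three = greedy_select(candidates, 3)
```

Every change returned keeps its context untouched by the others, which is the soundness side. The call asking for three changes returns only two, which illustrates why such a procedure may fall short of the requested number.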
Symmetric Constraints
• Computing joins may be expensive!
• We identify a class of DCs (that includes FDs and most CFDs) where group-by can be used to reduce the size of join inputs
• Idea: find and execute isomorphic subqueries to avoid redundant work

Player(n, s, t, st), Player(n’, s’, t’, st’), n=n’, s=s’, t ≠ t’

1. Formula Graph
[Figure: formula graph with two Player nodes, each linked to its variables (n, s, t, st and n’, s’, t’, st’) by edges labeled Name, Season, Team, Stadium; the variables are connected by n = n’, s = s’, t ≠ t’]

2. Reduced Formula with adornments
Player(n=, s=, t≠, st)

3. Group-By Query
SELECT name, season, team FROM player
WHERE (name, season) IN
  (SELECT name, season FROM player
   GROUP BY name, season
   HAVING count(DISTINCT team) > 1)
ORDER BY name, season
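The group-by idea can be tried out on an in-memory SQLite database. This sketch is an assumption-laden illustration: it uses a four-row table in which t1's season has already been altered, and it joins against the grouped subquery instead of using a row-value `IN`, so that it does not depend on the SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE player (name TEXT, season TEXT, team TEXT, stadium TEXT)")
conn.executemany(
    "INSERT INTO player VALUES (?, ?, ?, ?)",
    [
        ("Giovinco", "2014-15", "Juventus", "Juventus Stadium"),  # season altered
        ("Giovinco", "2014-15", "Toronto", "BMO Field"),
        ("Pirlo", "2014-15", "Juventus", "Juventus Stadium"),
        ("Pirlo", "2015-16", "N.Y. City", "Yankee St."),
    ],
)

# Group once on (name, season), keep groups with more than one team,
# then join back to list the tuples of those groups.
QUERY = """
SELECT p.name, p.season, p.team
FROM player AS p
JOIN (SELECT name, season
      FROM player
      GROUP BY name, season
      HAVING count(DISTINCT team) > 1) AS g
  ON p.name = g.name AND p.season = g.season
ORDER BY p.name, p.season, p.team
"""
rows = conn.execute(QUERY).fetchall()
conn.close()
```

The self-join of `player` with itself is replaced by one aggregation pass plus a join with the (usually much smaller) set of offending groups.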
Cross Products
A Common Pattern
dc4: ¬( Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), t=t’, st ≠ st’ )
vio-gen query: Player(n, s, t, st, g), Player(n’, s’, t’, st’, g’), t ≠ t’, st ≠ st’
The result of the vio-gen query will be all possible pairs of players with different team and different stadium: quadratic cost
However: we are typically only interested in a small set of cells
Solution: we materialize a random sample of the tuples in Player in main memory and compute the cross product there to identify cells to change and their contexts
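The sampling approach can be sketched in a few lines. The function name `sampled_pairs`, the dictionary-based row format, and the fixed seed are assumptions for illustration; how the sample size is chosen in practice is not stated on the slide.

```python
import random
from itertools import combinations

ROWS = [
    {"id": "t1", "team": "Juventus", "stadium": "Juventus Stadium"},
    {"id": "t2", "team": "Toronto", "stadium": "BMO Field"},
    {"id": "t3", "team": "Juventus", "stadium": "Juventus Stadium"},
    {"id": "t4", "team": "N.Y. City", "stadium": "Yankee St."},
    {"id": "t5", "team": "Juventus", "stadium": "Juventus Stadium"},
    {"id": "t6", "team": "Bayern", "stadium": "Allianz Arena"},
]


def sampled_pairs(rows, sample_size, seed=0):
    """Evaluate t != t', st != st' on a random sample instead of the whole table."""
    rng = random.Random(seed)
    sample = rng.sample(rows, min(sample_size, len(rows)))
    # Pairing only the sampled tuples costs O(k^2) for k sampled rows,
    # independent of the size of the full table.
    return [(a["id"], b["id"]) for a, b in combinations(sample, 2)
            if a["team"] != b["team"] and a["stadium"] != b["stadium"]]


pairs = sampled_pairs(ROWS, sample_size=3)
```

Each returned pair is a candidate context: setting one tuple's team equal to the other's would leave the two stadiums different and therefore violate dc4.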
Overview
‣ Motivations and Goals
‣ Main Ideas
‣ Optimizations
‣ Experimental Results
Evaluation of the Tools
Tools
- Llunatic: Geerts et al. VLDB14
- Holistic: Chu et al. ICDE13
- Greedy: Bohannon et al. SIGMOD05, Cong et al. VLDB07
- Sampling: Beskales et al. VLDB10
Tasks
- Constraint-based with 5% errors and different repairability levels: High (~0.8), Med (~0.5), and Low (~0.25)
Scalability Results
Lessons Learned
• Automated tools are essential for robust and broad empirical evaluations
• Data repairing is not yet mature: there is no definitive automatic data-repairing algorithm yet
• Repairability matters
  • We need to document our dirty data
  • Algorithms are sensitive to error characteristics!
• Generating errors is hard