BigDansing presentation slides for KAUST
TRANSCRIPT
BigDansing: A BigData Cleansing System
By: Zuhair Khayyat
InfoCloud group, Computer, Electrical and Mathematical Sciences and Engineering Division
King Abdullah University of Science and Technology (KAUST)
2
3
Example of a Dirty Dataset
● Company employee database:
Name Zipcode City State Salary Rate
t1 Annie 10001 NY NY 24000 15
t2 Laure 90210 LA CA 25000 10
t3 John 60601 CH IL 40000 25
t4 Mark 90210 SF CA 88000 28
t5 Robert 60827 CH IL 15000 15
t6 Mary 90210 LA CA 81000 28
t7 Jon 60601 CH IL 40000 25
4
Example of a Dirty Dataset
● Company employee database:
– Business rule: Any two employees in the same Zipcode must be in the same City.
Name Zipcode City State Salary Rate
t1 Annie 10001 NY NY 24000 15
t2 Laure 90210 LA CA 25000 10
t3 John 60601 CH IL 40000 25
t4 Mark 90210 SF CA 88000 28
t5 Robert 60827 CH IL 15000 15
t6 Mary 90210 LA CA 81000 28
t7 Jon 60601 CH IL 40000 25
6
Example of a Dirty Dataset
● Company employee database:
– Business rule: Any two employees in the same Zipcode must be in the same City. (Here t4's City has been corrected from SF to LA.)
Name Zipcode City State Salary Rate
t1 Annie 10001 NY NY 24000 15
t2 Laure 90210 LA CA 25000 10
t3 John 60601 CH IL 40000 25
t4 Mark 90210 LA CA 88000 28
t5 Robert 60827 CH IL 15000 15
t6 Mary 90210 LA CA 81000 28
t7 Jon 60601 CH IL 40000 25
7
Is Dirty Data a Real Problem?
● “Software expert Hollis Tibbets, the Global Director of Marketing at Dell, estimates that duplicate data and bad data combined cost the U.S. economy over $3 trillion every year”
● “duplicate and dirty data costs the healthcare industry over $300 billion every year.”
● Lost revenue, data repair costs.
– By: Joe Fusaro in www.ringlead.com/blog/dirty-data-costs-economy-3-trillion
8
Is Dirty Data a Real Problem?
● “New research from Experian Data Quality shows that inaccurate data has a direct impact on the bottom line of 88% of companies, with the average company losing 12% of its revenue”
– By: Ben Davis in
https://econsultancy.com/blog/64612-the-cost-of-bad-data-stats/
9
What is Data Cleansing?
● Detecting and correcting corrupt or inaccurate records in a record set, table, or database.
● 25% of the world's critical data is dirty:
– Typos, duplicates, outdated data, missing values
● Dirty data sources:
– Data entry errors
– Data update errors
– Data transmission errors
– Bugs in data processing tools
10
How to Detect Dirty Data?
● Dirty data is detected by declarative rules:
– A formal way to express dirty data.
11
How to Detect Dirty Data?
● Dirty data is detected by declarative rules:
– A formal way to express dirty data.
– Functional dependencies (FD):
● A constraint between two sets of attributes in a relation
● Example: Zipcode → City
12
How to Detect Dirty Data?
● Dirty data is detected by declarative rules:
– A formal way to express dirty data.
– Functional dependencies (FD):
● A constraint between two sets of attributes in a relation
● Example: Zipcode → City
– Conditional functional dependencies (CFD):
● Country = 'Saudi Arabia', Zipcode → City
13
How to Detect Dirty Data?
● Dirty data is detected by declarative rules:
– A formal way to express dirty data.
– Functional dependencies (FD),
– Conditional functional dependency (CFD),
– Denial constraints (DC):
● A set of boolean conditions that must not all be satisfied
● DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ˄ t1.Rate < t2.Rate)
– There cannot exist two tuples in relation D where t1's salary is greater than t2's salary and t1's rate is less than t2's rate.
14
How to Detect Dirty Data?
● Dirty data is detected by declarative rules:
– A formal way to express dirty data.
– Functional dependencies (FD)
– Conditional functional dependency (CFD),
– Denial constraints (DC).
● Or, user-defined functions (UDF):
– Duplicates, statistical errors
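As a concrete illustration (not BigDansing code: the table and rules come from the slides, the function names are made up), checking the FD and the DC on the example relation amounts to a pairwise scan:

```python
from itertools import combinations

# Employee tuples from the slides: (name, zipcode, city, state, salary, rate)
D = [
    ("Annie",  "10001", "NY", "NY", 24000, 15),
    ("Laure",  "90210", "LA", "CA", 25000, 10),
    ("John",   "60601", "CH", "IL", 40000, 25),
    ("Mark",   "90210", "SF", "CA", 88000, 28),
    ("Robert", "60827", "CH", "IL", 15000, 15),
    ("Mary",   "90210", "LA", "CA", 81000, 28),
    ("Jon",    "60601", "CH", "IL", 40000, 25),
]

def fd_violations(tuples):
    """FD Zipcode -> City: equal zipcodes must imply equal cities."""
    return [(a, b) for a, b in combinations(tuples, 2)
            if a[1] == b[1] and a[2] != b[2]]

def dc_violations(tuples):
    """DC: no pair where one salary is higher but its rate is lower."""
    return [(a, b) for a, b in combinations(tuples, 2)
            if (a[4] > b[4] and a[5] < b[5]) or (b[4] > a[4] and b[5] < a[5])]

print(len(fd_violations(D)))  # 2: Mark (SF) conflicts with Laure and Mary (LA)
print(len(dc_violations(D)))  # 2: Laure out-earns Annie and Robert at a lower rate
```

The quadratic pairwise scan is exactly what BigDansing's operators and optimizations later avoid at scale.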
15
Example of a Dirty Dataset
Name Zipcode City State Salary Rate
t1 Annie 10001 NY NY 24000 15
t2 Laure 90210 LA CA 25000 10
t3 John 60601 CH IL 40000 25
t4 Mark 90210 SF CA 88000 28
t5 Robert 60827 CH IL 15000 15
t6 Mary 90210 LA CA 81000 28
t7 Jon 60601 CH IL 40000 25
● FD: Zipcode → City
● Two tuples sharing the same Zipcode must have the same City name.
16
Example of a Dirty Dataset
Name Zipcode City State Salary Rate
t1 Annie 10001 NY NY 24000 15
t2 Laure 90210 LA CA 25000 10
t3 John 60601 CH IL 40000 25
t4 Mark 90210 SF CA 88000 28
t5 Robert 60827 CH IL 15000 15
t6 Mary 90210 LA CA 81000 28
t7 Jon 60601 CH IL 40000 25
● DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ˄ t1.Rate < t2.Rate)
● There cannot exist two tuples in relation D where t1's salary is greater than t2's salary and t1's rate is less than t2's rate.
17
Example of a Dirty Dataset
Name Zipcode City State Salary Rate
t1 Annie 10001 NY NY 24000 15
t2 Laure 90210 LA CA 25000 10
t3 John 60601 CH IL 40000 25
t4 Mark 90210 SF CA 88000 28
t5 Robert 60827 CH IL 15000 15
t6 Mary 90210 LA CA 81000 28
t7 Jon 60601 CH IL 40000 25
● DC: ∀ t1, t2 ∈ D, ¬(simF(t1.name,t2.name) ˄ t1.city = t2.city)
● There cannot exist two tuples in relation D that have similar names and live in the same city.
18
Data Cleansing Process
Dirty data
Quality rules
22
Data Cleansing Process
[Diagram: Dirty data + Quality rules → Detect → Repair → Update → Clean data; at most 10% errors]
● Detect (90% of the runtime) – researchers use:
– Naive code
– DBMS
● Repair (90% of the research) – researchers target:
– Better quality
– Fewer iterations, to reduce runtime
23
A Big Data Challenge
● How to run data cleansing at large scale?
● How to support known declarative rules as well as UDFs?
24
BigDansing architecture
26
BigDansing
✔ Generic abstraction:
– Supports rule-based detection: FD, CFD, DC
– Supports UDF-based detection
– Easy to use, automatic parallelization
– Separates logical from physical operators:
● System independent
● Provides multiple physical optimizations
✔ Fast and scalable detection, repair, and updates:
– 1.9B rows → 13B violations in under 3 hours on 16 small machines
– Related work handles at most 1M rows, on a single machine
27
Abstraction
28
BigDansing Semantics
● The input dataset is represented as a set of data units.
● Each data unit “U”:
– A single row in relational data
– A single triple in RDF data
– A single article in Wikipedia
29
Logical Operators
● Quality rules are represented by five logical operators.
● BigDansing automatically translates declarative rules into logical operators.
● Users are free to implement their own logic using logical operators.
● Fundamental operators: the minimum needed to represent a large set of data quality rules.
[Diagram: a data cleansing job over a database; Detect and GenFix are fundamental operators, while Scope, Block, and Iterate are optional]
30
BigDansing Semantics - Scope
● Scope:
– Input: data units
– Output: data units
● Example: Zipcode → City
– Input:
● t1 – t7
– Output:
● t1 – t7, projected on (Zipcode, City)
Name Zipcode City
t1 Annie 10001 NY
t2 Laure 90210 LA
t3 John 60601 CH
t4 Mark 90210 SF
t5 Robert 60827 CH
t6 Mary 90210 LA
Zipcode City
t1 10001 NY
t2 90210 LA
t3 60601 CH
t4 90210 SF
t5 60827 CH
t6 90210 LA
31
BigDansing Semantics - Block
● Block:
– Input: data unit
– Output: grouping key
● Example: Zipcode → City
– Input:
● t1 – t7
– Output:
● <10001, t1>, <90210, (t2,t4,t6)>, <60601, t3>, <60827, t5>
Zipcode City
t1 10001 NY
t2 90210 LA
t3 60601 CH
t4 90210 SF
t5 60827 CH
t6 90210 LA
32
BigDansing Semantics - Iterate
● Iterate:
– Input: a group of data units
– Output: single tuple, tuple pair
● Example: Zipcode → City
– Input:
● <10001, t1>, <90210, (t2,t4,t6)>, <60601, t3>, <60827, t5>
– Output:
● <t2,t4>, <t2,t6>, <t4,t6>
Zipcode City
t1 10001 NY
t2 90210 LA
t3 60601 CH
t4 90210 SF
t5 60827 CH
t6 90210 LA
33
BigDansing Semantics - Detect
● Detect:
– Input: data units
– Output: Violation(s)
● Example: Zipcode → City
– Input:
● <t2,t4>, <t2,t6>, <t4,t6>
– Output:
● (t2.City ≠ t4.City), (t4.City ≠ t6.City)
Zipcode City
t1 10001 NY
t2 90210 LA
t3 60601 CH
t4 90210 SF
t5 60827 CH
t6 90210 LA
34
Semantics - GenFix
● GenFix:
– Input: Violation
– Output: possible fix(es)
● Example: Zipcode → City
– Input:
● (t2.City ≠ t4.City), (t4.City ≠ t6.City)
– Output:
● (t2.City = t4.City), (t4.City = t6.City)
Zipcode City
t1 10001 NY
t2 90210 LA
t3 60601 CH
t4 90210 SF
t5 60827 CH
t6 90210 LA
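Putting the five operators together for the FD Zipcode → City, the detection pipeline can be sketched as plain functions over the example data (illustrative only; BigDansing's actual operators are system-level, parallel constructs):

```python
from itertools import combinations
from collections import defaultdict

# Data units after Scope (only Zipcode and City kept); names are illustrative.
rows = {"t1": ("10001", "NY"), "t2": ("90210", "LA"), "t3": ("60601", "CH"),
        "t4": ("90210", "SF"), "t5": ("60827", "CH"), "t6": ("90210", "LA")}

def block(units):                    # Block: group data units by Zipcode
    groups = defaultdict(list)
    for tid, (zipc, city) in units.items():
        groups[zipc].append((tid, city))
    return groups

def iterate(groups):                 # Iterate: emit candidate tuple pairs
    for group in groups.values():
        yield from combinations(group, 2)

def detect(pair):                    # Detect: violation if the cities differ
    (_, c1), (_, c2) = pair
    return c1 != c2

def genfix(pair):                    # GenFix: propose equating the cities
    (t1, _), (t2, _) = pair
    return f"{t1}.City = {t2}.City"

fixes = [genfix(p) for p in iterate(block(rows)) if detect(p)]
print(fixes)  # ['t2.City = t4.City', 't4.City = t6.City']
```

Note how Block keeps Iterate from enumerating all pairs: only tuples sharing a Zipcode are ever compared.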
35
Logical Planning
36
Logical Planning
● A logical plan defines the data unit flow.
● Validating the plan:
– At least one input dataset
– For UDFs: at least one Detect
– For rules: at least one rule
● Supports simple and bushy plans
37
Logical Planning – FD example
● FD: Zipcode → City
● Operators:
– Scope(Zipcode,City)
– Block(Zipcode)
– Iterate(n2)
– Detect(tx.City ≠ ty.City)
– GenFix(tx.City = ty.City)
Dataset Scope Block Iterate Detect GenFix
38
Logical Planning – DC example
● DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ˄ t1.Rate < t2.Rate)
● There cannot exist two tuples in relation D where t1's salary is greater than t2's salary and t1's rate is less than t2's rate.
● Operators:
– Scope(Salary,Rate)
– Detect(tx.Salary > ty.Salary AND tx.Rate < ty.Rate)
– GenFix(tx.Salary <= ty.Salary OR tx.Rate >= ty.Rate)
Dataset Scope Detect GenFix
39
Logical Planning – UDF only example
● Dataset: Temperature sensors dataset
● Rule: There cannot exist a tuple in dataset D whose value differs from the average by more than 5º.
Sensor ID Room Temp
t1 1 Bedroom 36.6º
t2 2 Roof 40º
t3 3 Bedroom 35.2º
t4 4 Bedroom 43.1º
t5 5 Bedroom 33.5º
40
Logical Planning – UDF only example
● Dataset: Temperature sensors dataset
● Rule: There cannot exist a tuple in dataset D whose value differs from the average by more than 5º.
● Operators:
– Scope(Room,Temp)
– Block(Room)
– Iterate(Average the list,tx)
– Detect(tx.temp < avg-c OR tx.temp > avg+c)
– GenFix(tx.temp >= avg-c AND tx.temp <= avg+c)
Sensor ID
Room Temp
t1 1 Bedroom 36.6º
t2 2 Roof 40º
t3 3 Bedroom 35.2º
t4 4 Bedroom 43.1º
t5 5 Bedroom 33.5º
Dataset Scope Block Iterate Detect GenFix
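A sequential sketch of this UDF plan on the slide's data (assumed reading: Block groups by Room, Iterate computes the per-room average, and c = 5º; not BigDansing's API):

```python
from collections import defaultdict

# Sensor readings from the slide: (sensor_id, room, temp); constant c = 5º.
readings = [(1, "Bedroom", 36.6), (2, "Roof", 40.0), (3, "Bedroom", 35.2),
            (4, "Bedroom", 43.1), (5, "Bedroom", 33.5)]
c = 5.0

by_room = defaultdict(list)              # Block(Room)
for sid, room, temp in readings:
    by_room[room].append((sid, temp))

violations = []
for room, group in by_room.items():      # Iterate: per-room average
    avg = sum(t for _, t in group) / len(group)
    for sid, temp in group:              # Detect: outside [avg - c, avg + c]
        if temp < avg - c or temp > avg + c:
            violations.append((room, sid, temp))

print(violations)  # sensor 4 (43.1º) is more than 5º above the Bedroom average
```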
42
Logical Plans – Bushy plan
● C1, C2 and C3 are denial constraints from the ICDE 2013 paper:
● "Holistic Data Cleaning: Putting Violations Into Context"
43
Physical Plans
44
Physical Plans
● Physical operators are system specific
– MPI, Hadoop, Spark
● Each physical operator is an independent execution unit.
● Each logical operator → one physical operator.
● BigDansing consolidates logical plans to improve I/O.
● More physical operators can be added with different optimizations to improve logical plans.
45
Physical Plans - Plan consolidation
● Plan consolidation is a static logical plan optimization.
● BigDansing consolidates two similar logical operators if they share the same input.
46
Physical Plans – Physical translation
● FD: Zipcode → City
Dataset Scope Block Iterate Detect GenFix
Dataset PScope PBlock PIterate PDetect PGenFix
47
Physical Plans - Physical translation
● FD: Zipcode → City
Dataset Scope Block Iterate Detect GenFix
Dataset PScope PBlock PIterate PDetect PGenFix
Dataset PScope PBlock PIterate → PDetect → PGenFix
48
Physical Plans – Physical translation
● DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ˄ t1.Rate < t2.Rate)
● There can't exists two tuples where the salary of t1 is greater than t2's salary and the Rate of t1 is less than the t2's rate.
Dataset → Scope → Detect → GenFix
Dataset → PScope → CrossProduct → PDetect → PGenFix
49
Physical Plans – Physical translation
● DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ˄ t1.Rate < t2.Rate)
● There can't exists two tuples where the salary of t1 is greater than t2's salary and the Rate of t1 is less than the t2's rate.
Dataset → Scope → Detect → GenFix
Dataset → PScope → CrossProduct → PDetect → PGenFix
Dataset → PScope → UCrossProduct → PDetect → PGenFix
50
Physical Plans – Physical translation
● DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ˄ t1.Rate < t2.Rate)
● There can't exists two tuples where the salary of t1 is greater than t2's salary and the Rate of t1 is less than the t2's rate.
Dataset → Scope → Detect → GenFix
Dataset → PScope → CrossProduct → PDetect → PGenFix
Dataset → PScope → UCrossProduct → PDetect → PGenFix
Dataset → PScope → OCJoin (distributed sort-merge join) → PDetect → PGenFix
51
Experiments – OCJoin vs. Others
● TaxB dataset: DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ˄ t1.Rate < t2.Rate)
● 16 workers
52
OCJoin Physical Operator
● A self-join on one or more ordering comparisons:
– (<, >, ≥, ≤)
● Reduces the complexity of the cross product by shrinking the search space.
● Steps:
– Partitioning into blocks
– Sorting the blocks
– Pruning
– Joining
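A minimal single-machine sketch of the OCJoin idea for this DC (only the sort-and-scan core; the partitioning into blocks and the block-level pruning of the real distributed operator are omitted):

```python
def ocjoin(tuples):
    """Self-join on ordering comparisons: find pairs (hi, lo) with
    hi.Salary > lo.Salary and hi.Rate < lo.Rate.
    tuples: list of (tid, salary, rate)."""
    ordered = sorted(tuples, key=lambda t: t[1])    # sort once on Salary
    out = []
    for i, (tid_lo, sal_lo, rate_lo) in enumerate(ordered):
        # After sorting, only tuples to the right can have a larger Salary.
        for tid_hi, sal_hi, rate_hi in ordered[i + 1:]:
            if sal_hi <= sal_lo:                    # equal salaries: skip
                continue
            if rate_hi < rate_lo:                   # check the other ordering
                out.append((tid_hi, tid_lo))
    return out

pairs = ocjoin([("t1", 24000, 15), ("t2", 25000, 10), ("t3", 40000, 25),
                ("t4", 88000, 28), ("t5", 15000, 15), ("t6", 81000, 28),
                ("t7", 40000, 25)])
print(pairs)  # [('t2', 't5'), ('t2', 't1')]
```

Sorting lets the scan skip half of the comparisons up front; the distributed operator adds sorted partitions whose value ranges can be pruned against each other.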
53
Repair Algorithms
54
Repair Algorithms – Basics
● BigDansing supports most serial repair algorithms.
● BigDansing exploits the nature of violations:
– Different violations are independent.
● The repair is parallelized by running different instances of the repair algorithm on independent violations.
● We implement two serial repair algorithms to run in distributed mode:
– Equivalence class algorithm
– Hypergraph algorithm
55
Repair Algorithms – Steps
● Connected components → identify independent fixes.
● Each connected component → instance of repair algorithm.
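One way to find those independent components is union-find over the cells that the possible fixes connect (an illustrative sketch, not BigDansing's implementation):

```python
from collections import defaultdict

# Union-find over cells; each resulting connected component is handed to an
# independent instance of the repair algorithm.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# Possible fixes from the running example, as edges between cells:
edges = [("t2.City", "t4.City"), ("t4.City", "t6.City"),
         ("t3.State", "t7.State"),
         ("t2.Salary", "t1.Salary"), ("t2.Salary", "t5.Salary")]
for a, b in edges:
    union(a, b)

components = defaultdict(set)
for cell in list(parent):
    components[find(cell)].add(cell)
print(len(components))  # 3 independent repair instances
```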
56
Equivalence Class Algorithm
● Fix errors based on (=,≠)
● Based on heuristics:
– Partition the possible fixes into different groups
– Assign the most frequent value to each group
● Example:
– Group 1: Zipcode = 60601
● Most frequent City = CH
– Group 2: Zipcode = 90210
● Most frequent City = LA
Name Zipcode City
t1 Annie 60601 NY
t2 Laure 90210 LA
t3 John 60601 CH
t4 Mark 90210 SF
t5 Robert 60601 CH
t6 Mary 90210 LA
t7 Jon 60601 CH
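The heuristic above, applied to the slide's table, reduces to a frequency count per equivalence class (a minimal sketch; names are illustrative):

```python
from collections import Counter, defaultdict

# Tuples from the slide: (tid, zipcode, city)
rows = [("t1", "60601", "NY"), ("t2", "90210", "LA"), ("t3", "60601", "CH"),
        ("t4", "90210", "SF"), ("t5", "60601", "CH"), ("t6", "90210", "LA"),
        ("t7", "60601", "CH")]

# One equivalence class of City cells per Zipcode; repair each class by
# assigning its most frequent value.
classes = defaultdict(list)
for _, zipc, city in rows:
    classes[zipc].append(city)

repair = {zipc: Counter(cities).most_common(1)[0][0]
          for zipc, cities in classes.items()}
print(repair)  # {'60601': 'CH', '90210': 'LA'}
```

So t1's City becomes CH and t4's becomes LA, the majority values of their classes.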
57
Hyper-Graph algorithm
● Fix errors based on (<,>,≤, and ≥).
● Based on linear optimization and greedy MVC:
– Select the hyper-graph node with the most edges
– Change its value according to the edge conditions
[Hypergraph: the Salary and Rate cells of t1–t7, connected by >,< violation edges]
Name Salary Rate
t1 Annie 24000 15
t2 Laure 25000 10
t3 John 40000 25
t4 Mark 88000 24
t5 Robert 15000 15
t6 Mary 81000 28
t7 Jon 40000 25
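The greedy part can be sketched as a vertex-cover loop (illustrative only; the real algorithm additionally solves the edge conditions, e.g. via linear optimization, to choose the new value):

```python
def greedy_cover(violations):
    """Greedy minimum vertex cover: repeatedly repair the cell that
    appears in the most unresolved violations.
    violations: list of sets of cells (hyper-edges)."""
    remaining = [set(v) for v in violations]
    cover = []
    while remaining:
        counts = {}
        for v in remaining:
            for cell in v:
                counts[cell] = counts.get(cell, 0) + 1
        best = max(counts, key=counts.get)   # node with the most edges
        cover.append(best)
        remaining = [v for v in remaining if best not in v]
    return cover

viols = [{"t2.Salary", "t1.Salary"}, {"t2.Salary", "t5.Salary"}]
print(greedy_cover(viols))  # ['t2.Salary']: changing t2 resolves both
```

Picking the highest-degree cell first means one value change can resolve several violations at once.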
58
Repair algorithms – Possible fixes
Name Zipcode City State Salary Rate
t1 Annie 10001 NY NY 24000 15
t2 Laure 90210 LA CA 25000 10
t3 John 60601 CH IL 40000 25
t4 Mark 90210 SF CA 88000 28
t5 Robert 60827 CH IL 15000 15
t6 Mary 90210 LA CA 81000 28
t7 Jon 60601 CH NY 40000 25
● FD: Zipcode → City:
● t2.City = t4.City
● t4.City = t6.City
[Graph: t2.City — t4.City — t6.City, connected by = edges]
59
Repair algorithms – Possible fixes
Name Zipcode City State Salary Rate
t1 Annie 10001 NY NY 24000 15
t2 Laure 90210 LA CA 25000 10
t3 John 60601 CH IL 40000 25
t4 Mark 90210 SF CA 88000 28
t5 Robert 60827 CH IL 15000 15
t6 Mary 90210 LA CA 81000 28
t7 Jon 60601 CH NY 40000 25
● FD: Zipcode → State:
● t3.State = t7.State
[Graph: t3.State — t7.State, connected by an = edge]
60
Repair algorithms – Possible fixes
Name Zipcode City State Salary Rate
t1 Annie 10001 NY NY 24000 15
t2 Laure 90210 LA CA 25000 10
t3 John 60601 CH IL 40000 25
t4 Mark 90210 SF CA 88000 28
t5 Robert 60827 CH IL 15000 15
t6 Mary 90210 LA CA 81000 28
t7 Jon 60601 CH NY 40000 25
● DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ˄ t1.Rate < t2.Rate):
● t2.Salary > t1.Salary, t2.Rate < t1.Rate
● t2.Salary > t5.Salary, t2.Rate < t5.Rate
[Graph: (t2.Salary, t2.Rate) connected to (t1.Salary, t1.Rate) and (t5.Salary, t5.Rate) by >,< edges]
61
Repair algorithms – Connected components
[Graphs: the possible fixes split into three connected components — {t2.City, t4.City, t6.City} with = edges; {t3.State, t7.State} with an = edge; {(t2.Salary, t2.Rate), (t1.Salary, t1.Rate), (t5.Salary, t5.Rate)} with >,< edges]
62
Repair algorithms – Distributed repair
● Different violations require different repair algorithms:
[Diagram: {t2.City, t4.City, t6.City} and {t3.State, t7.State} → equivalence class algorithm; {(t2.Salary, t2.Rate), (t1.Salary, t1.Rate), (t5.Salary, t5.Rate)} → hyper-graph algorithm]
63
Use Case: RDF example
64
Use Case: RDF example
● There cannot exist two graduate students who are enrolled in two different universities and have the same professor as advisor.
65
Use Case: RDF Example - Input
66
Use Case: RDF Example - Scope
RDF Scope
67
Use Case: RDF Example - Block
RDF Scope Block
68
Use Case: RDF Example - Iterate
RDF Scope Block Iterate
69
Use Case: RDF Example - Block
RDF Scope Block Iterate
Block
70
Use Case: RDF Example - Iterate
RDF Scope Block Iterate
Block Iterate
71
Use Case: RDF Example – Detect, GenFix
RDF Scope Block Iterate
Block Iterate Detect GenFix
72
Use Case: RDF Example – Physical Plan
RDF Scope Block Iterate
Block Iterate Detect GenFix
RDF PScope PBlock PIterate
PBlock → PIterate → PDetect → PGenFix
73
Experiments
74
Datasets
Dataset Type Size Error type
TaxA Synthetic (based on a real dataset) 100K – 40M Typos
TaxB Synthetic 100K – 3M Numerical errors
TPCH Synthetic 100K – 1.9B Typos
Customer1 Real 19M Duplicates
Customer2 Real 32M Duplicates
NCVoters Real 9M Duplicates
HAI Real 166K Typos
75
Systems
● NADEEF: a data cleansing system on a single machine
● PostgreSQL: a database management system
● Shark: a distributed SQL engine based on Hive and Spark
● Spark SQL: a distributed SQL engine based on Spark
● BigDansing: BigDansing-Spark and BigDansing-Hadoop
76
Infrastructure and Systems
● Single machine:
– Dell Precision T7500 with two 64-bit quad-core Intel Xeon X5550, and 58GB RAM
● Cluster:
– 17 Shuttle SH55J2 machines (1 master with 16 workers) equipped with Intel i5 processors with 16GB RAM
77
Experiments – Serial FD
● TaxA dataset:
● FD: Zipcode → City
● FD: Zipcode → State
78
Experiments – Serial DC
● TaxB dataset:
– DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ˄ t1.Rate < t2.Rate)
● OCJoin optimization
79
Experiments – Parallel FD
● TPCH dataset:
● FD: custkey → custAddress
● 16 Workers
80
Experiments – Scalability
● TPCH Dataset:
● FD: custkey → custAddress
● Dataset: 500M rows
81
Points to Remember
● We presented BigDansing, a distributed system for data cleansing.
● Easy to use: no parallel-programming experience required.
● Faster than related systems in our experiments.
● The abstraction is independent of the distributed system environment.
● Supports different physical optimizations for a single logical plan.
● Scales to 1.9B rows; related work handles at most 1M rows.
● Natively supports repair algorithms without modification.
82
Questions?
83
Experiments – Parallel DC
● TaxB Dataset
– DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ˄ t1.Rate < t2.Rate)
● 16 workers
84
Repair Quality