BigDansing presentation slides for KAUST
TRANSCRIPT
BigDansing: A BigData Cleansing System
By: Zuhair Khayyat
InfoCloud group, Computer, Electrical and Mathematical Sciences and Engineering Division
King Abdullah University of Science and Technology (KAUST)
2
3
Example of a Dirty Dataset
● Company employee database:
Name Zipcode City State Salary Rate
t1 Annie 10001 NY NY 24000 15
t2 Laure 90210 LA CA 25000 10
t3 John 60601 CH IL 40000 25
t4 Mark 90210 SF CA 88000 28
t5 Robert 60827 CH IL 15000 15
t6 Mary 90210 LA CA 81000 28
t7 Jon 60601 CH IL 40000 25
4
Example of a Dirty Dataset
● Company employee database:
– Business rule: Any two employees in the same Zipcode must be in the same City.
Name Zipcode City State Salary Rate
t1 Annie 10001 NY NY 24000 15
t2 Laure 90210 LA CA 25000 10
t3 John 60601 CH IL 40000 25
t4 Mark 90210 SF CA 88000 28
t5 Robert 60827 CH IL 15000 15
t6 Mary 90210 LA CA 81000 28
t7 Jon 60601 CH IL 40000 25
6
Example of a Dirty Dataset
● Company employee database:
– Business rule: Any two employees in the same Zipcode must be in the same City. (Here t4's City has been corrected from SF to LA.)
Name Zipcode City State Salary Rate
t1 Annie 10001 NY NY 24000 15
t2 Laure 90210 LA CA 25000 10
t3 John 60601 CH IL 40000 25
t4 Mark 90210 LA CA 88000 28
t5 Robert 60827 CH IL 15000 15
t6 Mary 90210 LA CA 81000 28
t7 Jon 60601 CH IL 40000 25
7
Is Dirty Data a Real Problem?
● “Software expert Hollis Tibbets, the Global Director of Marketing at Dell, estimates that duplicate data and bad data combined cost the U.S. economy over $3 trillion every year”
● “duplicate and dirty data costs the healthcare industry over $300 billion every year.”
● Lost revenue, data repair costs.
– By: Joe Fusaro in www.ringlead.com/blog/dirty-data-costs-economy-3-trillion
8
Is Dirty Data a Real Problem?
● “New research from Experian Data Quality shows that inaccurate data has a direct impact on the bottom line of 88% of companies, with the average company losing 12% of its revenue”
– By: Ben Davis in
https://econsultancy.com/blog/64612-the-cost-of-bad-data-stats/
9
What is Data Cleansing?
● Detecting and correcting corrupt or inaccurate records in a record set, table, or database.
● 25% of the world's critical data is dirty:
– Typos, duplicates, outdated data, missing values
● Dirty data sources:
– Data entry errors
– Data update errors
– Data transmission errors
– Bugs in data processing tools
10
How to Detect Dirty Data?
● Dirty data is detected by declarative rules:
– A formal way to express dirty data.
11
How to Detect Dirty Data?
● Dirty data is detected by declarative rules:
– A formal way to express dirty data.
– Functional dependencies (FD):
● A constraint between two sets of attributes in a relation
● Example: Zipcode → City
12
How to Detect Dirty Data?
● Dirty data is detected by declarative rules:
– A formal way to express dirty data.
– Functional dependencies (FD):
● A constraint between two sets of attributes in a relation
● Example: Zipcode → City
– Conditional functional dependencies (CFD):
● Country = 'Saudi Arabia', Zipcode → City
13
How to Detect Dirty Data?
● Dirty data is detected by declarative rules:
– A formal way to express dirty data.
– Functional dependencies (FD),
– Conditional functional dependency (CFD),
– Denial constraints (DC):
● A set of boolean conditions that must not all be satisfied
● DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ˄ t1.Rate < t2.Rate)
– There cannot exist two tuples in relation D where t1's salary is greater than t2's salary and t1's rate is less than t2's rate.
14
How to Detect Dirty Data?
● Dirty data is detected by declarative rules:
– A formal way to express dirty data.
– Functional dependencies (FD)
– Conditional functional dependency (CFD),
– Denial constraints (DC).
● Or, user-defined functions (UDF):
– Duplicates, statistical errors
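As a concrete illustration (not BigDansing code: the table and rules come from the slides, the function names are made up), checking the FD and the DC on the example relation amounts to a pairwise scan:

```python
from itertools import combinations

# Employee tuples from the slides: (name, zipcode, city, state, salary, rate)
D = [
    ("Annie",  "10001", "NY", "NY", 24000, 15),
    ("Laure",  "90210", "LA", "CA", 25000, 10),
    ("John",   "60601", "CH", "IL", 40000, 25),
    ("Mark",   "90210", "SF", "CA", 88000, 28),
    ("Robert", "60827", "CH", "IL", 15000, 15),
    ("Mary",   "90210", "LA", "CA", 81000, 28),
    ("Jon",    "60601", "CH", "IL", 40000, 25),
]

def fd_violations(tuples):
    """FD Zipcode -> City: equal zipcodes must imply equal cities."""
    return [(a, b) for a, b in combinations(tuples, 2)
            if a[1] == b[1] and a[2] != b[2]]

def dc_violations(tuples):
    """DC: no pair where one salary is higher but its rate is lower."""
    return [(a, b) for a, b in combinations(tuples, 2)
            if (a[4] > b[4] and a[5] < b[5]) or (b[4] > a[4] and b[5] < a[5])]

print(len(fd_violations(D)))  # 2: Mark (SF) conflicts with Laure and Mary (LA)
print(len(dc_violations(D)))  # 2: Laure out-earns Annie and Robert at a lower rate
```

The quadratic pairwise scan is exactly what BigDansing's operators and optimizations later avoid at scale.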
15
Example of a Dirty Dataset
Name Zipcode City State Salary Rate
t1 Annie 10001 NY NY 24000 15
t2 Laure 90210 LA CA 25000 10
t3 John 60601 CH IL 40000 25
t4 Mark 90210 SF CA 88000 28
t5 Robert 60827 CH IL 15000 15
t6 Mary 90210 LA CA 81000 28
t7 Jon 60601 CH IL 40000 25
● FD: Zipcode → City
● Two tuples sharing the same Zipcode must have the same City name.
16
Example of a Dirty Dataset
Name Zipcode City State Salary Rate
t1 Annie 10001 NY NY 24000 15
t2 Laure 90210 LA CA 25000 10
t3 John 60601 CH IL 40000 25
t4 Mark 90210 SF CA 88000 28
t5 Robert 60827 CH IL 15000 15
t6 Mary 90210 LA CA 81000 28
t7 Jon 60601 CH IL 40000 25
● DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ˄ t1.Rate < t2.Rate)
● There cannot exist two tuples in relation D where t1's salary is greater than t2's salary and t1's rate is less than t2's rate.
17
Example of a Dirty Dataset
Name Zipcode City State Salary Rate
t1 Annie 10001 NY NY 24000 15
t2 Laure 90210 LA CA 25000 10
t3 John 60601 CH IL 40000 25
t4 Mark 90210 SF CA 88000 28
t5 Robert 60827 CH IL 15000 15
t6 Mary 90210 LA CA 81000 28
t7 Jon 60601 CH IL 40000 25
● DC: ∀ t1, t2 ∈ D, ¬(simF(t1.name,t2.name) ˄ t1.city = t2.city)
● There cannot exist two tuples in relation D that have similar names and live in the same city.
18
Data Cleansing Process
Dirty data
Quality rules
22
Data Cleansing Process
[Diagram: Dirty data + Quality rules → Detect → Repair → Update → Clean data; at most 10% errors]
● Detect (90% of the runtime) – researchers use:
– Naive code
– DBMS
● Repair (90% of the research) – researchers target:
– Better quality
– Fewer iterations, to reduce runtime
23
A Big Data Challenge
● How to run data cleansing at large scale?
● How to support known declarative rules as well as UDFs?
24
BigDansing architecture
26
BigDansing
✔ Generic abstraction:
– Supports rule-based detection: FD, CFD, DC
– Supports UDF-based detection
– Easy to use, automatic parallelization
– Separates logical from physical operators:
● System independent
● Provides multiple physical optimizations
✔ Fast and scalable detection, repair, and updates:
– 1.9B rows → 13B violations in under 3 hours on 16 small machines
– Related work handles at most 1M rows, on a single machine
27
Abstraction
28
BigDansing Semantics
● The input dataset is represented as a set of data units.
● Each data unit “U”:
– A single row in relational data
– A single triple in RDF data
– A single article in Wikipedia
29
Logical Operators
● Quality rules are represented by five logical operators.
● BigDansing automatically translates declarative rules into logical operators.
● Users are free to implement their own logic using logical operators.
● Fundamental operators: the minimum needed to represent a large set of data quality rules.
[Diagram: a data cleansing job over a database; Detect and GenFix are fundamental operators, while Scope, Block, and Iterate are optional]
30
BigDansing Semantics - Scope
● Scope:
– Input: data units
– Output: data units
● Example: Zipcode → City
– Input:
● t1 – t7
– Output:
● t1 – t7, projected on (Zipcode, City)
Name Zipcode City
t1 Annie 10001 NY
t2 Laure 90210 LA
t3 John 60601 CH
t4 Mark 90210 SF
t5 Robert 60827 CH
t6 Mary 90210 LA
Zipcode City
t1 10001 NY
t2 90210 LA
t3 60601 CH
t4 90210 SF
t5 60827 CH
t6 90210 LA
31
BigDansing Semantics - Block
● Block:
– Input: data unit
– Output: grouping key
● Example: Zipcode → City
– Input:
● t1 – t7
– Output:
● <10001, t1>, <90210, (t2,t4,t6)>, <60601, t3>, <60827, t5>
Zipcode City
t1 10001 NY
t2 90210 LA
t3 60601 CH
t4 90210 SF
t5 60827 CH
t6 90210 LA
32
BigDansing Semantics - Iterate
● Iterate:
– Input: a group of data units
– Output: single tuple, tuple pair
● Example: Zipcode → City
– Input:
● <10001, t1>, <90210, (t2,t4,t6)>, <60601, t3>, <60827, t5>
– Output:
● <t2,t4>, <t2,t6>, <t4,t6>
Zipcode City
t1 10001 NY
t2 90210 LA
t3 60601 CH
t4 90210 SF
t5 60827 CH
t6 90210 LA
33
BigDansing Semantics - Detect
● Detect:
– Input: data units
– Output: Violation(s)
● Example: Zipcode → City
– Input:
● <t2,t4>, <t2,t6>, <t4,t6>
– Output:
● (t2.City ≠ t4.City), (t4.City ≠ t6.City)
Zipcode City
t1 10001 NY
t2 90210 LA
t3 60601 CH
t4 90210 SF
t5 60827 CH
t6 90210 LA
34
Semantics - GenFix
● GenFix:
– Input: Violation
– Output: possible fix(es)
● Example: Zipcode → City
– Input:
● (t2.City ≠ t4.City), (t4.City ≠ t6.City)
– Output:
● (t2.City = t4.City), (t4.City = t6.City)
Zipcode City
t1 10001 NY
t2 90210 LA
t3 60601 CH
t4 90210 SF
t5 60827 CH
t6 90210 LA
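Putting the five operators together for the FD Zipcode → City, the detection pipeline can be sketched as plain functions over the example data (illustrative only; BigDansing's actual operators are system-level, parallel constructs):

```python
from itertools import combinations
from collections import defaultdict

# Data units after Scope (only Zipcode and City kept); names are illustrative.
rows = {"t1": ("10001", "NY"), "t2": ("90210", "LA"), "t3": ("60601", "CH"),
        "t4": ("90210", "SF"), "t5": ("60827", "CH"), "t6": ("90210", "LA")}

def block(units):                    # Block: group data units by Zipcode
    groups = defaultdict(list)
    for tid, (zipc, city) in units.items():
        groups[zipc].append((tid, city))
    return groups

def iterate(groups):                 # Iterate: emit candidate tuple pairs
    for group in groups.values():
        yield from combinations(group, 2)

def detect(pair):                    # Detect: violation if the cities differ
    (_, c1), (_, c2) = pair
    return c1 != c2

def genfix(pair):                    # GenFix: propose equating the cities
    (t1, _), (t2, _) = pair
    return f"{t1}.City = {t2}.City"

fixes = [genfix(p) for p in iterate(block(rows)) if detect(p)]
print(fixes)  # ['t2.City = t4.City', 't4.City = t6.City']
```

Note how Block keeps Iterate from enumerating all pairs: only tuples sharing a Zipcode are ever compared.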
35
Logical Planning
36
Logical Planning
● A logical plan defines the data unit flow.
● Validating the plan:
– At least one input dataset
– For UDFs: at least one Detect
– For rules: at least one rule
● Supports simple and bushy plans
37
Logical Planning – FD example
● FD: Zipcode → City
● Operators:
– Scope(Zipcode,City)
– Block(Zipcode)
– Iterate(n2)
– Detect(tx.City ≠ ty.City)
– GenFix(tx.City = ty.City)
Dataset Scope Block Iterate Detect GenFix
38
Logical Planning – DC example
● DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ˄ t1.Rate < t2.Rate)
● There cannot exist two tuples in relation D where t1's salary is greater than t2's salary and t1's rate is less than t2's rate.
● Operators:
– Scope(Salary,Rate)
– Detect(tx.Salary > ty.Salary AND tx.Rate < ty.Rate)
– GenFix(tx.Salary <= ty.Salary OR tx.Rate >= ty.Rate)
Dataset Scope Detect GenFix
39
Logical Planning – UDF only example
● Dataset: Temperature sensors dataset
● Rule: There cannot exist a tuple in dataset D whose value differs from the average by more than 5º.
Sensor ID Room Temp
t1 1 Bedroom 36.6º
t2 2 Roof 40º
t3 3 Bedroom 35.2º
t4 4 Bedroom 43.1º
t5 5 Bedroom 33.5º
40
Logical Planning – UDF only example
● Dataset: Temperature sensors dataset
● Rule: There cannot exist a tuple in dataset D whose value differs from the average by more than 5º.
● Operators:
– Scope(Room,Temp)
– Block(Room)
– Iterate(Average the list,tx)
– Detect(tx.temp < avg-c OR tx.temp > avg+c)
– GenFix(tx.temp >= avg-c AND tx.temp <= avg+c)
Sensor ID
Room Temp
t1 1 Bedroom 36.6º
t2 2 Roof 40º
t3 3 Bedroom 35.2º
t4 4 Bedroom 43.1º
t5 5 Bedroom 33.5º
Dataset Scope Block Iterate Detect GenFix
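A sequential sketch of this UDF plan on the slide's data (assumed reading: Block groups by Room, Iterate computes the per-room average, and c = 5º; not BigDansing's API):

```python
from collections import defaultdict

# Sensor readings from the slide: (sensor_id, room, temp); constant c = 5º.
readings = [(1, "Bedroom", 36.6), (2, "Roof", 40.0), (3, "Bedroom", 35.2),
            (4, "Bedroom", 43.1), (5, "Bedroom", 33.5)]
c = 5.0

by_room = defaultdict(list)              # Block(Room)
for sid, room, temp in readings:
    by_room[room].append((sid, temp))

violations = []
for room, group in by_room.items():      # Iterate: per-room average
    avg = sum(t for _, t in group) / len(group)
    for sid, temp in group:              # Detect: outside [avg - c, avg + c]
        if temp < avg - c or temp > avg + c:
            violations.append((room, sid, temp))

print(violations)  # sensor 4 (43.1º) is more than 5º above the Bedroom average
```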
42
Logical Plans – Bushy plan
● C1, C2 and C3 are denial constraints from the ICDE 2013 paper:
● "Holistic Data Cleaning: Putting Violations Into Context"
43
Physical Plans
44
Physical Plans
● Physical operators are system specific
– MPI, Hadoop, Spark
● Each physical operator is an independent execution unit.
● Each logical operator → one physical operator.
● BigDansing consolidates logical plans to improve I/O.
● More physical operators can be added with different optimizations to improve logical plans.
45
Physical Plans - Plan consolidation
● Plan consolidation is a static logical plan optimization.
● BigDansing consolidates two similar logical operators if they share the same input.
46
Physical Plans – Physical translation
● FD: Zipcode → City
Dataset Scope Block Iterate Detect GenFix
Dataset PScope PBlock PIterate PDetect PGenFix
47
Physical Plans - Physical translation
● FD: Zipcode → City
Dataset Scope Block Iterate Detect GenFix
Dataset PScope PBlock PIterate PDetect PGenFix
Dataset PScope PBlock PIterate → PDetect → PGenFix
48
Physical Plans – Physical translation
● DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ˄ t1.Rate < t2.Rate)
● There can't exists two tuples where the salary of t1 is greater than t2's salary and the Rate of t1 is less than the t2's rate.
Dataset → Scope → Detect → GenFix
Dataset → PScope → CrossProduct → PDetect → PGenFix
49
Physical Plans – Physical translation
● DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ˄ t1.Rate < t2.Rate)
● There can't exists two tuples where the salary of t1 is greater than t2's salary and the Rate of t1 is less than the t2's rate.
Dataset → Scope → Detect → GenFix
Dataset → PScope → CrossProduct → PDetect → PGenFix
Dataset → PScope → UCrossProduct → PDetect → PGenFix
50
Physical Plans – Physical translation
● DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ˄ t1.Rate < t2.Rate)
● There can't exists two tuples where the salary of t1 is greater than t2's salary and the Rate of t1 is less than the t2's rate.
Dataset → Scope → Detect → GenFix
Dataset → PScope → CrossProduct → PDetect → PGenFix
Dataset → PScope → UCrossProduct → PDetect → PGenFix
Dataset → PScope → OCJoin (distributed sort-merge join) → PDetect → PGenFix
51
Experiments – OCJoin vs. Others
● TaxB dataset: DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ˄ t1.Rate < t2.Rate)
● 16 workers
52
OCJoin Physical Operator
● A self-join on one or more ordering comparisons:
– (<, >, ≥, ≤)
● Reduces the complexity of the cross product by shrinking the search space.
● Steps:
– Partitioning into blocks
– Sorting the blocks
– Pruning
– Joining
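A minimal single-machine sketch of the OCJoin idea for this DC (only the sort-and-scan core; the partitioning into blocks and the block-level pruning of the real distributed operator are omitted):

```python
def ocjoin(tuples):
    """Self-join on ordering comparisons: find pairs (hi, lo) with
    hi.Salary > lo.Salary and hi.Rate < lo.Rate.
    tuples: list of (tid, salary, rate)."""
    ordered = sorted(tuples, key=lambda t: t[1])    # sort once on Salary
    out = []
    for i, (tid_lo, sal_lo, rate_lo) in enumerate(ordered):
        # After sorting, only tuples to the right can have a larger Salary.
        for tid_hi, sal_hi, rate_hi in ordered[i + 1:]:
            if sal_hi <= sal_lo:                    # equal salaries: skip
                continue
            if rate_hi < rate_lo:                   # check the other ordering
                out.append((tid_hi, tid_lo))
    return out

pairs = ocjoin([("t1", 24000, 15), ("t2", 25000, 10), ("t3", 40000, 25),
                ("t4", 88000, 28), ("t5", 15000, 15), ("t6", 81000, 28),
                ("t7", 40000, 25)])
print(pairs)  # [('t2', 't5'), ('t2', 't1')]
```

Sorting lets the scan skip half of the comparisons up front; the distributed operator adds sorted partitions whose value ranges can be pruned against each other.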
53
Repair Algorithms
54
Repair Algorithms – Basics
● BigDansing supports most serial repair algorithms.
● BigDansing exploits the nature of violations:
– Different violations are independent.
● The repair is parallelized by running different instances of the repair algorithm on independent violations.
● We implement two serial repair algorithms to run in distributed mode:
– Equivalence class algorithm
– Hypergraph algorithm
55
Repair Algorithms – Steps
● Connected components → identify independent fixes.
● Each connected component → instance of repair algorithm.
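One way to find those independent components is union-find over the cells that the possible fixes connect (an illustrative sketch, not BigDansing's implementation):

```python
from collections import defaultdict

# Union-find over cells; each resulting connected component is handed to an
# independent instance of the repair algorithm.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# Possible fixes from the running example, as edges between cells:
edges = [("t2.City", "t4.City"), ("t4.City", "t6.City"),
         ("t3.State", "t7.State"),
         ("t2.Salary", "t1.Salary"), ("t2.Salary", "t5.Salary")]
for a, b in edges:
    union(a, b)

components = defaultdict(set)
for cell in list(parent):
    components[find(cell)].add(cell)
print(len(components))  # 3 independent repair instances
```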
56
Equivalence Class Algorithm
● Fix errors based on (=,≠)
● Based on heuristics:
– Partition the possible fixes into different groups
– Assign the most frequent value to each group
● Example:
– Group 1: Zipcode = 60601
● Most frequent City = CH
– Group 2: Zipcode = 90210
● Most frequent City = LA
Name Zipcode City
t1 Annie 60601 NY
t2 Laure 90210 LA
t3 John 60601 CH
t4 Mark 90210 SF
t5 Robert 60601 CH
t6 Mary 90210 LA
t7 Jon 60601 CH
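The heuristic above, applied to the slide's table, reduces to a frequency count per equivalence class (a minimal sketch; names are illustrative):

```python
from collections import Counter, defaultdict

# Tuples from the slide: (tid, zipcode, city)
rows = [("t1", "60601", "NY"), ("t2", "90210", "LA"), ("t3", "60601", "CH"),
        ("t4", "90210", "SF"), ("t5", "60601", "CH"), ("t6", "90210", "LA"),
        ("t7", "60601", "CH")]

# One equivalence class of City cells per Zipcode; repair each class by
# assigning its most frequent value.
classes = defaultdict(list)
for _, zipc, city in rows:
    classes[zipc].append(city)

repair = {zipc: Counter(cities).most_common(1)[0][0]
          for zipc, cities in classes.items()}
print(repair)  # {'60601': 'CH', '90210': 'LA'}
```

So t1's City becomes CH and t4's becomes LA, the majority values of their classes.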
57
Hyper-Graph algorithm
● Fix errors based on (<,>,≤, and ≥).
● Based on linear optimization and greedy MVC:
– Select the hyper-graph node with the most edges
– Change its value according to the edge conditions
[Hypergraph: the Salary and Rate cells of t1–t7, connected by >,< violation edges]
Name Salary Rate
t1 Annie 24000 15
t2 Laure 25000 10
t3 John 40000 25
t4 Mark 88000 24
t5 Robert 15000 15
t6 Mary 81000 28
t7 Jon 40000 25
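The greedy part can be sketched as a vertex-cover loop (illustrative only; the real algorithm additionally solves the edge conditions, e.g. via linear optimization, to choose the new value):

```python
def greedy_cover(violations):
    """Greedy minimum vertex cover: repeatedly repair the cell that
    appears in the most unresolved violations.
    violations: list of sets of cells (hyper-edges)."""
    remaining = [set(v) for v in violations]
    cover = []
    while remaining:
        counts = {}
        for v in remaining:
            for cell in v:
                counts[cell] = counts.get(cell, 0) + 1
        best = max(counts, key=counts.get)   # node with the most edges
        cover.append(best)
        remaining = [v for v in remaining if best not in v]
    return cover

viols = [{"t2.Salary", "t1.Salary"}, {"t2.Salary", "t5.Salary"}]
print(greedy_cover(viols))  # ['t2.Salary']: changing t2 resolves both
```

Picking the highest-degree cell first means one value change can resolve several violations at once.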
58
Repair algorithms – Possible fixes
Name Zipcode City State Salary Rate
t1 Annie 10001 NY NY 24000 15
t2 Laure 90210 LA CA 25000 10
t3 John 60601 CH IL 40000 25
t4 Mark 90210 SF CA 88000 28
t5 Robert 60827 CH IL 15000 15
t6 Mary 90210 LA CA 81000 28
t7 Jon 60601 CH NY 40000 25
● FD: Zipcode → City:
● t2.City = t4.City
● t4.City = t6.City
[Graph: t2.City — t4.City — t6.City, connected by = edges]
59
Repair algorithms – Possible fixes
Name Zipcode City State Salary Rate
t1 Annie 10001 NY NY 24000 15
t2 Laure 90210 LA CA 25000 10
t3 John 60601 CH IL 40000 25
t4 Mark 90210 SF CA 88000 28
t5 Robert 60827 CH IL 15000 15
t6 Mary 90210 LA CA 81000 28
t7 Jon 60601 CH NY 40000 25
● FD: Zipcode → State:
● t3.State = t7.State
[Graph: t3.State — t7.State, connected by an = edge]
60
Repair algorithms – Possible fixes
Name Zipcode City State Salary Rate
t1 Annie 10001 NY NY 24000 15
t2 Laure 90210 LA CA 25000 10
t3 John 60601 CH IL 40000 25
t4 Mark 90210 SF CA 88000 28
t5 Robert 60827 CH IL 15000 15
t6 Mary 90210 LA CA 81000 28
t7 Jon 60601 CH NY 40000 25
● DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ˄ t1.Rate < t2.Rate):
● t2.Salary > t1.Salary, t2.Rate < t1.Rate
● t2.Salary > t5.Salary, t2.Rate < t5.Rate
[Graph: (t2.Salary, t2.Rate) connected to (t1.Salary, t1.Rate) and (t5.Salary, t5.Rate) by >,< edges]
61
Repair algorithms – Connected components
[Graphs: the possible fixes split into three connected components — {t2.City, t4.City, t6.City} with = edges; {t3.State, t7.State} with an = edge; {(t2.Salary, t2.Rate), (t1.Salary, t1.Rate), (t5.Salary, t5.Rate)} with >,< edges]
62
Repair algorithms – Distributed repair
● Different violations require different repair algorithms:
[Diagram: {t2.City, t4.City, t6.City} and {t3.State, t7.State} → equivalence class algorithm; {(t2.Salary, t2.Rate), (t1.Salary, t1.Rate), (t5.Salary, t5.Rate)} → hyper-graph algorithm]
63
Use Case: RDF example
64
Use Case: RDF example
● There cannot exist two graduate students who are enrolled in two different universities and have the same professor as advisor.
65
Use Case: RDF Example - Input
66
Use Case: RDF Example - Scope
RDF Scope
67
Use Case: RDF Example - Block
RDF Scope Block
68
Use Case: RDF Example - Iterate
RDF Scope Block Iterate
69
Use Case: RDF Example - Block
RDF Scope Block Iterate
Block
70
Use Case: RDF Example - Iterate
RDF Scope Block Iterate
Block Iterate
71
Use Case: RDF Example – Detect, GenFix
RDF Scope Block Iterate
Block Iterate Detect GenFix
72
Use Case: RDF Example – Physical Plan
RDF Scope Block Iterate
Block Iterate Detect GenFix
RDF PScope PBlock PIterate
PBlock → PIterate → PDetect → PGenFix
73
Experiments
74
Datasets
Dataset Type Size Error type
TaxA Synthetic (based on a real dataset) 100K – 40M Typos
TaxB Synthetic 100K – 3M Numerical errors
TPCH Synthetic 100K – 1.9B Typos
Customer1 Real 19M Duplicates
Customer2 Real 32M Duplicates
NCVoters Real 9M Duplicates
HAI Real 166K Typos
75
Systems
● NADEEF: a data cleansing system on a single machine
● PostgreSQL: a database management system
● Shark: a distributed SQL engine based on Hive and Spark
● Spark SQL: a distributed SQL engine based on Spark
● BigDansing: BigDansing-Spark and BigDansing-Hadoop
76
Infrastructure and Systems
● Single machine:
– Dell Precision T7500 with two 64-bit quad-core Intel Xeon X5550, and 58GB RAM
● Cluster:
– 17 Shuttle SH55J2 machines (1 master with 16 workers) equipped with Intel i5 processors with 16GB RAM
77
Experiments – Serial FD
● TaxA dataset:
● FD: Zipcode → City
● FD: Zipcode → State
78
Experiments – Serial DC
● TaxB dataset:
– DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ˄ t1.Rate < t2.Rate)
● OCJoin optimization
79
Experiments – Parallel FD
● TPCH dataset:
● FD: custkey → custAddress
● 16 Workers
80
Experiments – Scalability
● TPCH Dataset:
● FD: custkey → custAddress
● Dataset: 500M rows
81
Points to Remember
● We presented BigDansing, a distributed system for data cleansing.
● Easy to use: no parallel-programming experience required.
● Faster than related systems in our experiments.
● The abstraction is independent of the distributed system environment.
● Supports different physical optimizations for a single logical plan.
● Scales to 1.9B rows; related work handles at most 1M rows.
● Natively supports repair algorithms without modification.
82
Questions?
83
Experiments – Parallel DC
● TaxB Dataset
– DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ˄ t1.Rate < t2.Rate)
● 16 workers
84
Repair Quality