data cleaning

Data Cleaning. Pradeeban Kathiravelu, INESC-ID Lisboa, Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal. Data Quality – Presentation 2, April 7, 2015.

Page 1: Data Cleaning

Data Cleaning

Pradeeban Kathiravelu

INESC-ID Lisboa

Instituto Superior Técnico, Universidade de Lisboa

Lisbon, Portugal

Data Quality – Presentation 2

April 7, 2015

Pradeeban Kathiravelu (IST-ULisboa) Data Cleaning 1 / 22

Page 2: Data Cleaning

Introduction

Introduction

Removal of inconsistencies and errors from original data sets.

Extraction, Transformation, Loading (ETL) and data cleaning tools.

Modeled as graphs of data transformations.

Data integration problem.

Derive structured and clean textual records.

To be able to perform meaningful queries.

Page 3: Data Cleaning

Introduction

Motivation

Explanation for the reasoning behind the cleaning results.

Interactive facilities to tune a data cleaning program.

A language, an execution model, and algorithms.

To express data cleaning specifications declaratively.

To perform the cleaning efficiently.

Data cleaning graph with data quality constraints.

Support for user involvement in data cleaning.

Page 4: Data Cleaning

Introduction

Challenges in Existing Technology

Lack of separation [...].

Lack of data lineage and user interaction facilities.

Lack of logical matching operation.

User-provided criteria.

Non-exhaustive.

Lack of documentation of the matching algorithms.

Lack of user consultation.

Page 5: Data Cleaning

Contributions

AJAX Data Cleaning Framework and Strategy

Separation of Framework:

Logical Level.

Graph of transformations specified in declarative language.

Expressible with SQL99.

Explicit user interaction and stepwise refinement

Using a data lineage mechanism.

Physical Level.

Specific optimization algorithms chosen to implement the transformations.

Notation:

To specify the properties of the approximate matching function.

To select an optimized implementation.

Page 6: Data Cleaning

AJAX

Logical Level

Data Flow Graph: Main constituent of a data cleaning program.

Input and output flows of operators are logically modeled as database relations.
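
A minimal sketch of the idea, assuming a relation is a list of row dictionaries and an operator is a plain function from relations to a relation. The `DataFlowGraph` class and the two operators are hypothetical names for illustration; they are not AJAX's actual language or API.

```python
from typing import Callable, Dict, List

Relation = List[dict]  # assumption: a relation is a list of row dictionaries


class DataFlowGraph:
    """Hypothetical container: operators run in the order they were added."""

    def __init__(self) -> None:
        self.steps: List[tuple] = []

    def add(self, output: str, inputs: List[str], op: Callable[..., Relation]) -> None:
        # Each operator reads named input relations and writes one named output.
        self.steps.append((output, inputs, op))

    def run(self, sources: Dict[str, Relation]) -> Dict[str, Relation]:
        relations = dict(sources)
        for output, inputs, op in self.steps:
            relations[output] = op(*(relations[name] for name in inputs))
        return relations


def normalize_titles(rows: Relation) -> Relation:
    # Lower-case titles and collapse runs of whitespace.
    return [{**r, "title": " ".join(r["title"].lower().split())} for r in rows]


def drop_empty_titles(rows: Relation) -> Relation:
    return [r for r in rows if r["title"]]


graph = DataFlowGraph()
graph.add("normalized", ["raw"], normalize_titles)
graph.add("clean", ["normalized"], drop_empty_titles)
result = graph.run({"raw": [{"title": "  Data   Cleaning "}, {"title": "  "}]})
```

Every intermediate relation stays available in `result`, which is what makes it possible to inspect the output of any single transformation.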

Page 7: Data Cleaning

AJAX

Framework for the bibliographic references

Page 8: Data Cleaning

Data Cleaning Strategy

1. Add a key to every input record
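
As a rough illustration of this step, the sketch below assigns a sequential key to each record so that later transformations can refer back to the original input. The function name `add_keys` and the field name `id` are assumptions, not taken from the slides.

```python
def add_keys(records, key_field="id"):
    """Return copies of the records, each with a sequential key starting at 1."""
    return [{key_field: i, **rec} for i, rec in enumerate(records, start=1)]


keyed = add_keys([
    {"raw": "Galhardas H., AJAX, 2000"},
    {"raw": "H. Galhardas. AJAX. 2000"},
])
```

The key gives each tuple a stable identity, so a later exception or a user correction can be traced to the record it came from.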

Page 9: Data Cleaning

Data Cleaning Strategy

2. Extract from each input record ..

Page 10: Data Cleaning

Data Cleaning Strategy

3. Extract from each input record ..

Page 11: Data Cleaning

Data Cleaning Strategy

4. Duplicate Elimination

Page 12: Data Cleaning

Data Cleaning Strategy

5. Aggregation

Page 13: Data Cleaning

Data Cleaning Strategy

Exception Handling

External functions written in a 3GL (third-generation language) such as Java.

Exceptions autogenerated by the external functions.

Mark tuples that cannot be automatically handled by an operator.

Data lineage mechanism enables user inspection of exceptions.

Corrected data re-integrated into the data flow graph.
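
The exception flow above can be sketched as follows. This is an illustration under assumptions: `apply_with_exceptions`, `parse_year`, and the `corrections` dictionary are hypothetical, and Python stands in for the external 3GL functions.

```python
def apply_with_exceptions(rows, func):
    """Apply func to every row; rows on which func raises are marked, not dropped."""
    ok, exceptions = [], []
    for row in rows:
        try:
            ok.append(func(row))
        except ValueError as err:
            # Keep the offending tuple together with the reason, for user inspection.
            exceptions.append({"row": row, "reason": str(err)})
    return ok, exceptions


def parse_year(row):
    # Hypothetical external function: raises ValueError on an unparsable year.
    return {**row, "year": int(row["year"])}


rows = [{"id": 1, "year": "2001"}, {"id": 2, "year": "two thousand"}]
clean, pending = apply_with_exceptions(rows, parse_year)

# A user inspects the pending tuples (found through their key) and supplies fixes.
corrections = {2: {"year": "2000"}}
repaired = [
    {**p["row"], **corrections[p["row"]["id"]]}
    for p in pending
    if p["row"]["id"] in corrections
]

# Corrected tuples go back through the same operator and rejoin the flow.
reclean, still_pending = apply_with_exceptions(repaired, parse_year)
clean.extend(reclean)
```

The operator never halts on a bad tuple; it sets the tuple aside, and the rest of the relation continues through the graph.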

Page 14: Data Cleaning

Specification Language

Logical Operators

Arbitrary clustering operations.

More general than the SQL group-by.

Merging operator with user defined aggregation functions.

Not expressible in SQL99.
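
To make the contrast with group-by concrete: group-by partitions rows by equal key values, whereas a clustering operator can group records by any criterion. The sketch below uses the transitive closure of matched pairs as one possible clustering, followed by a merge with a user-defined aggregation. All function names are illustrative assumptions, not AJAX operators.

```python
from collections import defaultdict


def cluster_by_transitive_closure(ids, matched_pairs):
    """Group ids so that any two ids linked by a chain of matches share a cluster."""
    parent = {i: i for i in ids}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(list)
    for i in ids:
        clusters[find(i)].append(i)
    return sorted(clusters.values())


def merge(clusters, records, aggregate):
    """Collapse each cluster to one record using a user-defined aggregation."""
    return [aggregate([records[i] for i in cluster]) for cluster in clusters]


def longest_title(group):
    # Example aggregation: keep the record with the most complete title.
    return max(group, key=lambda r: len(r["title"]))


records = {
    1: {"title": "AJAX"},
    2: {"title": "AJAX: an extensible data cleaning tool"},
    3: {"title": "Declarative data cleaning"},
}
clusters = cluster_by_transitive_closure(list(records), [(1, 2)])
merged = merge(clusters, records, longest_title)
```

Records 1 and 2 share no equal attribute value, so a plain group-by would not put them together; the clustering does so because a matching step paired them.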

Page 15: Data Cleaning

Specification Language

Implementation of Matching

Optimization Problem.

Pre-select the elements of the Cartesian product.

Allows false matches.

No false dismissals.

Cheap to compute.

Approximate method to compare a limited number of records.

With good expected probability.

Distance-filtering optimization.

Approximate Methods.

Multi-pass neighborhood method (MPN).

Choose a key.

Compare the results within a fixed window.
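
Both ideas on this slide can be sketched in a few lines, assuming edit distance as the matching function. The length filter relies on a known property: two strings whose lengths differ by more than k cannot be within edit distance k, so the filter is cheap, may let non-matches through, and never dismisses a true match. The multi-pass function is a simplified windowed comparison over several sort keys. Function names and parameters are assumptions, not the AJAX implementation.

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def match_with_length_filter(names, max_dist):
    """Exhaustive pairwise matching, skipping pairs the length filter rules out."""
    matches = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(len(a) - len(b)) > max_dist:
                continue  # cannot be within max_dist, so no false dismissal
            if edit_distance(a, b) <= max_dist:
                matches.append((a, b))
    return matches


def multi_pass_neighborhood(names, key_funcs, window, max_dist):
    """One pass per key: sort by the key, compare records inside a sliding window."""
    matches = set()
    for key in key_funcs:
        ordered = sorted(names, key=key)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1 : i + window]:
                if edit_distance(a, b) <= max_dist:
                    matches.add(tuple(sorted((a, b))))
    return matches


names = ["galhardas", "galhadras", "shasha", "sasha", "simon"]
exhaustive = match_with_length_filter(names, max_dist=1)
windowed = multi_pass_neighborhood(
    names, key_funcs=[lambda s: s, lambda s: s[::-1]], window=2, max_dist=1
)
```

The windowed method compares far fewer pairs than the Cartesian product, but a true match is missed whenever no chosen key sorts the two records close together. That is why it is approximate.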

Page 16: Data Cleaning

User Involvement in Data Cleaning

Manual Data Repair (MDR) in Data Cleaning Graph (DCG)

Page 17: Data Cleaning

User Involvement in Data Cleaning

Case Study

Goal:

Clean the Pub table and produce a table containing only the publications authored by at least one team member.

Duplicate entries for each publication organized in clusters.

Process:

Extract the author names.

Independently of the publication they are associated to.

Match author names against the names stored in the Team table.

Try to find synonyms.

Build the list of co-authors for each author.

Remove those publications that are not authored by any team member.

Detect and cluster approximate duplicate publication records.
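
A rough sketch of the first part of this process (extract, match, filter), assuming a `Pub` table whose `authors` field is a semicolon-separated string and a `Team` table that is a list of names. The sample rows, the field names, and the 0.85 similarity cutoff are illustrative assumptions. Approximate name matching is done here with the standard library's `difflib.get_close_matches`, a swapped-in technique, not the matching used in the case study. Co-author lists and duplicate clustering are omitted.

```python
from difflib import get_close_matches

team = ["Helena Galhardas", "Antonia Lopes"]
pubs = [
    {"id": 1, "authors": "Helena Galhardas; Daniela Florescu", "title": "AJAX"},
    {"id": 2, "authors": "Dennis Shasha", "title": "Tuning"},
    {"id": 3, "authors": "Helena Galhards; Emanuel Santos", "title": "User involvement"},
]


def split_authors(pub):
    return [name.strip() for name in pub["authors"].split(";")]


def extract_authors(pubs):
    """Distinct author names, independent of the publication they appear in."""
    return sorted({name for p in pubs for name in split_authors(p)})


def match_to_team(authors, team, cutoff=0.85):
    """Map each author name to the closest team name, if one is similar enough."""
    mapping = {}
    for name in authors:
        hit = get_close_matches(name, team, n=1, cutoff=cutoff)
        if hit:
            mapping[name] = hit[0]
    return mapping


def team_publications(pubs, mapping):
    """Keep only publications with at least one author matched to a team member."""
    return [p for p in pubs if any(n in mapping for n in split_authors(p))]


mapping = match_to_team(extract_authors(pubs), team)
kept = team_publications(pubs, mapping)
```

The misspelled "Helena Galhards" still maps to the team member. This is the kind of borderline match a user may want to confirm or reject by hand.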

Page 18: Data Cleaning

User Involvement in Data Cleaning

Quality Constraints and Manual Data Repairs

Page 19: Data Cleaning

Evaluation

Experiments

Executed with AJAX framework.

Multi-pass neighborhood method (MPN) vs. Neighborhood join (NJ).

Experimental Results:

Addressing user feedback may significantly improve a data cleaning process.

MPN is faster, but less accurate, than NJ.
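
The slides do not say how accuracy was measured. One common way to compare two matching methods, assumed here only for illustration, is precision and recall against a reference set of true duplicate pairs. The pair sets below are toy data, not the experimental results.

```python
def precision_recall(found, truth):
    """Precision and recall of a set of found pairs against the true pairs."""
    found, truth = set(found), set(truth)
    true_positives = len(found & truth)
    precision = true_positives / len(found) if found else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall


# Toy data: pairs of record keys judged to be duplicates.
truth = {(1, 2), (3, 4), (5, 6)}
method_a = {(1, 2), (3, 4), (5, 6), (7, 8)}  # finds everything, plus one false match
method_b = {(1, 2), (3, 4)}                  # misses one true pair

scores_a = precision_recall(method_a, truth)
scores_b = precision_recall(method_b, truth)
```

A method that compares fewer pairs can only lose recall relative to one that compares more with the same matching function, which is consistent with a faster but less accurate result.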

Page 20: Data Cleaning

Evaluation

Related Work

High level languages for data transformations.

SQL99, WHIRL’s SQL, SchemaSQL.

Lack of support for clustering and merging, and less optimized.

Immediate halt of execution upon exception in SQL.

Highly optimized matching operation in AJAX.

As it is made a first-class operator.

Data Integration and Cleaning Frameworks.

Less scale-up.

Algorithms to support matching, clustering, and merging operations.

Page 21: Data Cleaning

Conclusions

Conclusions

AJAX Framework

Design and implementation of a data flow graph.

Quality heuristics for best accuracy.

Effectively and efficiently generate clean data.

Design of performance heuristics.

Execution speed of transformations.

User involvement is crucial in data cleaning.

Thank you!

Page 22: Data Cleaning

Conclusions

References

Galhardas, H., Lopes, A., & Santos, E. (2011). Support for user involvement in data cleaning. In Data Warehousing and Knowledge Discovery (pp. 136-151). Springer Berlin Heidelberg.

Galhardas, H., Florescu, D., Shasha, D., & Simon, E. (2000, May). AJAX: an extensible data cleaning tool. In ACM Sigmod Record (Vol. 29, No. 2, p. 590). ACM.

Galhardas, H., Florescu, D., Shasha, D., Simon, E., & Saita, C. (2001). Declarative data cleaning: Language, model, and algorithms.

“Precisionrecall” by Walber - Own work. Licensed under CC BY-SA 4.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Precisionrecall.svg#/media/File:Precisionrecall.svg
