gti 2007/08 h.galhardas achieving data quality with ajax (first version of ajax designed and...

33
H.Galhardas GTI 2007/08 Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

Post on 20-Dec-2015

216 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: GTI 2007/08 H.Galhardas Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

H.GalhardasGTI 2007/08

Achieving Data Quality with AJAX

(first version of AJAX designed and developed at INRIA Rocquencourt, France)

Page 2: GTI 2007/08 H.Galhardas Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

H.GalhardasGTI 2007/08

Existing technology• Ad-hoc programs written in a programming language like

C or Java or using an RDBMS proprietary language– Programs difficult to optimize and maintain

• RDBMS mechanisms for guaranteeing integrity constraints– Do not address important data instance problems

• Data transformation scripts using an ETL

(Extraction-Transformation-Loading) or data quality tool

Page 3: GTI 2007/08 H.Galhardas Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

H.GalhardasGTI 2007/08

Problems of data quality solutions (1)

The semantics of some data transformations is defined in terms of their implementation algorithms

App. Domain 1

App. Domain 2

App. Domain 3

Data cleaning transformations

...

Page 4: GTI 2007/08 H.Galhardas Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

H.GalhardasGTI 2007/08

There is a lack of interactive facilities to tune a data cleaning application program

Problems of data quality solutions (2)

Dirty Data

Cleaning process

Clean data Rejected data

Page 5: GTI 2007/08 H.Galhardas Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

H.GalhardasGTI 2007/08

Motivating example (1)

DirtyData(paper:String)

Data Cleaning & Transformation

Events(eventKey, name)

Publications(pubKey, title, eventKey, url, volume, number, pages, city, month, year)

Authors(authorKey, name)

PubsAuthors(pubKey, authorKey)

Page 6: GTI 2007/08 H.Galhardas Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

H.GalhardasGTI 2007/08

Motivating example (2)

[1] Dallan Quass, Ashish Gupta, Inderpal Singh Mumick, and Jennifer Widom. Making Views Self-Maintainable for Data Warehousing. In Proceedings of the Conference on Parallel and Distributed Information Systems. Miami Beach, Florida, USA, 1996[2] D. Quass, A. Gupta, I. Mumick, J. Widom, Making views self-maintianable for data warehousing, PDIS’95

DirtyData

Data Cleaning & Transformation

PDIS | Conference on Parallel and Distributed Information Systems

Events

QGMW96| Making Views Self-Maintainablefor Data Warehousing |PDIS| null | null | null | null | Miami Beach | Florida, USA | 1996

PublicationsAuthors

DQua | Dallan Quass

AGup | Ashish Gupta

JWid | Jennifer Widom…..

QGMW96 | DQua

QGMW96 | AGup….

PubsAuthors

Page 7: GTI 2007/08 H.Galhardas Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

H.GalhardasGTI 2007/08

Modeling a data quality process

A data quality process is modeled by a directed acyclic graph of data transformations

DirtyData

DirtyAuthors

Authors

Duplicate Elimination

Extraction

Standardization

Formatting

DirtyTitles... DirtyEvents

CitiesTags

Page 8: GTI 2007/08 H.Galhardas Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

H.GalhardasGTI 2007/08

AJAX features• An extensible data quality framework

– Logical operators as extensions of relational algebra

– Physical execution algorithms

• A declarative language for logical operators – SQL extension

• A debugger facility for tuning a data cleaning program application– Based on a mechanism of exceptions

Page 9: GTI 2007/08 H.Galhardas Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

H.GalhardasGTI 2007/08

AJAX features• An extensible data quality framework

– Logical operators as extensions of relational algebra

– Physical execution algorithms

• A declarative language for logical operators – SQL extension

• A debugger facility for tuning a data cleaning program application– Based on a mechanism of exceptions

Page 10: GTI 2007/08 H.Galhardas Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

H.GalhardasGTI 2007/08

Logical level: parametric operators

• View: arbitrary SQL query• Map: iterator-based one-to-many mapping with

arbitrary user-defined functions• Match: iterator-based approximate join • Cluster: uses an arbitrary clustering function• Merge: extends SQL group-by with user-defined

aggregate functions• Apply: executes an arbitrary user-defined

algorithm

Map Match

Merge

ClusterView

Apply

Page 11: GTI 2007/08 H.Galhardas Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

H.GalhardasGTI 2007/08

Logical level

DirtyData

DirtyAuthors

Authors

Duplicate Elimination

Extraction

Standardization

Formatting

DirtyTitles...

CitiesTags

Page 12: GTI 2007/08 H.Galhardas Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

H.GalhardasGTI 2007/08

Logical level

DirtyData

DirtyAuthors

Map

Cluster

Match

Merge

Authors

Map

Map

Duplicate Elimination

Extraction

Standardization

Formatting

DirtyTitles...

CitiesTags

DirtyData

DirtyAuthors

TC

NL

Authors

SQL Scan

Java Scan

Physical level

DirtyTitles...

Java Scan

Java Scan

CitiesTags

Page 13: GTI 2007/08 H.Galhardas Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

H.GalhardasGTI 2007/08

Match• Input: 2 relations• Finds data records that correspond to the same

real object• Calls distance functions for comparing field values

and computing the distance between input tuples• Output: 1 relation containing matching tuples and

possibly 1 or 2 relations containing non-matching tuples

Page 14: GTI 2007/08 H.Galhardas Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

H.GalhardasGTI 2007/08

Example

Cluster

Match

Merge

Duplicate Elimination

Authors

DirtyAuthors

MatchAuthors

Page 15: GTI 2007/08 H.Galhardas Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

H.GalhardasGTI 2007/08

ExampleCREATE MATCH MatchDirtyAuthors

FROM DirtyAuthors da1, DirtyAuthors da2

LET distance = editDistance(da1.name, da2.name)

WHERE distance < maxDist

INTO MatchAuthorsCluster

Match

Merge

Duplicate Elimination

Authors

DirtyAuthors

MatchAuthors

Page 16: GTI 2007/08 H.Galhardas Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

H.GalhardasGTI 2007/08

ExampleCREATE MATCH MatchDirtyAuthors

FROM DirtyAuthors da1, DirtyAuthors da2

LET distance = editDistance(da1.name, da2.name)

WHERE distance < maxDist

INTO MatchAuthors

Input:

DirtyAuthors(authorKey, name)861|johann christoph freytag

822|jc freytag

819|j freytag

814|j-c freytag

Output:

MatchAuthors(authorKey1, authorKey2, name1, name2)861|822|johann christoph freytag| jc freytag

822|814|jc freytag|j-c freytag ...

Cluster

Match

Merge

Duplicate Elimination

Authors

DirtyAuthors

MatchAuthors

Page 17: GTI 2007/08 H.Galhardas Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

H.GalhardasGTI 2007/08

Implementation of the match operator

s1 S1, s2 S2

(s1, s2) is a match if

editDistance (s1, s2) < maxDist

Page 18: GTI 2007/08 H.Galhardas Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

H.GalhardasGTI 2007/08

Nested loopS1 S2

...

• Very expensive evaluation when handling large amounts of data

Need alternative execution algorithms for the same logical specification

editDistance

Page 19: GTI 2007/08 H.Galhardas Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

H.GalhardasGTI 2007/08

A database solution

CREATE TABLE MatchAuthors ASSELECT authorKey1, authorKey2, distance

FROM (SELECT a1.authorKey authorKey1, a2.authorKey authorKey2,

editDistance (a1.name, a2.name) distance

FROM DirtyAuthors a1, DirtyAuthors a2)

WHERE distance < maxDist;

No optimization supported for a Cartesian product with external function calls

Page 20: GTI 2007/08 H.Galhardas Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

H.GalhardasGTI 2007/08

Window scanning

S

n

Page 21: GTI 2007/08 H.Galhardas Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

H.GalhardasGTI 2007/08

Window scanning

S

n

Page 22: GTI 2007/08 H.Galhardas Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

H.GalhardasGTI 2007/08

Window scanning

S

n

May loose some matches

Page 23: GTI 2007/08 H.Galhardas Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

H.GalhardasGTI 2007/08

String distance filtering

S1 S2

maxDist = 1

John Smith

John Smit

Jogn Smith

John Smithe

length

length- 1

length

length + 1

editDistance

Page 24: GTI 2007/08 H.Galhardas Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

H.GalhardasGTI 2007/08

Annotation-based optimization

• The user specifies types of optimization • The system suggests which algorithm to

use

Ex:

CREATE MATCHING MatchDirtyAuthors

FROM DirtyAuthors da1, DirtyAuthors da2

LET dist = editDistance(da1.name, da2.name)

WHERE dist < maxDist

% distance-filtering: map= length; dist = abs %

INTO MatchAuthors

Page 25: GTI 2007/08 H.Galhardas Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

H.GalhardasGTI 2007/08

AJAX features• An extensible data quality framework

– Logical operators as extensions of relational algebra

– Physical execution algorithms

• A declarative language for logical operators – SQL extension

• A debugger facility for tuning a data cleaning program application– Based on a mechanism of exceptions

Page 26: GTI 2007/08 H.Galhardas Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

H.GalhardasGTI 2007/08

DEFINE FUNCTIONS ASChoose.uniqueString(OBJECT[]) RETURN STRING THROWS CiteSeerExceptionGenerate.generateId(INTEGER) RETURN STRINGNormal.removeCitationTags(STRING) RETURN STRING (600)

DEFINE ALGORITHMS ASTransitiveClosureSourceClustering(STRING)

DEFINE INPUT DATA FLOWS ASTABLE DirtyData(paper STRING (400));TABLE City(city STRING (80),citysyn STRING (80))KEY city,citysyn;

DEFINE TRANSFORMATIONS AS

CREATE MAPPING mapKeDiDa FROM DirtyData Dd LET keyKdd = generateId(1) {SELECT keyKdd AS paperKey, Dd.paper AS paperKEY paperKey CONSTRAINT NOT NULL mapKeDiDa.paper}

Declarative specification

Page 27: GTI 2007/08 H.Galhardas Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

H.GalhardasGTI 2007/08

DEFINE FUNCTIONS ASChoose.uniqueString(OBJECT[]) RETURN STRING THROWS CiteSeerExceptionGenerate.generateId(INTEGER) RETURN STRINGNormal.removeCitationTags(STRING) RETURN STRING (600)

DEFINE ALGORITHMS ASTransitiveClosureSourceClustering(STRING)

DEFINE INPUT DATA FLOWS ASTABLE DirtyData(paper STRING (400));TABLE City(city STRING (80),citysyn STRING (80))KEY city,citysyn;

DEFINE TRANSFORMATIONS AS

CREATE MAPPING mapKeDiDa FROM DirtyData Dd LET keyKdd = generateId(1) {SELECT keyKdd AS paperKey, Dd.paper AS paperKEY paperKey CONSTRAINT NOT NULL mapKeDiDa.paper}

Graph of data transformations

Declarative specification

Page 28: GTI 2007/08 H.Galhardas Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

H.GalhardasGTI 2007/08

AJAX features• An extensible data quality framework

– Logical operators as extensions of relational algebra

– Physical execution algorithms

• A declarative language for logical operators – SQL extension

• A debugger facility for tuning a data cleaning program application– Based on a mechanism of exceptions

Page 29: GTI 2007/08 H.Galhardas Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

H.GalhardasGTI 2007/08

Management of exceptions

• Problem: to mark tuples not handled by the cleaning criteria of an operator

• Solution: to specify the generation of exception tuples within a logical operator– exceptions are thrown by external functions– output constraints are violated

Page 30: GTI 2007/08 H.Galhardas Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

H.GalhardasGTI 2007/08

Debugger facility

• Supports the (backward and forward) data derivation of tuples wrt an operator to debug exceptions

• Supports the interactive data modification and, in the future, the incremental execution of logical operators

Page 31: GTI 2007/08 H.Galhardas Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

H.GalhardasGTI 2007/08

Debugging exceptions

Page 32: GTI 2007/08 H.Galhardas Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

H.GalhardasGTI 2007/08

Architecture

Page 33: GTI 2007/08 H.Galhardas Achieving Data Quality with AJAX (first version of AJAX designed and developed at INRIA Rocquencourt, France)

H.GalhardasGTI 2007/08

References

• Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon, Cristian-Augustin Saita: “Declarative Data Cleaning: Language, Model, and Algorithms”. VLDB 2001: 371-380