![Page 1: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/1.jpg)
1SAD Tagus
AJAX: Model, Declarative Language, and Algorithms
Helena Galhardas
![Page 2: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/2.jpg)
Plan
• Context
• Problem statement
• Contributions
• Our data cleaning solution
• Validation
• Related solutions
• Conclusions
![Page 3: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/3.jpg)
Application context
– Eliminate errors and duplicates within a single source
– Integrate data from different sources
– Migrate poorly structured data into structured data
![Page 4: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/4.jpg)
Typical architecture
[Figure: typical architecture. Components shown: Human Knowledge, Data Extraction, Data Transformation, Data Loading, Data Analysis, Schema Integration, Metadata, and Dictionaries, with Source Data on the input side and Target Data on the output side.]
![Page 5: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/5.jpg)
Data cleaning
Activity of transforming source data into target data without errors, duplicates, and inconsistencies
![Page 6: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/6.jpg)
Motivating example (1)
DirtyData(paper:String)
Data Cleaning
Events(eventKey, name)
Publications(pubKey, title, eventKey, url, volume, number, pages, city, month, year)
Authors(authorKey, name)
PubsAuthors(pubKey, authorKey)
![Page 7: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/7.jpg)
Motivating example (2)
DirtyData
[1] Dallan Quass, Ashish Gupta, Inderpal Singh Mumick, and Jennifer Widom. Making Views Self-Maintainable for Data Warehousing. In Proceedings of the Conference on Parallel and Distributed Information Systems. Miami Beach, Florida, USA, 1996
[2] D. Quass, A. Gupta, I. Mumick, J. Widom, Making views self-maintianable for data warehousing, PDIS’95
Data Cleaning
Events
PDIS | Conference on Parallel and Distributed Information Systems
Publications
QGMW96 | Making Views Self-Maintainable for Data Warehousing | PDIS | null | null | null | null | Miami Beach | Florida, USA | 1996
Authors
DQua | Dallan Quass
AGup | Ashish Gupta
JWid | Jennifer Widom
...
PubsAuthors
QGMW96 | DQua
QGMW96 | AGup
...
![Page 8: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/8.jpg)
Plan
• Context
• Problem statement
• Contributions
• Our data cleaning solution
• Validation
• Related solutions
• Conclusions
![Page 9: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/9.jpg)
Modeling a data cleaning process
A data cleaning process is modeled by a directed acyclic graph of data transformations
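[Figure: example graph leading from DirtyData to Authors. Data shown: DirtyData, DirtyAuthors, DirtyTitles, ..., DirtyEvents, Cities, Tags, Authors. Transformations shown: Extraction, Standardization, Formatting, Duplicate Elimination.]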
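A minimal sketch of the graph model in Python. The edges of the figure are not recoverable here, so the `dependencies` mapping below is a hypothetical linear ordering of the four transformations named on the slide. It relies on the standard-library `graphlib` module (Python 3.9+) to derive one valid execution order and to reject graphs that are not acyclic.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical dependencies: each transformation maps to the set of
# transformations that must finish before it can run.
dependencies = {
    "Extraction": set(),
    "Standardization": {"Extraction"},
    "Formatting": {"Standardization"},
    "DuplicateElimination": {"Formatting"},
}


def execution_order(deps):
    """Return one valid execution order; raises CycleError on a cycle."""
    return list(TopologicalSorter(deps).static_order())


def is_valid_process(deps):
    """A cleaning process is valid only if its graph is acyclic."""
    try:
        execution_order(deps)
    except CycleError:
        return False
    return True


order = execution_order(dependencies)
print(order)
```

Representing the process as a mapping from each transformation to its prerequisites keeps the acyclicity check and the scheduling in one place.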
![Page 10: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/10.jpg)
Existing technology
• Ad-hoc code
– difficult to maintain
• Extraction Transformation Loading (ETI, Informatica, Sagent)
– limited cleaning functionality
• Data Reengineering (Integrity)
– fixed implementation for certain operators
• Specific-domain cleaning (idCentric, PureIntegrate)
– names and addresses
• Duplicate elimination (DataCleanser, matchIt)
– finds/eliminates duplicates
![Page 11: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/11.jpg)
Problems of existing solutions (1)
The semantics of some data transformations is defined in terms of their implementation algorithms
[Figure: data cleaning transformations and application domains 1, 2, 3, ...]
![Page 12: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/12.jpg)
Problems of existing solutions (2)
There is a lack of interactive facilities to tune a data cleaning application program
[Figure: dirty data enters the cleaning process, which outputs clean data and rejected data.]
![Page 13: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/13.jpg)
AJAX
• An extensible data cleaning framework
• A declarative language for logical operators
• Efficient implementation of the match operator
• A debugger facility for tuning a data cleaning application
![Page 14: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/14.jpg)
Data cleaning framework
• Logical level: set of logical operators to express cleaning criteria enclosed in each data transformation
• Physical level: set of algorithms that implement the logical operations
![Page 15: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/15.jpg)
Logical level: parametric operators
• View: arbitrary SQL query
• Map: iterator-based one-to-many mapping with arbitrary user-defined functions
• Match: iterator-based approximate join
• Cluster: uses an arbitrary clustering function
• Merge: extends SQL group-by with user-defined aggregate functions
• Apply: executes an arbitrary user-defined algorithm
![Page 16: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/16.jpg)
Logical level
[Figure: the data cleaning graph from DirtyData to Authors, with data DirtyAuthors, DirtyTitles, ..., Cities, Tags and transformations Extraction, Standardization, Formatting, Duplicate Elimination.]
![Page 17: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/17.jpg)
Logical level
[Figure: the graph from DirtyData to Authors (Extraction, Standardization, Formatting, Duplicate Elimination) annotated with logical operators; labels shown: Map (three times), Match, Cluster, Merge.]
Physical level
[Figure: the same graph annotated with implementation algorithms; labels shown: SQL Scan, Java Scan (three times), NL, TC.]
![Page 18: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/18.jpg)
Contributions
• An extensible data cleaning framework
• A declarative language for logical operators
• Efficient implementation of the match operator
• A debugger facility for tuning a data cleaning application
![Page 19: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/19.jpg)
Match
• Input: 2 relations
• Finds data records that correspond to the same real object
• Calls distance functions for comparing field values and computing the distance between input tuples
• Output: 1 relation containing matching tuples and possibly 1 or 2 relations containing non-matching tuples
![Page 20: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/20.jpg)
Example
[Figure: Duplicate Elimination of DirtyAuthors into Authors, built from the Match, Cluster, and Merge operators, with MatchAuthors as the output of Match.]
![Page 21: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/21.jpg)
Example
CREATE MATCH MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET distance = editDistance(da1.name, da2.name)
WHERE distance < maxDist
INTO MatchAuthors
[Figure: Duplicate Elimination of DirtyAuthors into Authors, built from the Match, Cluster, and Merge operators, with MatchAuthors as the output of Match.]
![Page 22: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/22.jpg)
Example
CREATE MATCH MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET distance = editDistance(da1.name, da2.name)
WHERE distance < maxDist
INTO MatchAuthors
Input: DirtyAuthors(authorKey, name)
861 | johann christoph freytag
822 | jc freytag
819 | j freytag
814 | j-c freytag
Output: MatchAuthors(authorKey1, authorKey2, name1, name2)
861 | 822 | johann christoph freytag | jc freytag
822 | 814 | jc freytag | j-c freytag
...
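To illustrate how matching pairs such as those in MatchAuthors could feed the Cluster and Merge operators, here is a small Python sketch. It assumes a transitive-closure clustering function and a "keep the longest name" aggregate; both choices are hypothetical, since Cluster accepts an arbitrary clustering function and Merge accepts user-defined aggregates.

```python
from collections import defaultdict


def cluster_by_transitive_closure(pairs):
    """Group keys that are connected through match pairs (union-find)."""
    parent = {}

    def find(key):
        parent.setdefault(key, key)
        while parent[key] != key:
            parent[key] = parent[parent[key]]  # path halving
            key = parent[key]
        return key

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    clusters = defaultdict(set)
    for key in parent:
        clusters[find(key)].add(key)
    return list(clusters.values())


def merge_longest_name(cluster, names):
    """Hypothetical aggregate: keep the longest name of the cluster."""
    return max((names[key] for key in cluster), key=len)


# The two pairs listed in the MatchAuthors output of the example.
names = {861: "johann christoph freytag", 822: "jc freytag", 814: "j-c freytag"}
match_pairs = [(861, 822), (822, 814)]

clusters = cluster_by_transitive_closure(match_pairs)
authors = [merge_longest_name(cluster, names) for cluster in clusters]
print(authors)
```

With transitive closure, keys 861 and 814 end up in the same cluster even though they are never paired directly; a different clustering function could decide otherwise.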
![Page 23: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/23.jpg)
Implementation of the match operator
For s1 ∈ S1, s2 ∈ S2:
(s1, s2) is a match if editDistance(s1, s2) < maxDist
![Page 24: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/24.jpg)
Nested loop
[Figure: every tuple of S1 is compared with every tuple of S2 using editDistance.]
• Very expensive evaluation when handling large amounts of data
Need alternative execution algorithms for the same logical specification
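The nested-loop evaluation can be sketched in Python as follows. This is an illustration under stated assumptions rather than the AJAX implementation: `edit_distance` is a plain Levenshtein distance, and tuples are modelled as hypothetical `(key, name)` pairs.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def nested_loop_match(s1, s2, max_dist):
    """Compare every tuple of s1 with every tuple of s2."""
    matches = []
    for key1, name1 in s1:
        for key2, name2 in s2:
            distance = edit_distance(name1, name2)
            if distance < max_dist:
                matches.append((key1, key2, distance))
    return matches


s1 = [(822, "jc freytag")]
s2 = [(819, "j freytag"), (814, "j-c freytag"),
      (861, "johann christoph freytag")]
print(nested_loop_match(s1, s2, max_dist=2))
```

The distance function is called once per pair, so the cost grows with |S1| × |S2|, which is what makes this evaluation expensive on large inputs.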
![Page 25: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/25.jpg)
A database solution
CREATE TABLE MatchAuthors AS
SELECT authorKey1, authorKey2, distance
FROM (SELECT a1.authorKey authorKey1, a2.authorKey authorKey2,
editDistance(a1.name, a2.name) distance
FROM DirtyAuthors a1, DirtyAuthors a2)
WHERE distance < maxDist;
No optimization supported for a Cartesian product with external function calls
![Page 26: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/26.jpg)
Window scanning
[Figure: a window of size n slides over relation S.]
![Page 27: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/27.jpg)
![Page 28: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/28.jpg)
Window scanning
May lose some matches
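A rough Python sketch of window scanning, assuming that S is first sorted on the compared field (the slide does not state the ordering, so the `sort_key` parameter is an assumption) and that each tuple is compared only with the next n − 1 tuples. The function produces candidate pairs only; the distance function would then be applied to them.

```python
def window_candidates(records, window_size, sort_key):
    """Yield pairs of records that fall inside the same sliding window."""
    ordered = sorted(records, key=sort_key)
    for i, left in enumerate(ordered):
        # Compare each record with the next (window_size - 1) records only.
        for right in ordered[i + 1 : i + window_size]:
            yield left, right


dirty_authors = [
    (861, "johann christoph freytag"),
    (822, "jc freytag"),
    (819, "j freytag"),
    (814, "j-c freytag"),
]

candidates = list(
    window_candidates(dirty_authors, window_size=2, sort_key=lambda r: r[1])
)
candidate_keys = [(left[0], right[0]) for left, right in candidates]
print(candidate_keys)
```

With a window of 2, only 3 of the 6 possible pairs are compared. The pair (819, 822), "j freytag" and "jc freytag", is never generated although the two names differ by a single character, which shows how the method can lose matches.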
![Page 29: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/29.jpg)
String distance filtering
S1, S2; maxDist = 1
John Smith | length
John Smit | length - 1
Jogn Smith | length
John Smithe | length + 1
[Figure: editDistance is computed only for pairs that pass the length comparison.]
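The filtering idea can be sketched in Python under the following assumption: since the edit distance between two strings is never smaller than the difference of their lengths, a pair whose lengths differ by at least `max_dist` cannot satisfy `distance < max_dist` and can be skipped before the expensive call. The function name and the extra name "J. Smith" are hypothetical, and the sketch uses `max_dist = 2` so that names one edit apart remain candidates under the strict comparison.

```python
def length_filtered_candidates(s1, s2, max_dist):
    """Yield only pairs that the length filter cannot rule out."""
    for left in s1:
        for right in s2:
            # |len(left) - len(right)| is a lower bound on the edit distance.
            if abs(len(left) - len(right)) < max_dist:
                yield left, right


s1 = ["John Smith"]
s2 = ["John Smit", "Jogn Smith", "John Smithe", "J. Smith"]

candidates = list(length_filtered_candidates(s1, s2, max_dist=2))
print(candidates)
```

Only the surviving pairs need an edit-distance computation. Unlike window scanning, this filter discards no true matches, because it only removes pairs that the lower bound already excludes.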
![Page 30: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/30.jpg)
Annotation-based optimization
• The user specifies types of optimization
• The system suggests which algorithm to use
Ex:
CREATE MATCHING MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET dist = editDistance(da1.name, da2.name)
WHERE dist < maxDist
% distance-filtering: map= length; dist = abs %
INTO MatchAuthors
![Page 31: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/31.jpg)
Contributions
• An extensible data cleaning framework
• A declarative language for logical operators
• Efficient implementation of the match operator
• A debugger facility for tuning a data cleaning application
![Page 32: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/32.jpg)
Management of exceptions
• Problem: to mark tuples not handled by the cleaning criteria of an operator
• Solution: to specify the generation of exceptional tuples within a logical operator
– exceptions are thrown by external functions
– output constraints are violated
![Page 33: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/33.jpg)
Example (1)
CREATE MAP ExtractionCities
FROM StandardizedDirtyData dd
LET city = extractCities(dd.paper, Cities),
{ SELECT dd.paperKey AS pubKey, city AS city
INTO ExtractedCities
CONSTRAINT NOT NULL city }
[Figure: the Map operator implementing the Extraction step reads StandardizedDirtyData(pubKey, paper) and Cities and produces ExtractedCities(pubKey, city).]
![Page 34: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/34.jpg)
Example (2)
StandardizedDirtyData
4 | y ioannidis r ng k shim and t sellis parametric query optimization technical report univ of wisconsin madison and univ of maryland college park
[Figure: ExtractionCities reads StandardizedDirtyData and Cities and produces ExtractedCities and the exception output StandardizedDirtyDataexc.]
StandardizedDirtyDataexc
4 | ManyDifferentCities
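The exception mechanism of the two examples above can be approximated in Python as follows. This is a sketch under assumptions: `extract_city` stands in for the external function `extractCities`, the substring test and the `NotNullViolation` label are hypothetical, and rows 5 and 6 are invented for illustration. Tuples that raise an exception or violate the NOT NULL constraint go to a separate exception output instead of stopping the operator.

```python
class ManyDifferentCities(Exception):
    """Raised by the (hypothetical) external function on ambiguous input."""


def extract_city(paper, cities):
    """Stand-in for extractCities: find the single known city in the text."""
    found = [city for city in cities if city in paper]
    if len(found) > 1:
        raise ManyDifferentCities(found)
    return found[0] if found else None


def map_extraction_cities(rows, cities):
    """Return (ExtractedCities rows, exception rows)."""
    extracted, exceptions = [], []
    for pub_key, paper in rows:
        try:
            city = extract_city(paper, cities)
        except ManyDifferentCities as error:
            exceptions.append((pub_key, type(error).__name__))
            continue
        if city is None:  # mirrors CONSTRAINT NOT NULL city
            exceptions.append((pub_key, "NotNullViolation"))
            continue
        extracted.append((pub_key, city))
    return extracted, exceptions


cities = ["madison", "college park", "miami beach"]
rows = [
    (4, "y ioannidis r ng k shim and t sellis parametric query optimization "
        "technical report univ of wisconsin madison and univ of maryland "
        "college park"),
    (5, "d quass a gupta making views self maintainable pdis miami beach"),
    (6, "technical report 1992"),
]

extracted, exceptions = map_extraction_cities(rows, cities)
print(extracted)
print(exceptions)
```

Tuple 4 lands in the exception output with the label ManyDifferentCities, as in the slide, while the remaining tuples are processed normally.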
![Page 35: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/35.jpg)
Debugger facility
• Supports the backward and forward data derivation of tuples with respect to an operator, to debug exceptions
• Supports the interactive data modification and the incremental execution of some logical operators
![Page 36: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/36.jpg)
Backward/forward data derivation
[Figure: part of the cleaning graph with data KeyDirtyData, StandardizedDirtyData, StandardizedDirtyDataForExtraction, DirtyEvents, Cities, ExtractedCities, and StandardizedDirtyDataexc, and operators StandardizeData, StandardizeDataForExtraction, ExtractionAuthorsTitleEvent, and ExtractionCities; arrows labelled Backward Derivation and Forward Derivation connect the tuples below.]
4 | Y. Ioannidis, R. Ng, K. Shim, and T. Sellis. Parametric query optimization. Technical Report, Univ. Of Wisconsin, Madison and Univ. Of Maryland, College Park, 1992
4 | Technical Report, Univ. Of Wisconsin, and Univ. Of Maryland
4 | ManyDifferentCities
![Page 37: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/37.jpg)
Interactive data correction (1)
[Figure: the cleaning graph with data KeyDirtyData, StandardizedDirtyData, StandardizedDirtyDataForExtraction, DirtyEvents, Cities, ExtractedCities, and StandardizedDirtyDataexc, and operators StandardizeData, StandardizeDataForExtraction, ExtractionAuthorsTitleEvent, and ExtractionCities, shown with the tuples below.]
4 | ManyDifferentCities
4 | Technical Report, Univ. Of Wisconsin and Univ. Of Maryland
4 | Y. Ioannidis, R. Ng, K. Shim, and T. Sellis. Parametric query optimization. Technical Report, Univ. Of Wisconsin, Madison, 1992
101 | Y. Ioannidis, R. Ng, K. Shim, and T. Sellis. Parametric query optimization. Technical Report, Univ. Of Maryland, College Park, 1992
![Page 38: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/38.jpg)
Interactive data correction (2)
KeyDirtyData
4 | Y. Ioannidis, R. Ng, K. Shim, and T. Sellis. Parametric query optimization. Technical Report, Univ. Of Wisconsin, Madison, 1992
101 | Y. Ioannidis, R. Ng, K. Shim, and T. Sellis. Parametric query optimization. Technical Report, Univ. Of Maryland, College Park, 1992
[Figure: the cleaning graph with data StandardizedDirtyData, StandardizedDirtyDataForExtraction, DirtyEvents, Cities, and ExtractedCities, and operators StandardizeData, StandardizeDataForExtraction, ExtractionAuthorsTitleEvent, and ExtractionCities; four steps are labelled "incremental". Tuples shown downstream:]
4 | Technical Report, Univ. Of Wisconsin
101 | Technical Report, Univ. Of Maryland
4 | Madison
101 | College Park
![Page 39: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/39.jpg)
AJAX Architecture
![Page 40: SAD Tagus 1 AJAX: Model, Declarative Language, and Algorithms Helena Galhardas](https://reader035.vdocuments.mx/reader035/viewer/2022062715/56649d785503460f94a5ba8c/html5/thumbnails/40.jpg)
AJAX Demo