
Page 1: Data Cleaning and Transformation

Data Cleaning and Transformation

Helena Galhardas, DEI IST

(based on the slides: “A Survey of Data Quality Issues in Cooperative Information Systems”, Carlo Batini, Tiziana Catarci, Monica Scannapieco, 23rd International Conference on Conceptual Modelling (ER 2004))

Page 2: Data Cleaning and Transformation

Agenda

- Introduction
- Data (Quality) Problems
- Data Cleaning and Transformation
- Data Quality Dimensions
- Techniques
  - Object Identification
  - Data integration/fusion
- Tools

Page 3: Data Cleaning and Transformation

When materializing the integrated data (data warehousing)…

[Figure: SOURCE DATA → Data Extraction → Data Transformation → Data Loading → TARGET DATA]

ETL: Extraction, Transformation and Loading

70% of the time in a data warehousing project is spent on the ETL process.

Page 4: Data Cleaning and Transformation

Why Data Cleaning and Transformation?

Data in the real world is dirty:
- incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - e.g., occupation=""
- noisy: containing errors or outliers (spelling, phonetic and typing errors, word transpositions, multiple values in a single free-form field)
  - e.g., Salary="-10"
- inconsistent: containing discrepancies in codes or names (synonyms and nicknames, prefix and suffix variations, abbreviations, truncation and initials)
  - e.g., Age="42" but Birthday="03/07/1997"
  - e.g., rating was "1,2,3", now rating is "A, B, C"
  - e.g., discrepancy between duplicate records

Page 5: Data Cleaning and Transformation

Why Is Data Dirty?

Incomplete data comes from:
- data values not available when collected
- different criteria between the time when the data was collected and when it is analyzed
- human/hardware/software problems

Noisy data comes from:
- data collection: faulty instruments
- data entry: human or computer errors
- data transmission

Inconsistent (and redundant) data comes from:
- different data sources, hence non-uniform naming conventions/data codes
- functional dependency and/or referential integrity violations

Page 6: Data Cleaning and Transformation

Why is Data Quality Important?

Data cleaning and transformation is the activity of converting source data into target data without errors, duplicates, and inconsistencies, i.e., cleaning and transforming to get high-quality data!

No quality data, no quality decisions! Quality decisions must be based on good quality data (e.g., duplicate or missing data may cause incorrect or even misleading statistics).

Page 7: Data Cleaning and Transformation

Data Quality: Multidimensional Concept

- Accuracy: "Jhn" vs. "John"
- Currency: residence (permanent) address: out-dated vs. up-to-date
- Consistency: ZIP code and city must be consistent
- Completeness: attribute completeness vs. entity completeness

[Slide figure: an address table with columns Prefix, StreetName, Number, ZipCode, City. The tuple (Via, Salaria, 113, 00198, Roma) is fully specified; a tuple with only StreetName=Salaria and City=Roma filled in illustrates attribute (in)completeness; a table missing the tuple (Via, Gracchi, 74, 00193, Roma) illustrates entity (in)completeness.]

batini
First of all, data quality is a multidimensional concept. As we will see, about 50 different dimensions have been proposed in the literature. Four of them are shown here. Concerning accuracy, "Jhn" is intuitively less accurate than "John". Consistency pertains to several aspects, e.g., the coherence of ZIP code and city in a database of permanent addresses of citizens. Concerning completeness, intuitively it may deal with full or only partial knowledge of the records in a table (we name this entity completeness), or else full or partial knowledge of the attributes of a record (attribute completeness).
Page 8: Data Cleaning and Transformation


Application contexts

- Eliminate errors and duplicates within a single source
  - E.g., duplicates in a file of customers
- Integrate data from different sources
  - E.g., populating a DW from different operational data stores
- Migrate data from a source schema into a different fixed target schema
  - E.g., discontinued application packages
- Convert poorly structured data into structured data
  - E.g., processing data collected from the Web

Page 9: Data Cleaning and Transformation

Types of information systems

[Diagram: classification of information systems along three axes — Distribution (no/yes), Heterogeneity (no/semi/totally), Autonomy (no/yes). Monolithic systems, DW systems, CISs, and P2P systems occupy different combinations of these levels.]

batini
The second area we have to consider in order to express priorities concerns the types of information systems. Here we inherit and adapt a classification provided by Valduriez for database management systems, based on the degree of distribution, heterogeneity and autonomy. Among all possible combinations of levels, apart from CISs, we are interested to some extent in traditional monolithic systems, considered in several methodologies, while only in the open problems will we discuss DQ in peer-to-peer systems, whose importance in research is growing.
Page 10: Data Cleaning and Transformation

Types of data

- Many authors: structured, semi-structured, unstructured
- [Shankaranaian et al 2000]: raw data, component data, information products
- [Bouzenghoub et al 2004]: stable, long-term changing, frequently changing
- [Dasu et al 2003]: federated data (coming from different heterogeneous sources), massive high-dimensional data, descriptive data, longitudinal data (consisting of time series), Web data
- [Batini et al 2003]: elementary data, aggregated data

Page 11: Data Cleaning and Transformation

Research areas in DQ systems

[Diagram: research areas — Dimensions, Models, Techniques, Measurement/Improvement Tools and Frameworks, Methodologies — applied across application domains such as EGov, Scientific Data, Web Data, ...]

Page 12: Data Cleaning and Transformation

Research issues related to DQ

[Diagram: Data Quality at the centre, related to Data Cleaning, Data Integration, Data Mining, Statistical Data Analysis, Management Information Systems, and Knowledge Representation. Associated research issues include: record matching (deduplication), data transformation, error localization, DB profiling, patterns in text strings, conflict resolution, source selection, source composition, query result selection, time synchronization, assessment, process improvement, cost/optimization tradeoff, edit-imputation, and record linkage.]

Page 13: Data Cleaning and Transformation

[Diagram: the same areas — Data Quality, Data Integration, Data Cleaning, Data Mining, Statistical Data Analysis, Management Information Systems, Knowledge Representation.]

Research issues mainly addressed in the book

batini
Among them, the areas that will be investigated most of all, in their relationship with data quality, are data integration, MIS, and data cleaning.
Page 14: Data Cleaning and Transformation

Existing technology

- Ad-hoc programs written in a programming language like C or Java, or using an RDBMS proprietary language
  - Programs difficult to optimize and maintain
- RDBMS mechanisms for guaranteeing integrity constraints
  - Do not address important data instance problems
- Data transformation scripts using an ETL (Extraction-Transformation-Loading) or data quality tool

Page 15: Data Cleaning and Transformation

Typical architecture of a DQ system

[Figure: SOURCE DATA → Data Extraction → Data Analysis / Schema Integration → Data Transformation → Data Loading → TARGET DATA, supported by human knowledge and metadata dictionaries.]

Page 16: Data Cleaning and Transformation

Data Quality Dimensions

Page 17: Data Cleaning and Transformation

Traditional data quality dimensions

- Accuracy
- Completeness
- Time-related dimensions: Currency, Timeliness, and Volatility
- Consistency

Schema quality dimensions are also defined.

Their definitions do not provide quantitative measures, so one or more metrics have to be associated with each dimension. For each metric, one or more measurement methods have to be provided regarding: (i) where the measurement is taken; (ii) what data are included; (iii) the measurement device; and (iv) the scale on which results are reported.

Page 18: Data Cleaning and Transformation

Accuracy

Closeness between a value v and a value v', considered as the correct representation of the real-world phenomenon that v aims to represent.
- Ex: for a person name "John", v'=John is correct, v=Jhn is incorrect

Syntactic accuracy: closeness of a value v to the elements of the corresponding definition domain D
- Ex: if v=Jack, even if v'=John, v is considered syntactically correct
- Measured by means of comparison functions (e.g., edit distance) that return a score

Semantic accuracy: closeness of the value v to the true value v'
- Measured with a <yes, no> or <correct, not correct> domain
- Coincides with correctness
- The corresponding true value has to be known

Page 19: Data Cleaning and Transformation

Granularity of accuracy definition

Accuracy may refer to:
- a single value of a relation attribute
- an attribute or column
- a relation
- the whole database

Example of a metric to quantify data accuracy: ratio between the number of accurate values and the total number of values.

Page 20: Data Cleaning and Transformation

Completeness

"The extent to which data are of sufficient breadth, depth, and scope for the task in hand."

Three types:
- Schema completeness: degree to which concepts and their properties are not missing from the schema
- Column completeness: measure of the missing values for a specific property or column in a table
- Population completeness: evaluates missing values with respect to a reference population

Page 21: Data Cleaning and Transformation

Completeness of relational data

The completeness of a table characterizes the extent to which the table represents the real world. It can be characterized wrt:
- The presence/absence and meaning of null values
  - Ex: Person(name, surname, birthdate, email); if email is null it may indicate that the person has no email (no incompleteness), that the email exists but is not known (incompleteness), or that it is not known whether the person has an email (incompleteness may or may not be the case)
- Validity of the open world assumption (OWA) or closed world assumption (CWA)
  - OWA: we can state neither the truth nor the falsity of facts not represented in the tuples of a relation
  - CWA: only the values actually present in a relational table, and no other values, represent facts of the real world

Page 22: Data Cleaning and Transformation

Metrics for quantifying completeness (1)

Model without null values with OWA:
- Need a reference relation ref(r) for a relation r, containing all the tuples that satisfy the schema of r
- C(r) = |r| / |ref(r)|
- Ex: according to a registry of the Lisbon municipality, the number of citizens is 2 million. If a company stores data about Lisbon citizens for the purpose of its business and that number is 1,400,000, then C(r) = 0.7

Page 23: Data Cleaning and Transformation

Metrics for quantifying completeness (2)

Model with null values with CWA: specific definitions for different granularities:
- Value: to capture the presence of null values for some fields of a tuple
- Tuple: to characterize the completeness of a tuple wrt the values of all its fields
  - Evaluates the percentage of specified values in the tuple wrt the total number of attributes of the tuple itself
  - Ex: Student(stID, name, surname, vote, examdate): (6754, Mike, Collins, 29, 7/17/2004) vs (6578, Julliane, Merrals, NULL, 7/17/2004)

Page 24: Data Cleaning and Transformation

Metrics for quantifying completeness (3)

- Attribute: to measure the number of null values of a specific attribute in a relation
  - Evaluates the percentage of specified values in the column corresponding to the attribute wrt the total number of values that should have been specified
  - Ex: for calculating the average of votes in Student, a notion of the completeness of vote is useful
- Relation: to capture the number of null values in the whole relation
  - Measures how much information is represented in the relation by evaluating the content of the information actually available wrt the maximum possible content, i.e., without null values
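A small sketch of these completeness metrics under the CWA with null values; the relation, attribute names and data below are illustrative, with None playing the role of NULL.

# Illustrative completeness metrics for a relation stored as a list of dicts.
students = [
    {"stID": 6754, "name": "Mike", "surname": "Collins", "vote": 29, "examdate": "7/17/2004"},
    {"stID": 6578, "name": "Julliane", "surname": "Merrals", "vote": None, "examdate": "7/17/2004"},
]

def tuple_completeness(t):
    # fraction of specified (non-null) values over all attributes of the tuple
    return sum(v is not None for v in t.values()) / len(t)

def attribute_completeness(rel, attr):
    # fraction of non-null values in one column
    return sum(t[attr] is not None for t in rel) / len(rel)

def relation_completeness(rel):
    # non-null cells over all cells of the relation
    cells = [v for t in rel for v in t.values()]
    return sum(v is not None for v in cells) / len(cells)

print(tuple_completeness(students[1]))           # 0.8 (1 null out of 5 fields)
print(attribute_completeness(students, "vote"))  # 0.5
print(relation_completeness(students))           # 0.9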

Page 25: Data Cleaning and Transformation

Time-related dimensions

- Currency: concerns how promptly data are updated
  - Ex: the residential address of a person is current if it corresponds to the address where the person actually lives
- Volatility: characterizes the frequency with which data vary in time
  - Ex: birth dates (volatility zero) vs stock quotes (high degree of volatility)
- Timeliness: expresses how current data are for the task in hand
  - Ex: the timetable for university courses can be current, by containing the most recent data, but not timely if it is available only after the start of the classes

Page 26: Data Cleaning and Transformation

Metrics of time-related dimensions

- Currency: last update metadata; straightforward for data types that change with a fixed frequency
- Volatility: length of time that data remain valid
- Timeliness: currency plus a check that data are available before the planned usage time

Page 27: Data Cleaning and Transformation

Consistency

Captures the violation of semantic rules defined over a set of data items, where data items can be tuples of relational tables or records in a file.
- Integrity constraints in relational data: domain constraints, key, inclusion and functional dependencies
- Data edits: semantic rules in statistics
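A minimal sketch of checking such semantic rules over tuples; the domain constraint (age range) and functional dependency (ZipCode -> City) used here are illustrative assumptions, not rules prescribed by the slides.

# Illustrative consistency checks: a domain constraint and a functional
# dependency zip -> city over a small set of address records.
records = [
    {"street": "Salaria", "zip": "00198", "city": "Roma", "age": 42},
    {"street": "Gracchi", "zip": "00193", "city": "Roma", "age": -5},    # violates the domain rule
    {"street": "Verdi",   "zip": "00198", "city": "Milano", "age": 30},  # violates the FD wrt record 0
]

def domain_violations(rel):
    # domain constraint: age must be between 0 and 120
    return [t for t in rel if not (0 <= t["age"] <= 120)]

def fd_violations(rel, lhs, rhs):
    # functional dependency lhs -> rhs: one lhs value cannot map to two rhs values
    seen, bad = {}, []
    for t in rel:
        if t[lhs] in seen and seen[t[lhs]] != t[rhs]:
            bad.append(t)
        seen.setdefault(t[lhs], t[rhs])
    return bad

print(domain_violations(records))             # the Gracchi record
print(fd_violations(records, "zip", "city"))  # the Verdi record (00198 -> Milano vs Roma)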

Page 28: Data Cleaning and Transformation

Evolution of dimensions

Traditional dimensions are accuracy, completeness, timeliness, consistency.
1. With the advent of networks, sources increase dramatically, and data often become "found data".
2. Federated data, where many disparate data are integrated, are highly valued.
3. Data collection and analysis are frequently disconnected.

As a consequence we have to revisit the concept of DQ, and new dimensions become fundamental.

Page 29: Data Cleaning and Transformation

Other dimensions

- Interpretability: concerns the documentation and metadata that are available to correctly interpret the meaning and properties of data sources
- Synchronization between different time series: concerns proper integration of data having different time stamps
- Accessibility: measures the ability of the user to access the data given his/her own culture, physical status/functions, and available technologies

Page 30: Data Cleaning and Transformation

Techniques

- Relevant activities in DQ
- Techniques for record linkage/object identification/record matching
- Techniques for data integration

Page 31: Data Cleaning and Transformation

Relevant activities in DQ

Page 32: Data Cleaning and Transformation

Relevant activities in DQ

- Record linkage/object identification/entity identification/record matching
- Data integration
  - Schema matching
  - Instance conflict resolution
  - Source selection
  - Result merging
  - Quality composition
- Error localization/data auditing
  - Data editing-imputation/deviation detection
- Profiling
  - Structure induction
- Data correction/data cleaning/data scrubbing
- Schema cleaning
- Tradeoff/cost optimization

Page 33: Data Cleaning and Transformation

Record Linkage/Object identification/Entity identification/Record matching

Given two tables or two sets of tables, representing two entities/objects of the real world, find and cluster all records in the tables referring to the same entity/object instance.

Page 34: Data Cleaning and Transformation

Data integration (1)

1. Schema matching: takes two schemas as input and produces a mapping between semantically corresponding elements of the two schemas.

Page 35: Data Cleaning and Transformation

Data Integration (2)

2. Instance conflict resolution & merging. Instance-level conflicts can be of three types:
- representation conflicts, e.g. dollar vs. euro
- key equivalence conflicts, i.e. same real world objects with different identifiers
- attribute value conflicts, i.e. instances corresponding to the same real world object and sharing an equivalent key differ on other attributes

Page 36: Data Cleaning and Transformation

Data Integration (3)

3. Result merging: derives from the combination of individual answers into one single answer returned to the user.
For numerical data, it is often called a fused answer, and a single value is returned as an answer, potentially differing from each of the alternatives.

Page 37: Data Cleaning and Transformation

Data Integration (4)

4. Source selection: querying a multidatabase with different sources characterized by different qualities.

5. Quality composition: defines an algebra for composing data quality dimension values.

Page 38: Data Cleaning and Transformation

Error localization/Data Auditing

Given one/two/n tables or groups of tables, and a group of integrity constraints/qualities (e.g. completeness, accuracy), find the records that do not respect the constraints/qualities.
- Data editing-imputation: focuses on integrity constraints
- Deviation detection: data checking that marks deviations as possible data errors

Page 39: Data Cleaning and Transformation

Profiling

Evaluating statistical properties and intensional properties of tables and records.

Structure induction: induction of a structural description, i.e. "any form of regularity that can be found".

Page 40: Data Cleaning and Transformation

Data correction/data cleaning/data scrubbing

Given one/two/n tables or groups of tables, and a set of identified errors in records wrt given qualities, generate probable corrections (deviation detection) and correct the records, in such a way that the new records respect the qualities.

Page 41: Data Cleaning and Transformation

Schema cleaning

Transform the conceptual schema in order to achieve or optimize a given set of qualities (e.g. Readability, Normalization), while preserving other properties (e.g. equivalence of content)

Page 42: Data Cleaning and Transformation

Tradeoff/cost optimization

Tradeoff – When some desired qualities are conflicting (e.g. completeness and consistency), optimize such properties according to a given target

Cost – Given a cost model, optimize a given request of data quality according to a cost objective coherent with the cost model

Page 43: Data Cleaning and Transformation

Techniques for Record Linkage/Object identification/Entity identification/Record matching

Page 44: Data Cleaning and Transformation

Techniques for record linkage/object identification/record matching
- Introduction to techniques
- General strategies
- Details on specific steps
- Short profiles of techniques
- [Detailed description]
- [Comparison]

Page 45: Data Cleaning and Transformation

Introduction to techniques

Page 46: Data Cleaning and Transformation

An example

The same business as represented in the three most important business databases in central agencies in Italy:
- CC: Chambers of Commerce
- INPS: Social Security
- INAIL: Accident Insurance

Each record has the fields: Id, Name, Type of activity, Address, City.

Page 47: Data Cleaning and Transformation

The three CC, INPS and INAIL records

Id | Name | Type of activity | Address | City
CNCBTB765SDV | Meat production of Bartoletti Benito | Retail of bovine and ovine meats | National Street dei Giovi | Novi Ligure
0111232223 | Bartoletti Benito meat production | Grocer's shop, beverages | 9, Rome Street | Pizzolo Formigaro
CNCBTR765LDV | Meat production in Piemonte of Bartoletti Benito | Butcher | 4, Mazzini Square | Ovada

Page 48: Data Cleaning and Transformation

Record linkage and its evolution towards object identification in databases

Record linkage: do the first record and the second record represent the same aspect of reality?
- e.g. "Crlo Batini, 55" vs. "Btini Carlo, 54"

Object identification in databases: do the first group of records and the second group of records represent the same aspect of reality?
- e.g. (Crlo Batini, Pescara, Pscara, Itly, Europe) vs. (Carlo Batini, Pscara, Pscara, Italy, Europe)

Page 49: Data Cleaning and Transformation

Type of data considered

- (Two) formatted tables
  - Homogeneous in common attributes
  - Heterogeneous in:
    - Format (full name vs acronym)
    - Semantics (e.g. age vs date of birth)
    - Errors
- (Two) groups of tables
  - Dimensional hierarchy
- (Two) XML documents

Page 50: Data Cleaning and Transformation

…and object identification in semistructured documents

Are the following two documents the same object?

<country>
  <name> United States of America </name>
  <cities> New York, Los Angeles, Chicago </cities>
  <lakes>
    <name> Lake Michigan </name>
  </lakes>
</country>

and

<country> United States
  <city> New York </city>
  <city> Los Angeles </city>
  <lakes>
    <lake> Lake Michigan </lake>
  </lakes>
</country>

Page 51: Data Cleaning and Transformation

Relevant steps

[Figure: the input files A and B define the initial search space A x B; a blocking method reduces it to a subset S' of A x B (reduced search space); a decision model then assigns each pair in S' to Match, Unmatch, or Possible match.]

Page 52: Data Cleaning and Transformation

General strategy and phases (1)

0. Preprocessing: standardize fields to compare and correct simple errors
1. Establish a blocking method (also called searching method): given the search space S = A x B of the two files, find a new search space S' contained in S, to which the further steps are applied
2. Choose a comparison function: choose the function/set of rules that expresses the distance between pairs of records in S'
3. Choose a decision model: choose the method for assigning pairs in S' to M, the set of matching records, U, the set of unmatching records, and P, the set of possible matches
4. Check the effectiveness of the method
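A minimal end-to-end sketch of these phases on two small files of records; the field names, blocking key, comparison function and thresholds are all illustrative assumptions, not the method prescribed by the slides.

# Hypothetical record-linkage pipeline: preprocessing, blocking, comparison, decision.
from difflib import SequenceMatcher

A = [{"id": 1, "name": "Kathi Kason", "addr": "48 North St."}]
B = [{"id": 7, "name": "Kathy Kason", "addr": "48 North St."},
     {"id": 9, "name": "Lisa Brown",  "addr": "144 Ward St."}]

def preprocess(r):                       # phase 0: standardization
    return {k: v.lower().strip() if isinstance(v, str) else v for k, v in r.items()}

def block_key(r):                        # phase 1: blocking on a crude key
    return r["name"][:2] + r["addr"][:2]

def similarity(r1, r2):                  # phase 2: comparison function
    return SequenceMatcher(None, r1["name"] + r1["addr"], r2["name"] + r2["addr"]).ratio()

def decide(score, t_low=0.6, t_high=0.85):   # phase 3: decision model
    return "M" if score >= t_high else "U" if score < t_low else "P"

A, B = [preprocess(r) for r in A], [preprocess(r) for r in B]
pairs = [(a, b) for a in A for b in B if block_key(a) == block_key(b)]  # reduced space S'
for a, b in pairs:
    print(a["id"], b["id"], decide(similarity(a, b)))   # e.g. 1 7 M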

Page 53: Data Cleaning and Transformation

General strategy and phases (2)

- [Preprocessing]
- Establish a blocking method
- Compute the comparison function
- Apply the decision model
- [Check effectiveness of the method]

Page 54: Data Cleaning and Transformation

Paradigms used in techniques

- Empirical: heuristics, algorithms, simple mathematical properties
- Statistical/probabilistic: properties and formalisms from the areas of probability theory and Bayesian networks
- Knowledge based: formalisms and models from knowledge representation and reasoning

Page 55: Data Cleaning and Transformation

General strategy – probabilistic methods

- [Preprocessing]
- Establish a blocking method
- Compute the comparison function
- Compute probabilities and thresholds
  - with a training set
  - without a training set
- [Check effectiveness of the method]

Page 56: Data Cleaning and Transformation

General strategy – knowledge based methods

- [Preprocessing]
- Establish a blocking method
- Get domain knowledge
- Choose transformations/rules
- Choose the most relevant rules
- Apply the rules using a knowledge based engine
- [Check effectiveness of the method]

Page 57: Data Cleaning and Transformation

General strategy – hierarchical structures/tree structures

- [Preprocessing]
- Structure traversal
- Establish a blocking/filtering method
- Compute the comparison function
- Apply a local/global decision model
- [Check effectiveness of the method]

Example: (Crlo Batini, Pescara, Pscara, Itly, Europe) vs. (Carlo Batini, Pscara, Pscara, Italy, Europe)

Page 58: Data Cleaning and Transformation

Details on specific steps

Page 59: Data Cleaning and Transformation

Preprocessing

- Error correction (with the closest value in the domain): e.g. Crlo → Carlo
- Elimination of stop words: of, in, ...
- Standardization:
  - E.g. 1: Bob → Robert
  - E.g. 2: "Mr George W Bush" → "George Bush"

Page 60: Data Cleaning and Transformation

Types of blocking/searching methods (1)

Blocking – partition of the file into mutually exclusive blocks. Comparisons are restricted to records within each block.

Blocking can be implemented by:
- Sorting the file according to a block key (chosen by an expert)
- Hashing – a record is hashed according to its block key into a hash block. Only records in the same hash block are considered for comparison.
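A small sketch of hash-based blocking; the block key below (first three letters of the surname) is an illustrative choice an expert might make, not the key from the slides.

# Illustrative hash-based blocking: records are grouped by a block key and
# only pairs inside the same block are compared.
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "surname": "Smith",   "first": "John"},
    {"id": 2, "surname": "Smithe",  "first": "John"},
    {"id": 3, "surname": "Collins", "first": "Marianne"},
]

def block_key(r):
    # expert-chosen key: first three letters of the surname, upper-cased
    return r["surname"][:3].upper()

blocks = defaultdict(list)
for r in records:
    blocks[block_key(r)].append(r)        # one "hash block" per key value

candidate_pairs = [pair
                   for block in blocks.values()
                   for pair in combinations(block, 2)]
print([(a["id"], b["id"]) for a, b in candidate_pairs])   # [(1, 2)] -- Collins is never compared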

Page 61: Data Cleaning and Transformation

String distance filtering

[Figure: filtering candidate pairs S1, S2 by string length. With maxDist = 1, candidates S2 for S1 = "John Smith" must have length within ±1 of S1 (e.g. "John Smit", "Jogn Smith", "John Smithe"), because the edit distance between two strings is at least the difference of their lengths.]

Page 62: Data Cleaning and Transformation

Types of blocking/searching methods (2)

- Sorted neighbour or windowing – moving a window of a specific size w over the data file, comparing only the records that belong to this window
- Multiple windowing – several scans, each of which uses a different sorting key, may be applied to increase the possibility of combining matched records
- Windowing with choice of the key based on the quality of the key

Page 63: Data Cleaning and Transformation

Window scanning

[Figure: a window of size n slides over the sorted file S; only records within the window are compared.]

Page 64: Data Cleaning and Transformation

Window scanning

[Figure: the window advances over S.]

Page 65: Data Cleaning and Transformation

Window scanning

[Figure: the window advances further over S.]

May lose some matches.

Page 66: Data Cleaning and Transformation

Comparison functions

Type | Example
Identity | 'Batini' = 'Batini'
Simple distance | 'Batini' similar to 'Btini'
Complex distance | 'Batini' similar to 'Btaini'
Error driven distance (Soundex) | 'Hilbert' similar to 'Heilbpr'
Frequency weighted distance | 1. Smith is more frequent than Zabrinsky; 2. in "IBM research", IBM is less frequent than research
Transformation | "John Fitzgerald Kennedy Airport" → acronym "JFK Airport"
Rule (clue) | Seniors appear to have one residence in Florida or Arizona and another in the Midwest or Northeast regions of the country

Page 67: Data Cleaning and Transformation

More on comparison functions / String comparators (1)

Hamming distance – counts the number of mismatches between two numbers or text strings (fixed length strings).
- Ex: the Hamming distance between 00185 and 00155 is 1, because there is one mismatch.

Edit distance – the minimum cost to convert one of the strings to the other by a sequence of character insertions, deletions and replacements. Each one of these modifications is assigned a cost value.
- Ex: the edit distance between "Smith" and "Sitch" is 2, since "Smith" is obtained by adding "m" to and deleting "c" from "Sitch".
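A brief sketch of both comparators; unit costs are assumed for the edit operations.

# Hamming distance (fixed-length strings) and unit-cost edit distance
# (Levenshtein), as illustrations of the two comparators.
def hamming(a: str, b: str) -> int:
    assert len(a) == len(b), "Hamming distance needs equal-length strings"
    return sum(x != y for x, y in zip(a, b))

def edit_distance(a: str, b: str) -> int:
    # dynamic programming over prefixes; prev[j] = distance between a-prefix and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # replace ca by cb
        prev = curr
    return prev[-1]

print(hamming("00185", "00155"))        # 1
print(edit_distance("Smith", "Sitch"))  # 2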

Page 68: Data Cleaning and Transformation

More on comparison functions / String comparators (2)

Jaro's algorithm or distance finds the number of common characters and the number of transposed characters in the two strings. The Jaro distance accounts for insertions, deletions, and transpositions.
- A common character is a character that appears in both strings within a distance of half the length of the shorter string. A transposed character is a common character that appears in different positions.
- Ex: comparing "Smith" and "Simth", there are five common characters, two of which are transposed.

Scaled Jaro string comparator:
f(s1, s2) = (Nc/length(s1) + Nc/length(s2) + (Nc − 0.5·Nt)/Nc) / 3,
where Nc is the number of common characters between the two strings and Nt is the number of transposed characters.
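A sketch of the Jaro comparator as just defined (common characters searched within half the shorter string's length, transposed characters counted over the common characters); this is an illustrative implementation rather than a reference one.

# Illustrative implementation of the scaled Jaro string comparator.
def jaro(s1: str, s2: str) -> float:
    if not s1 or not s2:
        return 0.0
    window = min(len(s1), len(s2)) // 2       # search distance for "common" characters
    used = [False] * len(s2)
    common1 = []
    for i, c in enumerate(s1):                # common characters, in order of s1
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used[j] and s2[j] == c:
                used[j] = True
                common1.append(c)
                break
    nc = len(common1)
    if nc == 0:
        return 0.0
    common2 = [s2[j] for j in range(len(s2)) if used[j]]   # common characters, in order of s2
    nt = sum(a != b for a, b in zip(common1, common2))     # transposed characters
    return (nc / len(s1) + nc / len(s2) + (nc - 0.5 * nt) / nc) / 3

print(round(jaro("Smith", "Simth"), 3))   # 0.933: Nc = 5, Nt = 2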

Page 69: Data Cleaning and Transformation

More on comparison functions – String comparators (3)

N-grams comparison function forms the set of all the substrings of length n for each string. The distance between the two strings is defined as the square root of the sum of the squared differences between the numbers of occurrences of each substring x in the two strings a and b.

Soundex code clusters together names that have similar sounds. For example, the Soundex code of "Hilbert" and "Heilbpr" is the same (H416).
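A short sketch of both: an n-gram distance over bigram counts (in the Euclidean form given above) and a basic Soundex encoder; both are simplified illustrations, not reference implementations.

# Illustrative n-gram distance and Soundex encoding.
from collections import Counter
from math import sqrt

def ngram_distance(a: str, b: str, n: int = 2) -> float:
    ca = Counter(a[i:i + n] for i in range(len(a) - n + 1))
    cb = Counter(b[i:i + n] for i in range(len(b) - n + 1))
    return sqrt(sum((ca[g] - cb[g]) ** 2 for g in set(ca) | set(cb)))

SOUNDEX_MAP = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
               **dict.fromkeys("dt", "3"), "l": "4",
               **dict.fromkeys("mn", "5"), "r": "6"}

def soundex(name: str) -> str:
    name = name.lower()
    digits = [SOUNDEX_MAP.get(c, "") for c in name]   # vowels, h, w, y get no digit
    out, prev = name[0].upper(), digits[0]
    for d in digits[1:]:
        if d and d != prev:          # drop repeated adjacent codes
            out += d
        if d:                        # simplification: uncoded letters do not reset prev
            prev = d
    return (out + "000")[:4]         # pad/truncate to letter + 3 digits

print(soundex("Hilbert"), soundex("Heilbpr"))     # H416 H416
print(round(ngram_distance("Batini", "Btini"), 3))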

Page 70: Data Cleaning and Transformation

More on comparison functions – String comparators (4)

TF-IDF (Term Frequency – Inverse Document Frequency) or cosine similarity assigns higher weights to tokens appearing frequently in a document (TF weight) and lower weights to tokens that appear frequently in the whole set of documents (IDF weight).

Widely used for matching strings in documents.

Page 71: Data Cleaning and Transformation

Main criteria of use

Comparator | Criteria of use
Hamming distance | Used primarily for numerical fixed size fields like Zip Code or SSN
Edit distance | Can be applied to variable length fields. To achieve reasonable accuracy, the modification costs are tuned for each string data set.
Jaro's algorithm and Winkler's improvement | The best one in several experiments
N-grams | Bigrams (n = 2) are effective with minor typographical errors
TF-IDF | Most appropriate for text fields or documents
Soundex code | Effective for dictation mistakes

Page 72: Data Cleaning and Transformation

Techniques (probabilistic, knowledge based, empirical)

Page 73: Data Cleaning and Transformation

List of techniques

Name | Main Reference | Type of strategy
Fellegi and Sunter family | [Fellegi 1969] | probabilistic
[Nigam 2000] | [Nigam 2000] | probabilistic
Cost based | [Elfeky 2002] | probabilistic
Induction | [Elfeky 2002] | probabilistic
Clustering | [Elfeky 2002] | probabilistic
Hybrid | [Elfeky 2002] | probabilistic
1-1 matching | [Winkler 2004] | probabilistic
Bridging file | [Winkler 2004] | probabilistic
Delphi | [Ananthakr. 2002] | empirical
XML object identification | [Weis 2004] | empirical
Sorted Neighbour and variants | [Hernandez 1995] [Bertolazzi 2003] | empirical
Instance functional dependencies | [Lim et al 1993] | knowledge based
Clue based | [Buechi 2003] | knowledge based
Active Atlas | [Tejada 2001] | knowledge based
Intelliclean and SN variant | [Low 2001] | knowledge based

Page 74: Data Cleaning and Transformation

Added techniques

Name | Main Reference | Type of strategy
Approximate string join | [Gravano 2001] | empirical
Text join | [Gravano 2003] | empirical
Flexible string matching | [Koudas 2004] | empirical
Approximate XML joins | [Guha 2002] | empirical

Page 75: Data Cleaning and Transformation

Empirical techniques

Page 76: Data Cleaning and Transformation

Sorted Neighbour searching method

1. Create key: compute a key K for each record in the list by extracting relevant fields or portions of fields. Relevance is decided by experts.
2. Sort data: sort the records in the data list using K.
3. Merge: move a fixed size window through the sequential list of records, limiting the comparisons for matching records to those records in the window. If the size of the window is w records, then every new record entering the window is compared with the previous w − 1 records to find "matching" records.
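A compact sketch of the three SNM steps; the key construction below (prefixes of last name, first name, address and ID) and the match test are illustrative stand-ins, not the exact key or equational theory from the slides.

# Hypothetical Sorted Neighbourhood Method: create key, sort, merge with a
# sliding window of size w.
from difflib import SequenceMatcher

people = [
    {"first": "Sal", "last": "Stolfo",  "addr": "123 First Street",  "id": "45678987"},
    {"first": "Sal", "last": "Stolpho", "addr": "123 First Street",  "id": "45678987"},
    {"first": "Sal", "last": "Stiles",  "addr": "123 Forest Street", "id": "45654321"},
]

def create_key(r):                                   # step 1: expert-chosen key
    return (r["last"][:3] + r["first"][:3] + r["addr"][:6] + r["id"][:3]).upper()

def is_match(a, b):                                  # stand-in for an equational theory
    sim = SequenceMatcher(None, create_key(a), create_key(b)).ratio()
    return sim > 0.9 or (a["id"] == b["id"] and a["addr"] == b["addr"])

def snm(records, w=2):
    ordered = sorted(records, key=create_key)        # step 2: sort by key
    matches = []
    for i, r in enumerate(ordered):                  # step 3: sliding window of size w
        for prev in ordered[max(0, i - (w - 1)):i]:
            if is_match(prev, r):
                matches.append((prev["last"], r["last"]))
    return matches

print(snm(people))    # e.g. [('Stolfo', 'Stolpho')]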

Page 77: Data Cleaning and Transformation

SNM: Create key

Compute a key for each record by extracting relevant fields or portions of fields.

Example:
First | Last | Address | ID | Key
Sal | Stolfo | 123 First Street | 45678987 | STLSAL123FRST456

Page 78: Data Cleaning and Transformation

SNM: Sort data

Sort the records in the data list using the key from step 1.

This can be very time consuming: O(N log N) for a good algorithm, O(N²) for a bad algorithm.

Page 79: Data Cleaning and Transformation

SNM: Merge records

Move a fixed size window through the sequential list of records.

This limits the comparisons to the records in the window

Page 80: Data Cleaning and Transformation

SNM: Considerations

What is the optimal window size that maximizes accuracy while minimizing computational cost?

Execution time for a large DB will be bound by:
- Disk I/O
- Number of passes over the data set

Page 81: Data Cleaning and Transformation

Selection of Keys

The effectiveness of the SNM highly depends on the key selected to sort the records

A key is defined to be a sequence of a subset of attributes

Keys must provide sufficient discriminating power

Page 82: Data Cleaning and Transformation

Example of records and keys

First | Last | Address | ID | Key
Sal | Stolfo | 123 First Street | 45678987 | STLSAL123FRST456
Sal | Stolfo | 123 First Street | 45678987 | STLSAL123FRST456
Sal | Stolpho | 123 First Street | 45678987 | STLSAL123FRST456
Sal | Stiles | 123 Forest Street | 45654321 | STLSAL123FRST456

Page 83: Data Cleaning and Transformation

Equational Theory

The comparison during the merge phase is an inferential process

Compares much more information than simply the key

The more information there is, the better inferences can be made

Page 84: Data Cleaning and Transformation

Equational Theory – Example

- Two names are spelled nearly identically and have the same address
  - It may be inferred that they are the same person
- Two social security numbers are the same but the names and addresses are totally different
  - Could be the same person who moved
  - Could be two different people and there is an error in the social security number

Page 85: Data Cleaning and Transformation

A simplified rule in English

Given two records, r1 and r2:
IF the last name of r1 equals the last name of r2,
AND the first names differ slightly,
AND the address of r1 equals the address of r2
THEN r1 is equivalent to r2
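A direct sketch of this rule as a predicate, with "differ slightly" approximated by an illustrative similarity threshold (the 0.8 value is an assumption, not taken from the slides).

# Hypothetical encoding of the simplified rule.
from difflib import SequenceMatcher

def differ_slightly(a: str, b: str, threshold: float = 0.8) -> bool:
    return a != b and SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def equivalent(r1: dict, r2: dict) -> bool:
    return (r1["last"] == r2["last"]
            and differ_slightly(r1["first"], r2["first"])
            and r1["addr"] == r2["addr"])

r1 = {"first": "Kathi", "last": "Kason", "addr": "48 North St."}
r2 = {"first": "Kathy", "last": "Kason", "addr": "48 North St."}
print(equivalent(r1, r2))   # True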

Page 86: Data Cleaning and Transformation

The distance function

- A "distance function" is used to compare pieces of data (usually text)
- Apply the "distance function" to data that "differ slightly"
- Select a threshold to capture obvious typographical errors
  - Impacts the number of successful matches and the number of false positives

Page 87: Data Cleaning and Transformation

Examples of matched records

SSN | Name (First, Initial, Last) | Address
334600443 / 334600443 | Lisa Boardman / Lisa Brown | 144 Wars St. / 144 Ward St.
525520001 / 525520001 | Ramon Bonilla / Raymond Bonilla | 38 Ward St. / 38 Ward St.
00 / 00 | Diana D. Ambrosion / Diana A. Dambrosion | 40 Brik Church Av. / 40 Brick Church Av.
789912345 / 879912345 | Kathi Kason / Kathy Kason | 48 North St. / 48 North St.
879912345 / 879912345 | Kathy Kason / Kathy Smith | 48 North St. / 48 North St.

Page 88: Data Cleaning and Transformation

Building an equational theory

The process of creating a good equational theory is similar to the process of creating a good knowledge base for an expert system.

In complex problems, an expert's assistance is needed to write the equational theory.

Page 89: Data Cleaning and Transformation

Transitive Closure

- In general, no single pass (i.e. no single key) will be sufficient to catch all matching records
- An attribute that appears first in the key has higher discriminating power than those appearing after it
  - If an employee has two records in a DB with SSNs 193456782 and 913456782, it is unlikely they will fall under the same window

Page 90: Data Cleaning and Transformation

Transitive Closure

To increase the number of similar records merged:
- Widen the scanning window size w
- Execute several independent runs of the SNM
  - Use a different key each time
  - Use a relatively small window
  - Call this the multi-pass approach

Page 91: Data Cleaning and Transformation

Transitive Closure

- Each independent run of the multi-pass approach will produce a set of pairs of records
- Although one field in a record may be in error, another field may not
- Transitive closure can be applied to those pairs to decide which records to merge

Page 92: Data Cleaning and Transformation

Multi-pass Matches

Pass 1 (last name discriminates):
KSNKAT48NRTH789 (Kathi Kason 789912345)
KSNKAT48NRTH879 (Kathy Kason 879912345)

Pass 2 (first name discriminates):
KATKSN48NRTH789 (Kathi Kason 789912345)
KATKSN48NRTH879 (Kathy Kason 879912345)

Pass 3 (address discriminates):
48NRTH879KSNKAT (Kathy Kason 879912345)
48NRTH879SMTKAT (Kathy Smith 879912345)

Page 93: Data Cleaning and Transformation

Transitive Equality Example

IF A implies B AND B implies C THEN A implies C

From the example:
789912345 Kathi Kason 48 North St. (A)
879912345 Kathy Kason 48 North St. (B)
879912345 Kathy Smith 48 North St. (C)
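A brief sketch of how the pairs found by the independent passes could be merged by transitive closure, using a union-find structure; the record labels follow the example above.

# Illustrative transitive closure over match pairs via union-find:
# if A matches B and B matches C, then A, B and C end up in one cluster.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path compression
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

pairs = [("A", "B"),   # found in passes 1 and 2 (Kathi Kason ~ Kathy Kason)
         ("B", "C")]   # found in pass 3 (Kathy Kason ~ Kathy Smith)
for x, y in pairs:
    union(x, y)

clusters = {}
for rec in ("A", "B", "C"):
    clusters.setdefault(find(rec), []).append(rec)
print(list(clusters.values()))   # [['A', 'B', 'C']]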

Page 94: Data Cleaning and Transformation

KB techniques – Types of knowledge considered

Paper | Name of technique | Type of knowledge
[Lim et al 1993] | Instance level functional dependencies | Instance level functional dependencies
[Low et al 2001] | Intelliclean | Rules for duplicate identification and for merging records
[Tejada 2001] | Active Atlas | Importance of the different attributes for deciding a mapping; transformations among values that are relevant for the mappings
[Buechi et al 2003] | Clue based | Clues, i.e. domain dependent properties of the data to be linked

Page 95: Data Cleaning and Transformation

Intelliclean [Low 2001]

Two files/knowledge based.
- Exploits rules (see example) for duplicate identification and for merging records
- Rules are fed into an expert system engine
- The pattern matching and rule activation mechanisms make use of an efficient method for comparing a large collection of patterns to a large collection of objects

Page 96: Data Cleaning and Transformation

Example of rule

Page 97: Data Cleaning and Transformation

Intelliclean Sorted Neighbour

Two files/knowledge based. Improves multi-pass SN by attaching a confidence factor to rules and applying the transitive closure selectively.

Page 98: Data Cleaning and Transformation

Probabilistic techniques

Page 99: Data Cleaning and Transformation

Fellegi and Sunter family

Two files/probabilistic.
- Compute in some way the conditional probabilities of matching/not matching based on an agreement pattern (comparison function)
- Establish two thresholds for deciding matching and non-matching, based on a priori error bounds on false matches and false non-matches
- Also called error-based techniques
- Pros: well-founded technique; widely applied, tuned and optimized
- Cons: in most cases, training data are needed for good estimates

Page 100: Data Cleaning and Transformation

Fellegi and Sunter RL family of models – 1

For the two data sources A and B, the set of pairs A x B is mapped into two sets, M (matched), where a = b, and U (unmatched) otherwise.

Having in mind that it is always better to classify a record pair as a possible match than to falsely decide on its matching status, a third set P, called possible matches, is defined. In case a record pair is assigned to P, a domain expert should manually examine this pair.

The problem, then, is to determine which set each record pair belongs to, in the presence of errors.

A comparison function C compares the values of pairs of records and produces a comparison vector.

Page 101: Data Cleaning and Transformation

Fellegi and Sunter RL family of models – 2

We may then assign a weight to each component of the record pair, calculate the composite weight (CW), and compare it against two thresholds T1 and T2, assigning the pair to M, U, or P (not assigned) according to the relative value of the composite weight wrt the two thresholds. In other words:
- If CW > T2 then designate the pair as a match (M)
- If CW < T1 then designate the pair as an unmatch (U)
- Otherwise designate it as not assigned (P)
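A small sketch of this decision rule; the per-field agreement probabilities, the log-ratio weighting and the thresholds are illustrative choices, not values from any real survey or from the slides.

# Illustrative Fellegi–Sunter style decision: per-field agreement weights are
# log-ratios of m = P(agree | Match) and u = P(agree | Unmatch); the composite
# weight is compared with thresholds T1 < T2. All numbers are made up.
from math import log2

fields = {          # field: (m probability, u probability) — illustrative
    "last":  (0.95, 0.10),
    "first": (0.90, 0.15),
    "addr":  (0.85, 0.05),
}

def composite_weight(agreement: dict) -> float:
    w = 0.0
    for f, agrees in agreement.items():
        m, u = fields[f]
        w += log2(m / u) if agrees else log2((1 - m) / (1 - u))
    return w

def decide(cw: float, t1: float = 0.0, t2: float = 6.0) -> str:
    return "M" if cw > t2 else "U" if cw < t1 else "P"

pair = {"last": True, "first": True, "addr": False}    # agreement pattern for one pair
cw = composite_weight(pair)
print(round(cw, 2), decide(cw))   # e.g. 3.17 P -> sent to clerical review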

Page 102: Data Cleaning and Transformation

Fellegi and Sunter RL family of models – 3

The thresholds can be estimated from a priori error bounds on false matches and false non-matches. This is the reason why the model is called error-based.

Fellegi and Sunter proved that this decision procedure is optimal, in the sense that, for fixed rates of false matches and false non-matches, the clerical review region is minimized.

Page 103: Data Cleaning and Transformation

Fellegi and Sunter RL family of models – 4

The procedure has been investigated under different assumptions and for different comparison functions. This is the reason for the name "family of models".
- E.g., when the comparison function is a simple agree/disagree pattern for three variables, if the variables are independent (independence assumption), we may express the probabilities that the records match/do not match as products of elementary probabilities, and solve the system of equations in closed form.

Page 104: Data Cleaning and Transformation

More on the independence assumption

Independence assumption – to make computation more tractable, FS made a conditional independence assumption that corresponds to the naïve Bayes assumption for Bayesian networks:

P(agree on first name, agree on last name | C1) = P(agree on first name | C1) · P(agree on last name | C1)

Under this assumption FS provided a method for computing the probabilities in the 3-variable case. Winkler developed a variant of the EM algorithm for a more general case.

[Nigam 2000] has shown that classification decision rules based on naïve Bayes (independence assumption) work well in practice.

Dependence assumption – various authors have observed that the computed probabilities do not even remotely correspond to the true underlying probabilities.

Page 105: Data Cleaning and Transformation

Expectation maximization model

An expectation-maximization (EM) algorithm is an algorithm for finding maximum likelihood estimates of parameters (in our case, the probabilities of match and non-match) in probabilistic models that depend on unobserved (latent) variables.

EM alternates between an expectation (E) step, which computes the expected value of the latent variables, and a maximization (M) step, which computes the maximum likelihood estimates of the parameters given the data, with the latent variables set to their expectation.
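A condensed sketch of EM for estimating match/unmatch parameters from binary agreement vectors under the conditional independence assumption; the toy data and starting values are illustrative, and no claim is made that this matches Winkler's exact variant.

# Illustrative EM for a two-class (Match / Unmatch) mixture over binary
# agreement vectors, under conditional independence.
vectors = [(1, 1, 1), (1, 1, 0), (0, 0, 0), (0, 1, 0), (0, 0, 0), (1, 1, 1)]
p = 0.3                       # P(Match), initial guess
m = [0.8, 0.8, 0.8]           # P(agree on field i | Match), initial guesses
u = [0.3, 0.3, 0.3]           # P(agree on field i | Unmatch), initial guesses

def likelihood(vec, probs):
    out = 1.0
    for agree, q in zip(vec, probs):
        out *= q if agree else (1 - q)
    return out

for _ in range(50):
    # E step: posterior probability that each vector comes from the Match class
    g = []
    for v in vectors:
        a, b = p * likelihood(v, m), (1 - p) * likelihood(v, u)
        g.append(a / (a + b))
    # M step: re-estimate p, m, u from the expected class memberships
    p = sum(g) / len(vectors)
    for i in range(3):
        m[i] = sum(gi * v[i] for gi, v in zip(g, vectors)) / sum(g)
        u[i] = sum((1 - gi) * v[i] for gi, v in zip(g, vectors)) / sum(1 - gi for gi in g)

print(round(p, 2), [round(x, 2) for x in m], [round(x, 2) for x in u])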

Page 106: Data Cleaning and Transformation

Agenda

- Introduction
- Data (Quality) Problems
- Data Cleaning and Transformation
- Data Quality Dimensions
- Techniques
  - Object Identification
  - Data integration/fusion
- Tools

Page 107: Data Cleaning and Transformation

Tools – table of contents
- DQ problems revisited
- Generic functionalities
- Tool categories
- Addressing data quality problems

Page 108: Data Cleaning and Transformation

Existing technology

How to achieve data quality:
- Ad-hoc programs
- RDBMS mechanisms for guaranteeing integrity constraints
- Data transformation scripts using a data quality tool

Page 109: Data Cleaning and Transformation

Data quality problems (1/3)

- Schema level data quality problems: prevented with better schema design, schema translation and integration
- Instance level data quality problems: errors and inconsistencies of data that are not prevented at the schema level

Page 110: Data Cleaning and Transformation

Data quality problems (2/3)

Schema level data quality problems

Avoided by an RDBMS:
- Missing data – product price not filled in
- Wrong data type – "abc" in product price
- Wrong data value – 0.5 in product tax (iva)
- Dangling data – category identifier of product does not exist
- Exact duplicate data – different persons with same ssn
- Generic domain constraints – incorrect invoice price

Not avoided by an RDBMS:
- Wrong categorical data – countries and corresponding states
- Outdated temporal data – just-in-time requirement
- Inconsistent spatial data – coordinates and shapes
- Name conflicts – person vs person or person vs client
- Structural conflicts – addresses

Page 111: Data Cleaning and Transformation

Data quality problems (3/3)

Instance level data quality problems

Single record:
- Missing data in a not null field – ssn: -9999999
- Erroneous data – price: 5 but real price: 50
- Misspellings: José Maria Silva vs José Maria Sliva
- Embedded values: Prof. José Maria Silva
- Misfielded values: city: Portugal
- Ambiguous data: J. Maria Silva; Miami: Florida or Ohio

Multiple records:
- Duplicate records: Name: José Maria Silva, Birth: 01/01/1950 and Name: José Maria Sliva, Birth: 01/01/1950
- Contradicting records: Name: José Maria Silva, Birth: 01/01/1950 and Name: José Maria Silva, Birth: 01/01/1956
- Non-standardized data: José Maria Silva vs Silva, José Maria

Page 112: Data Cleaning and Transformation

Generic functionalities (1/2)

- Data sources – extract data from different and heterogeneous data sources
- Extraction capabilities – schedule extracts; select data; merge data from multiple sources
- Loading capabilities – multiple types of heterogeneous targets; refresh and append; automatically create tables
- Incremental updates – update instead of rebuilding from scratch
- User interface – graphical vs non-graphical
- Metadata repository – stores data schemas and transformations

Page 113: Data Cleaning and Transformation

Generic functionalities (2/2)

- Performance techniques – partitioning; parallel processing; threads; clustering
- Versioning – version control of data quality programs
- Function library – pre-built functions
- Language binding – language support to define new functions
- Debugging and tracing – document the execution; tracking and analyzing to detect problems and refine specifications
- Exception detection and handling – set of input rules for which the execution of the data quality program fails
- Data lineage – identifies the input records that produced a given data item

Page 114: Data Cleaning and Transformation

Generic functionalities – table of tools

[Table (not recoverable from the slide text): commercial and research tools vs. the generic functionalities above.]

Page 115: Data Cleaning and Transformation

Tool categories

- Data analysis – statistical evaluation, logical study and application of data mining algorithms to identify data patterns and rules
- Data profiling – analysing data sources to identify data quality problems
- Data transformation – set of operations that source data must undergo to fit the target schema
- Data cleaning – detecting, removing and correcting dirty data
- Duplicate elimination – identifies and merges approximate duplicate records
- Data enrichment – use of additional information to improve data quality

Page 116: Data Cleaning and Transformation

Categories

[Table: tools (Ajax, Arktos, Clio, Flamingo Project, FraQL, IntelliClean, Kent State University, Potter's Wheel, Transcm) vs. the six categories above; the individual check marks are not recoverable from the slide text.]

Page 117: Data Cleaning and Transformation

More on research DQ tools

Reference | Name | Main features
[Raman et al 2001] | Potter's Wheel | Tightly integrates transformations and discrepancy/anomaly detection
[Caruso et al 2000] | Telcordia's tool | Record linkage tool, parametric wrt distance functions and matching functions
[Galhardas et al 2001] | Ajax | Declarative language based on five logical transformation operators; separates a logical level and a physical level
[Vassiliadis 2001] | Arktos | Covers all aspects of ETL processes (architecture, activities, quality management) with a unique metamodel
[Elfeky et al 2002] | Tailor | Framework for the analysis and comparison of existing and new techniques
[Buechi et al 2003] | Choicemaker | Uses clues that allow rich expression of the semantics of data
[Low et al 2001] | Intelliclean | See the section on techniques

Page 118: Data Cleaning and Transformation

Data Fusion: Instance conflict resolution

Page 119: Data Cleaning and Transformation

Instance conflict resolution & merging

Conflicts occur due to poor data quality:
- Errors exist when collecting or entering data
- Sources are not updated

Three types:
- representation conflicts, e.g. dollar vs. euro
- key equivalence conflicts, i.e. same real world objects with different identifiers
- attribute value conflicts, i.e. instances corresponding to the same real world object and sharing an equivalent key differ on other attributes

Page 120: Data Cleaning and Transformation

Example

Source 1:
EmpId | Name | Surname | Salary | Email
arpa78 | John | Smith | 2000 | [email protected]
eugi98 | Edward | Monroe | 1500 | [email protected]
ghjk09 | Anthony | Wite | 1250 | [email protected]
treg23 | Marianne | Collins | 1150 | [email protected]

Source 2:
EmpId | Name | Surname | Salary | Email
arpa78 | John | Smith | 2600 | [email protected]
eugi98 | Edward | Monroe | 1500 | [email protected]
ghjk09 | Anthony | White | 1250 | [email protected]
dref43 | Marianne | Collins | 1150 | [email protected]

Key conflicts: the same person, Marianne Collins, has identifier treg23 in one source and dref43 in the other.
Attribute conflicts: for the same key, Salary differs (2000 vs 2600 for arpa78) and Surname differs (Wite vs White for ghjk09).

Page 121: Data Cleaning and Transformation

Techniques

Conflicts can be handled at design time or at query time, even though actual conflicts only show up at query time.

Key conflicts require object identification techniques

There are several techniques for dealing with attribute conflicts

Page 122: Data Cleaning and Transformation

General approach for solving attribute-level conflicts

Declaratively specify how to deal with conflicts:
- A set of conflict resolution functions, e.g. count(), min(), max(), random(), etc.
- A set of strategies to deal with conflicts, corresponding to different tolerance degrees
- A query model that takes possible conflicts directly into account
  - Use SQL directly or extensions to it

Page 123: Data Cleaning and Transformation

Conflict resolution and result merging: list of proposals
- SQL-based conflict resolution (Naumann et al 2002)
- Aurora (Ozsu et al 1999)
- (Fan et al 2001)
- FraSQL-based conflict resolution (Sattler et al 2003)
- FusionPlex (Motro et al 2004)
- OOra (Lim et al 1998)
- DaQuinCIS (Scannapieco 2004)
- (Wang et al 2000)
- ConQuer (Fuxman 2005)
- (Dayal 1985)


Page 125: Data Cleaning and Transformation

Declarative (SQL-based) data merging with conflict resolution [Naumann 2002]

Problem: merging data in an integrated database or in a query against multiple sources.

Solution:
- a set of resolution functions (e.g. MAX, MIN, RANDOM, GROUP, AVG, LONGEST)
- declarative merging of relational data sources by common queries through:
  - grouping queries
  - join queries (divide et impera)
  - nested joining (one source at a time, hierarchical)


Page 127: Data Cleaning and Transformation

Grouping queries

SELECT isbn, LONGEST(title), MIN(price)
FROM books
GROUP BY isbn

If multiple sources are involved, with potential duplicates:

SELECT books.isbn, LONGEST(books.title), MIN(books.price)
FROM (
  SELECT * FROM books1
  UNION
  SELECT * FROM books2
) AS books
GROUP BY books.isbn

Advantages: good performance
Drawback: user-defined aggregate functions (such as LONGEST) are not supported by most RDBMSs

Page 128: Data Cleaning and Transformation

Join queries (1)

Dismantle a query result that merges multiple sources into different parts, resolve conflicts within each part, and union the parts into a final result.

Partitions the union of the two sources into:
- The intersection of the two:

Query 1
SELECT books1.isbn, LONGEST(books1.title, books2.title), MIN(books1.price, books2.price)
FROM books1, books2
WHERE books1.isbn = books2.isbn

Page 129: Data Cleaning and Transformation

Join queries (2)

- The part of books1 that does not intersect with books2 (Query 2a or 2b), and vice-versa:

Query 2a
SELECT isbn, title, price
FROM books1
WHERE books1.isbn NOT IN
  (SELECT books1.isbn
   FROM books1, books2
   WHERE books1.isbn = books2.isbn)

Query 2b
SELECT isbn, title, price
FROM books1
WHERE books1.isbn NOT IN
  (SELECT books2.isbn
   FROM books2)

Advantages: use of the built-in scalar functions available in the RDBMS
Drawbacks: length and complexity of the queries

Page 130: Data Cleaning and Transformation

Advantages and drawbacks

Pros: the proposal considers the use of SQL, thus exploiting the capabilities of existing database systems, rather than proposing proprietary solutions.

Cons: limited applicability of conflict resolution functions, which often need to be domain-specific.

Page 131: Data Cleaning and Transformation

References

- "Data Quality: Concepts, Methodologies and Techniques", C. Batini and M. Scannapieco, Springer-Verlag, 2006.
- "Duplicate Record Detection: A Survey", A. Elmagarmid, P. Ipeirotis, V. Verykios, IEEE Transactions on Knowledge and Data Engineering, Vol. 19, No. 1, Jan. 2007.
- "A Survey of Data Quality Tools", J. Barateiro, H. Galhardas, Datenbank-Spektrum 14: 15-21, 2005.
- "Declarative Data Merging with Conflict Resolution", F. Naumann and M. Haeussler, IQ 2002.