
SIGMOD 2000 Paper Id: 128

Potter's Wheel: An Interactive Framework for Data Cleaning and Transformation

Vijayshankar Raman and Joseph M. Hellerstein

University of California, Berkeley

Abstract

Real-world data often has discrepancies in structure and content. Traditional methods for "cleaning" the data involve many iterations of time-consuming "data quality" analysis to find discrepancies, and long-running transformations to fix them. This process requires users to endure long waits and often write complex transformation programs.

We present an interactive framework for data cleaning that tightly integrates transformation and discrepancy detection. Users gradually build transformations by adding or undoing transforms, in an intuitive, graphical manner through a spreadsheet-like interface; the effect of a transform is shown at once on the records visible on screen. Discrepancy detection is done incrementally in the background on the current transformed version of the data, and discrepancies are flagged as they are found. This interactive combination of discrepancy detection and transformation allows users to gradually develop transformations as discrepancies are found, and to clean the data without having to write complex programs or endure long waits. Balancing the goals of power, ease of specification, and interactive application, we develop a set of transforms that can be used for transformations within data records as well as for higher-order transformations. We also study the compilation of sequences of transforms into optimized programs.

1 Introduction

Real-world data often has inconsistencies in schema, data formats, spellings, and adherence to constraints [4, 20, 13]. These can arise from a variety of causes such as merging from multiple sources and data entry errors. Organizations often want to consolidate data stored in different databases for ease of access and decision making, but they must first clean it of inconsistencies and transform it into a uniform format. Data cleaning is one of the key challenges in data warehousing [4, 13]. Data transformation is also needed for extracting data from legacy data formats, and for Business-to-Business Enterprise Data Integration (B-B EDI) where two different organizations want to access each other's data and need it to be in a common format. In this paper, we present an interactive framework for data cleaning and transformation. We first look at the drawbacks of current solutions.

1.1 Problems with Current Approaches to Data Cleaning

Data cleaning has three components: analyzing the data to find discrepancies, choosing transformations to fix these, and applying them to the data set. The current approach uses a combination of analysis tools and transformation tools. The low-end solution uses data mining/machine learning algorithms (e.g. [27, 22, 3]) for finding anomalies, and scripting languages like Perl for transformation. The sophisticated solution uses "data quality analysis" tools like ACR/Data and Migration Architect [1, 12] for discrepancy detection, and "data mapping" or "ETL" (Extraction/Transformation/Loading) tools like Data Stage and CoSort [11, 8] to transform data. The user must run an analysis tool to find errors, then use an ETL tool to develop and apply transformations, and repeat until the "data quality" is good enough. This method has three problems: lack of interactivity, decoupling of transformation and anomaly detection, and clunky interfaces.

First, both discrepancy detection and transformation are typically implemented as batch, long-running processes, operating on the whole dataset without giving any intermediate feedback. This leads to long, frustrating delays during which the user has no idea whether a transformation is effective.

Second, many iterations of this long process are needed because the data often has many hard-to-find special cases. Transformation and discrepancy detection are typically done as separate steps, often using separate software¹. So users have to wait for the transformation to finish before they can check whether it has fixed all anomalies. More importantly, the decoupling makes it hard to find multiple discrepancies in one pass (as we will see below, the existence of some errors makes others hard to find), thus forcing more iterations.

Third, the languages used for specifying transformations are often quite non-trivial, whether they are scripting languages like Perl or declarative languages like SchemaSQL [24]. Even ETL tools often support only some restricted transformations between a small set of formats via a GUI, and require users to write programs using conversion libraries for other transformations (e.g. Data Builder, Data Junction's CDI SDK and DJXL [9, 10]). This worsens the first problem, because errors in a program are not caught until the entire data has been transformed and rechecked for discrepancies. Moreover, it is not easy to write "compensatory scripts" to undo an erroneous transformation. Some transforms² based on regular expressions cannot be undone in general. So users need to maintain and keep track of multiple versions of the data. This is annoying, especially for large datasets.

1.2 Potter's Wheel³ Approach

There is no magic algorithm to automate data cleaning. A solution needs a combination of good system architecture and a good user interface, designed with human-computer interaction in mind: human input is essential in the analyze/transform loop, to act on discrepancies and select transformations.

In this paper, we present an approach that integrates transformation and discrepancy detection in a single interactive framework. Users gradually build transformations by composing and debugging transforms one step at a time on a spreadsheet-like interface shown in Figure 1 (the details of this interface will become clear as we present the rest of the paper). Transforms are specified graphically, their effect is shown immediately on records visible on screen, and they can be undone easily if their effects are undesired. This interactive approach improves the accuracy of transformations, obviates writing complex programs, and frees the user from the burden of managing multiple versions of data. Once the user has developed a satisfactory sequence of transforms, she can ask the system to generate an optimized program (in C or Perl) that can be run on the dataset as a batch, unsupervised process.

Discrepancy detection is done incrementally in the background, on the latest transformed version of the data, thereby allowing users to develop and refine transformations as discrepancies are flagged. By thus integrating transformation and discrepancy detection, and making both interactive, we make data cleaning a tight, closed loop with no long delays. In contrast, if transformation and discrepancy detection were separate programs operating independently, only a few discrepancies could be caught in each analysis pass, because the existence of some discrepancies makes others hard to find. For example, if names are stored as one field "lastname, firstname" in some rows and as two separate fields in other rows, a discrepancy detection algorithm is likely to find only the format difference, and not catch dependency violations involving last names.

1.3 Transforms and their Optimization

Our goals for transforms are that they be easy to specify graphically, flexible enough to be applied interactively, and at the same time powerful enough that most common transformations can be done without user programming. We have chosen a set of transforms to balance these goals, and demonstrate that these transforms can be used for transformations within data records as well as for higher-order transformations that resolve schematic heterogeneities [28]. These transforms can also be used for conversions between some nested (self-describing) and flat data formats.

¹ Even vendors like Ardent that provide both ETL and "quality analysis" software provide them as two pieces of a suite; the user is expected to do discrepancy detection and transformation as separate stages [11].
² We use transform as a noun to denote a single operation, and transformation as a noun to denote a sequence of operations.
³ Our technique for cleaning data resembles that of a potter molding clay on a wheel. The potter incrementally shapes clay by applying pressure at a point, just as the user incrementally constructs transformations by applying transforms on example rows.

Figure 1: A snapshot of the Potter's Wheel User Interface

Potter's Wheel converts the entire sequence of transforms into a transformation program after the user is satisfied, instead of applying them piecemeal over many iterations. Users often specify/undo transforms in an order that is intuitive to them, resulting in unnecessary or sub-optimal transformations. Hence the final sequence of transforms can be optimized. We have begun studying some simple optimization techniques, and present initial results in Section 5.

1.4 Discrepancy Detection

There are many data mining/machine learning algorithms (e.g. [27, 22, 3]) for finding discrepancies in data. However, domain-specific algorithms are also often needed; e.g. a user may want to find errors in chemical formulae. Potter's Wheel aims to provide a simple, extensible framework in which different algorithms can be incorporated, to detect structural as well as semantic discrepancies. It allows users to plug in domain-based algorithms that are automatically invoked on any value from a given domain, as well as field-specific algorithms that are applied on selected columns. The framework supports incremental algorithms that run on increasingly larger random samples of the data, finding more discrepancies over time. This allows users to gradually catch and fix discrepancies by applying transformations.

1.5 Outline

We describe the architecture of Potter's Wheel in Section 2. We develop a set of transforms that are easy to specify and apply interactively, and yet powerful, in Section 3. We study various kinds of discrepancies and develop an extensible detection mechanism in Section 4. In Section 5 we describe our initial work on optimizing sequences of transforms. We look at related work in Section 6 and conclude in Section 7.

2 Potter's Wheel Architecture

The main components of the Potter's Wheel architecture (Figure 2) are a Data Source, a Transformation Engine that applies transforms along two paths, an Online Reorderer to support interactive scrolling and sorting at the user interface [32, 31], and an Incremental Discrepancy Detector. We proceed to discuss these in turn.

2.1 Data Source

Potter's Wheel can accept data to be cleaned from an ODBC source or an ASCII file. The ODBC source can be used to access data from multiple tables via SQL queries, or even from multiple data sources via middleware (e.g. [33]). Clearly, schematic differences between sources will restrict the tightness of the integration via a query, as we will see in many examples in Section 3 (even Figure 1 shows poor mapping in the Source and Destination columns). Potter's Wheel will find the areas of poor integration as discrepancies, and the user can transform the data, moving data values across columns to unify the data format, as we show in Section 3. In the future we want to develop a graphical interface for specifying a query to fetch data from multiple sources and to specify a mapping between their schemas (along the lines of Clio [17]), and to incorporate techniques for handling the "merge/purge" problem of catching approximate duplicates [7, 20].

When accessing records from ASCII files, each record is viewed as a single large column. The user can identify column delimiters graphically and split the record into constituent columns using the Split transform of Section 3. Alternatively, column delimiters can be specified in a metadata file.

2.2 Transformation Engine

Transforms specified by the user need to be applied in two places. First, they need to be applied to records visible on screen. With the spreadsheet user interface this is done whenever the user scrolls or jumps to a new scrollbar position. Since the number of rows that can be displayed on screen at a time is small, users perceive transformations as being instantaneous (this clearly depends on the nature of the transforms; we return to this issue in Section 3). Second, transforms need to be applied to records used for discrepancy detection because, as argued earlier, we want to check for discrepancies on transformed versions of data. However, the system never changes the actual data records in this process; it merely changes the displayed records. The most significant advantage of this strategy is an UNDO capability. Many transforms (such as regular-expression-based substitutions) have no compensating transforms, so modifying data in place would make the changes permanent. UNDOs are crucial for users to easily experiment with different transforms without maintaining multiple copies of the data. Second, by not changing the records we avoid having to mark the transforms that have been applied on each tuple (in order to avoid applying a transform twice). Finally, collecting transforms and applying them on the dataset at the end allows us to optimize their application, as we describe in Section 5.

A disadvantage of this strategy is that we may transform the same tuple multiple times if the user scrolls back and forth to the same point, but we only need to transform a screen-full of tuples at the scrolling and thinking speed of the user. When the user is satisfied with the transformation, the system generates a program to execute it on the dataset. Since the complexity of transformation is usually at least linear in the data size, it will have to be a batch process for large data sizes. Our goal is to make interactive the process of developing the right transformation for cleaning; the generated program is optimized for speed (Section 5) and can be run unsupervised on the data. This program could also be used as a wrapper [33] on the data source for subsequent accesses.

2.3 Interface used for Displaying Data

Figure 2: Potter's Wheel Framework

Our user interface is a Scalable Spreadsheet [31] that allows users to interactively re-sort on any column, and scroll in a representative sample of the data, even over large datasets. The interface supports this behavior using an Online Reorderer [32] that continually fetches tuples from the source and divides them into buckets based on a (dynamically computed) histogram on the sort column, spooling them to a side disk if needed. When the user scrolls to a new region, the reorderer picks a sample of tuples from the bucket corresponding to the scrollbar position and displays them on the screen⁴.

⁴ When a discrepancy must be flagged to the user, we fetch a page having a discrepant row, not an arbitrary random sample.

We use this interface because it allows users to explore large amounts of data along any dimension. We

believe that, besides using specialized algorithms, users can also spot discrepancies by exploring example

data values and seeing their structure as a dimension changes. This interface also allows users to construct

transformations without waiting until the data has been completely fetched from the source.

2.4 Incremental Discrepancy Detection

A discrepancy detection algorithm must be incremental so that users can add transforms to fix discrepancies as they are found. Any detection algorithm is fed a stream of transformed tuples, randomly sampled without replacement from the original data set. This allows the algorithm to work in an incremental, probabilistic fashion, with more and more discrepancies found over time (as in [22]).

Providing a continuous random stream: Often data sources are remote and provide only a single stream of values. We assume that the tuples come in random (without replacement) order from the source; otherwise the data would have to be randomly sorted first, or we would have to fetch data using a secondary index built on a random-number field. Note that this is not a constraint of our framework, but is typically necessary for discrepancy detection algorithms to give continually improving results. Hence tuples fetched from the source are transformed and sent to the discrepancy detection algorithms, in addition to being sent to the Online Reorderer. In Section 4 we explain how suitable discrepancy detection algorithms are applied to the values in each tuple.

When the user asks to see discrepancies, the system scrolls to a range having a discrepancy and highlights the discrepant row on the screen.

3 Transforms Supported by Potter's Wheel

There is a tension between three desiderata for a set of transforms:

• Ease of specification: We want a transform to be simple and easy to specify via intuitive, GUI-based direct-manipulation [35] operations, and not use complex transformation languages.

• Power: At the same time, the transforms must be powerful and flexible enough that most practical needs can be met by composing a small number of transforms, without explicit user programming.

• Ease of interactive application: To avoid long delays and give users early feedback on the effect of a transform, we want to be able to apply transforms instantaneously to the records visible on the screen of the spreadsheet.

Transform: Definition
Format: $\phi(R, i, f) = \{(a_1, \ldots, a_{i-1}, a_{i+1}, \ldots, a_n, f(a_i)) \mid (a_1, \ldots, a_n) \in R\}$
Add: $\alpha(R, x) = \{(a_1, \ldots, a_n, x) \mid (a_1, \ldots, a_n) \in R\}$
Drop: $\pi(R, i) = \{(a_1, \ldots, a_{i-1}, a_{i+1}, \ldots, a_n) \mid (a_1, \ldots, a_n) \in R\}$
Copy: $\kappa(R, i) = \{(a_1, \ldots, a_n, a_i) \mid (a_1, \ldots, a_n) \in R\}$
Merge: $\mu(R, i, j, glue) = \{(a_1, \ldots, a_{i-1}, a_{i+1}, \ldots, a_{j-1}, a_{j+1}, \ldots, a_n, a_i \oplus glue \oplus a_j) \mid (a_1, \ldots, a_n) \in R\}$
Split: $\omega(R, i, splitter) = \{(a_1, \ldots, a_{i-1}, a_{i+1}, \ldots, a_n, left(a_i, splitter), right(a_i, splitter)) \mid (a_1, \ldots, a_n) \in R\}$
Divide: $\delta(R, i, predicate) = \{(a_1, \ldots, a_{i-1}, a_{i+1}, \ldots, a_n, a_i, null) \mid (a_1, \ldots, a_n) \in R \wedge predicate(a_i)\} \cup \{(a_1, \ldots, a_{i-1}, a_{i+1}, \ldots, a_n, null, a_i) \mid (a_1, \ldots, a_n) \in R \wedge \neg predicate(a_i)\}$
Fold: $\lambda(R, i_1, i_2, \ldots, i_k) = \{(a_1, \ldots, a_{i_1-1}, a_{i_1+1}, \ldots, a_{i_2-1}, a_{i_2+1}, \ldots, a_{i_k-1}, a_{i_k+1}, \ldots, a_n, a_{i_l}) \mid (a_1, \ldots, a_n) \in R \wedge 1 \le l \le k\}$
Filter: $\sigma(R, predicate) = \{(a_1, \ldots, a_n) \mid (a_1, \ldots, a_n) \in R \wedge predicate((a_1, \ldots, a_n))\}$

Notation: R is a relation with n columns; i, j are column indices and $a_i$ represents the value of a column in a row; x and glue are values; f is a function mapping values to values; $x \oplus y$ concatenates x and y; splitter is a position in a string or a regular expression, and $left(x, splitter)$ is the left part of x after splitting by splitter; predicate is a function returning a boolean.

Table 1: Definitions of the different transforms. Unfold is defined in Section 3.4 since it is quite complex.

Unfortunately, existing transformation languages and visual interfaces address these goals only partially. The research literature on declarative transformation languages (e.g. [24, 5]) looks at linguistic power but typically not at ease of specification or interactive application. By contrast, ETL tools that provide visual transformation interfaces usually allow only a restricted set of transformations between a small set of input and output formats, and other transformations must be programmed using conversion libraries (e.g. [9, 11]). Moreover, these transformations are designed to run in batch mode and not for interactive application.

We will see below that these goals cannot all be met completely. To help make a judicious choice of transforms reconciling these goals, we divide them into transformations of individual data values, one-row-to-one-row mappings (vertical transforms), and mappings of multiple rows (horizontal transforms). This classification simplifies our handling of the goal of interactive application. Clearly, individual value transforms, vertical transforms, and one-to-many horizontal transforms can be performed interactively, one row at a time. Hence they can be instantaneously applied to the records on the screen of the spreadsheet. We discuss interactive application of transforms that map from multiple rows in Section 3.4.

We present below the transforms we provide in each class, followed by a discussion of how well they meet our goals. In the interest of readability our presentation is example-driven; formal definitions are given in Table 1 and proofs of expressive power are given in Appendix A.

3.1 Format Column: Transformation of Individual Data Values

Format applies a function to the value of a column in every row. We choose functions that users are familiar with and often use: regular-expression-based substitutions a la Perl (including back-references), and arithmetic operations. To support higher-order transformations, we allow demoting of column and table names into column values, using the special characters \C and \T in regular expressions; e.g. a value "George" in a Name column can be Format-ed into a self-describing representation "<Name> George </Name>" using the substitution ".*" to "<\C> \1 </\C>". Finally, we allow user-defined functions (UDFs) to be applied to handle situations involving complex types or specialized transformations (e.g. converting street name + ZIP code to the expanded form ZIP code + 4).

Figure 3: Merge and Split in action

Figure 4: The Divide transform in action

Figure 5: Higher-order differences in data

Discussion of Format transform:

Ease of specification: The UI for Format is simple: the user selects a column in the spreadsheet and specifies parameters for the formatting. These are "to" and "from" expressions for regular expression substitutions, infix expressions for arithmetic operations, and external routines for UDFs.

Power: Clearly, UDFs allow Format to perform all transformations on individual data values.
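To make the semantics concrete, here is a minimal Python sketch of a regex-based Format (the helper name is ours; the real system generates C or Perl and also supports arithmetic and UDFs). It shows the \C demotion described above; unlike the definition in Table 1, it keeps the formatted column in place for readability.

```python
import re

def format_column(rows, col, pattern, replacement, col_names):
    """Format: apply a regex substitution to column `col` of every row.
    \\C in the replacement demotes the column's name into the value."""
    repl = replacement.replace(r"\C", col_names[col])
    return [row[:col] + (re.sub(pattern, repl, row[col]),) + row[col + 1:]
            for row in rows]

# Demote the column name to get a self-describing value, as in Section 3.1.
rows = [("George",), ("Anna",)]
print(format_column(rows, 0, r"^(.*)$", r"<\C> \1 </\C>", ["Name"]))
# [('<Name> George </Name>',), ('<Name> Anna </Name>',)]
```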

3.2 Vertical Transforms: One Row to One Row Mappings

Vertical transforms are one-to-one mappings of tuples that are typically used for performing the column operations needed to unify data collected from multiple sources into a common format.

Drop Column, Copy Column, Add Column: The role of Drop and Copy is self-evident. Add adds a new column whose values can be set to a constant, a random number, or a serial number. The last is useful when we need unique identifiers for data merged from different sources.

Merge Columns with Glue, Split Column: Merge concatenates the values in two columns, interposing a constant (the glue) in the middle, to form a single new column. Split splits a column into two by specifying either a position or a regular expression (the split occurs at the first matching point).

These operators are particularly useful for handling schematic differences. For instance, suppose that names are specified in two columns FirstName and LastName in source A, and as one Name column with a format of LastName, FirstName in source B, and that records from A and B have been merged together. Assume for instance that the Name column has been mapped separately from the FirstName and LastName columns, as shown in Figure 3. The user could first Format names of the LastName, FirstName variety into FirstName LastName, then Split them into two separate columns, and then Merge the corresponding LastName and FirstName columns from the two sources.
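The Figure 3 pipeline can be sketched in a few lines of Python (hypothetical helpers, not the system's implementation): a Format rewrites "LastName, FirstName" into "FirstName LastName", a Split cuts at the first match, and a Merge glues two columns back together.

```python
import re

def split(row, col, pattern):
    """Split row[col] into two columns at the first regex match."""
    m = re.search(pattern, row[col])
    return row[:col] + (row[col][:m.start()], row[col][m.end():]) + row[col + 1:]

def merge(row, i, j, glue):
    """Merge columns i and j with glue, appending the result."""
    rest = tuple(v for k, v in enumerate(row) if k not in (i, j))
    return rest + (row[i] + glue + row[j],)

name = re.sub(r"(.*), ?(.*)", r"\2 \1", "Stewart,Bob")  # Format step
print(split((name,), 0, r" "))               # ('Bob', 'Stewart')
print(merge(("Bob", "Stewart"), 0, 1, " "))  # ('Bob Stewart',)
```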

Divide Column: Divide performs a conditional division of a column. Suppose that in the previous example the Name column of source B has instead been mapped onto the FirstName column of source A, as shown in Figure 4. We want to apply Split only on the complete names. Hence we first apply Divide to vertically divide this column into two using a predicate. Two new columns are created, the original values going into the first or second column depending on whether they satisfy the predicate or not.

The motivation for Divide is to support conditional transformation, as is needed when logically different values (maybe from multiple sources) are bunched into the same column. We currently support arithmetic and regular-expression-match based predicates. Without a separate Divide transform, all other vertical transforms would have to accept predicates, complicating the GUI for their specification. A sketch of Divide follows.
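A minimal Python sketch of Divide (an assumed helper, mirroring the definition in Table 1): each value lands in the first new column if the predicate holds, and in the second otherwise.

```python
def divide(rows, col, predicate):
    """Divide: replace column `col` by two columns, routing each value
    by the predicate; the other slot is filled with None (null)."""
    out = []
    for row in rows:
        v = row[col]
        pair = (v, None) if predicate(v) else (None, v)
        out.append(row[:col] + pair + row[col + 1:])
    return out

# Separate "Last,First" names from bare first names, as in Figure 4.
print(divide([("Stewart,Bob",), ("Anna",)], 0, lambda v: "," in v))
# [('Stewart,Bob', None), (None, 'Anna')]
```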

Figure 6: Fold-ing to resolve higher-order differences

Figure 7: Algorithm to flatten the data and resolve higher-order differences using Fold:
1. Let c1, c2, ..., ck be the columns whose column names are data values.
2. For each ci in c1, c2, ..., ck do {
3.   Divide ci into ci and ci' based on whether the column name has already been demoted.
4.   Format ci to demote the column name.
5.   Merge ci and ci'.
6. }
7. Fold c1, c2, ..., ck.

Discussion of Vertical Transforms:

Ease of specification: The GUI for vertical transforms allows users to specify transforms by selecting the appropriate columns/rows, choosing a transform from a drop-down menu, and then entering suitable parameters. We could change our interface to support specification of the Merge, Drop, Add, and Copy transforms using direct-manipulation [35] operations like dragging and dropping columns. However, Split and Divide by pattern intrinsically need users to specify regular expressions, necessitating textual input. In the future we intend to study allowing users to specify these by example. For instance, we can let users show a splitting position on a sample set of values and infer a regular expression or position.

Power: Completeness of transforms: The completeness of our vertical transforms for all one-to-one mappings of tuples arises from the power of Format. We can Merge all columns with a suitable glue, Format the result, and Split it using the glue, to perform any one-to-one mapping of tuples.

Power: Minimality of transforms: Format, Merge and Split are functionally complete, and hence Divide, Add, Drop and Copy can be written in terms of them. We provide them separately for two reasons. First, many operations are more naturally specified via these operations; e.g. Drop-ing a column is much simpler than Merge-ing it with another column and then Format-ing it to remove the unnecessary part (Appendix A formalizes this idea). Second, specifying transforms directly via these operations permits many performance optimizations in their application that would not be possible if they were done opaquely as a UDF-based Format. We will discuss these optimizations in Section 5.

3.3 Horizontal Transforms 1: One-to-Many Mappings

Horizontal transforms help tackle higher-order schematic heterogeneities [28], where information is stored partly in data values and partly in the schema. Figure 5 shows a case where a student's grades are listed as one row per course in one schema, and as multiple columns of the same row in another. We describe simple one-to-many transforms in this section and deal with many-to-many transforms in the next section.

Fold Columns

Fold converts one row into multiple rows, folding a set of columns together and replicating the rest, as defined in Table 1. Figure 6 shows an example with student grades where the subject names are demoted into the row via Format, the grades are Fold-ed together, and then Split to separate the subject from the grade. A sketch follows.
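A small Python sketch of Fold per the definition in Table 1 (a hypothetical helper): each input row yields one output row per folded column, with the remaining columns replicated.

```python
def fold(rows, fold_cols):
    """Fold: emit one row per folded column, replicating the rest."""
    out = []
    for row in rows:
        rest = tuple(v for k, v in enumerate(row) if k not in fold_cols)
        for c in fold_cols:
            out.append(rest + (row[c],))
    return out

# Grades demoted as in Figure 6, then folded into one column.
rows = [("George", "Math:65", "French:42"), ("Anna", "Math:43", "French:78")]
print(fold(rows, (1, 2)))
# [('George', 'Math:65'), ('George', 'French:42'),
#  ('Anna', 'Math:43'), ('Anna', 'French:78')]
```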

Figure 8: Folding a set of values into a single column; the user has to filter out the nulls later using Filter

Figure 9: A series of two-column folds will not fold three columns together: note the duplicate History records

Fold is similar to the Fold restructuring operator of SchemaSQL [25], except that we do not automatically demote column names in Fold. Although demote and Fold are often done in succession to resolve schematic heterogeneities, we choose not to bundle these operations together because there are many situations where there may not be a meaningful column name; e.g. columns formed by transformations have no column name, and neither do columns containing the expanded representation of sets. Figure 8 shows an example where Fold without demote is needed. If the user wants, Fold and demote can be made into a macro, as we describe in Section 3.6.

Note that the ability to fold arbitrarily many columns in one operation is crucial, and cannot be simulated by a series of two-column folds, because that leads to incorrect duplicate semantics, as shown in Figure 9.

Filter Rows

Filters are 1-to-1/0 mappings used to eliminate irrelevant records or unwanted by-products of other transforms. We currently support arithmetic and regular-expression-based predicates, as well as UDFs.

Discussion of One-to-Many Transforms:

Ease of specification: Fold is specified by selecting the columns to be folded and choosing Fold from the transform menu. A Filter is specified either by explicitly entering a predicate, or by selecting a value on the screen and specifying a boolean operator w.r.t. it (e.g. the user can click on 0 in a spreadsheet cell and choose > to select rows with positive values for that column).

Power: The combination of Fold, Filter, and the vertical transforms allows us to perform all one-to-many mappings of rows, as we prove in Appendix A. Another use of Fold is to flatten a table, as shown in Figure 6. This converts it to a form where the column and table names are all literal names and do not contain data values; this notion is formally defined in [25]. In general, some rows may already have column names demoted, and a flattening technique that handles this is given in Figure 7.

3.4 Horizontal Transforms 2: Many-to-Many Mappings

The most general transforms map multiple rows into one or more rows. These are fundamentally hard to apply to the tuples visible to the user on the screen within an interactive response time, because for each row we need to find all its "companion rows" before we can transform it. However, one such transform is quite useful for higher-order transformations.

Unfold Columns

Unfold is used to "unflatten" tables and move information from data values to column names, as shown in Figure 10. Interestingly, Unfold is not the exact inverse of Fold. Fold takes in a set of columns and folds them into one column, replicating the others. Unfold takes two columns, collects rows that have the same values for all the other columns, and unfolds the two chosen columns. Unfold needs two columns because values in one column are used as column names to align the values in the other column.

Formally, Unfold(T, i, j) on the i'th and j'th columns of a table T with n columns named $c_1, \ldots, c_n$ (a column with no name is assumed to have a NULL name) produces a new table with $n + m - 2$ columns named $c_1, \ldots, c_{i-1}, c_{i+1}, \ldots, c_{j-1}, c_{j+1}, \ldots, c_n, u_1, \ldots, u_m$, where $u_1, \ldots, u_m$ are the distinct values of the i'th column in T. Every maximal set of rows in T that have identical values in all columns except the i'th and j'th, and distinct values in the i'th column, produces exactly one row. Specifically, a maximal set S of k rows $(a_1, \ldots, a_{i-1}, u_l, a_{i+1}, \ldots, a_{j-1}, v_l, a_{j+1}, \ldots, a_n)$, where l takes k distinct values in $[1..m]$, produces the row $(a_1, \ldots, a_{i-1}, a_{i+1}, \ldots, a_{j-1}, a_{j+1}, \ldots, a_n, v_1, v_2, \ldots, v_m)$. If a particular $v_p$ in $v_1, \ldots, v_m$ does not appear in column j of the set S, it is set to NULL. Values of column i, the Unfolding column, are used to create columns in the unfolded table, and values of column j, the Unfolded column, fill in the new columns.

Unfold is exactly the same as the Unfold restructuring operation of SchemaSQL [25]. However, since Unfold automatically promotes column names, we cannot use it to restructure set-valued attributes (Figure 11 gives an example of the desired restructuring), since there is no explicit column name. Due to space constraints we relegate the description of a variant of Unfold that handles this issue to Appendix B.

Implementing Unfold in a generated program is simple; it is much like the group-by functionality of SQL. We sort by the columns other than the Unfolding and Unfolded columns and scan through this sorted set, collapsing sets of rows into one row.
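The generated program implements Unfold by sorting and scanning, as just described; the Python sketch below (a hypothetical helper) uses hashing instead, for brevity, but produces the same grouping: values of the unfolding column i become new columns, filled from the unfolded column j.

```python
from collections import defaultdict

def unfold(rows, i, j):
    """Unfold(T, i, j): group rows that agree on all columns except i
    and j; column i's values name the new columns, column j fills them."""
    new_cols, groups = [], defaultdict(dict)
    for row in rows:
        key = tuple(v for k, v in enumerate(row) if k not in (i, j))
        if row[i] not in new_cols:
            new_cols.append(row[i])
        groups[key][row[i]] = row[j]
    # Missing values become None (NULL), as in the definition above.
    return [key + tuple(vals.get(c) for c in new_cols)
            for key, vals in groups.items()]

rows = [("George", "Math", 65), ("George", "French", 42),
        ("Anna", "Math", 43), ("Anna", "French", 78)]
print(unfold(rows, 1, 2))   # columns: Name, Math, French
# [('George', 65, 42), ('Anna', 43, 78)]
```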

Discussion of Many-to-Many Transforms:

Ease of specification: Unfold is simple to specify graphically. The user can select the unfolding and unfolded columns and then choose Unfold from a menu of transforms.

Power: Fold allows us to flatten our tables into a common schema where all the information is in the columns, thereby resolving schematic heterogeneities. Unfold allows us to reconvert the unified schema into a form where some information is in column names. Fold and Unfold are essentially the same as the restructuring operators of SchemaSQL, and the only restructuring operators of SchemaSQL we miss are Unite and Split, which are used for manipulating multiple tables. For a more detailed analysis of the power of Fold and Unfold for (un)flattening tables, and also for their application to OLAP, see [25, 16].

Ease of interactive application: We do not currently implement Unfold in a visual manner at the user interface, due to two problems. First, when we want to display a row in the user interface we need to find a set of matching rows. This will involve, in the worst case, waiting for a complete scan of the dataset. Second, we do not even know what or how many columns are going to be in the output; this depends on the number of distinct values in the Unfolding column.

We could work around these problems by not displaying a complete row but instead displaying more and more columns as distinct values are found, and filling in data values in these columns as the corresponding input rows are read. However, we feel that progressively adding more and more columns in the spreadsheet interface would confuse the user. We plan to avoid the problem of asynchronous column addition by implementing an abstraction (column roll-up) interface where, upon an Unfold, all the newly created columns are shown as one rolled-up column. When the user clicks to unroll the column, it expands into a set of columns corresponding to the distinct values found so far.

3.5 Summary of Power of Transforms

Figure 10: Unfold-ing into three columns

Figure 11: Unfolding a set of values, without an explicit column name to align

Our transforms support all one-to-many transformations of rows in a table, as we prove in Appendix A. They also support moving information between schema and data, and can be used for flattening and unflattening tables. We believe that these transforms can also be used in some cases for converting data between nested or self-describing formats and flat formats. For example, a row with multiple columns can be converted into a nested structure by demoting the column names as tags (as in the example of Section 3.1), unfolding set-valued attributes distributed in multiple rows (as in Figure 11), and merging all the attributes that are in separate columns. We intend to formally study the power of our transforms for conversions involving nested formats like XML.

3.6 Macros: Programming a Transformation by Example

However carefully we choose our transforms, some applications may need a long and laborious sequence of the basic transforms. To automate this process we allow users to program such a transformation "by example", by forming macros out of a sequence of basic transforms. A user selects a set of columns as input parameters to a macro and then applies a series of transforms on those columns. These can subsequently be invoked directly using the macro. The most useful macros are simple type constructors that transform a column having discrepancies in values from a domain into a clean, unified format. For instance, a Date Type Constructor macro can hold the sequence of transforms needed to unify dates in different formats.

4 Framework for Discrepancy Detection

In this section we present a classification of the various discrepancies that could arise in data, and discuss how these are incrementally detected. We also describe the API for integrating custom discrepancy detection algorithms into our framework.

We need to make discrepancy detection incremental so that users can clean data interactively. We aim to find large classes of discrepancies early (such as format differences arising because of merging data from different sources), and to find rarer, smaller classes of discrepancies gradually. Very small classes of discrepancies, i.e. outliers, are intrinsically hard to detect. Our framework supports probabilistic algorithms where the chance of missing an outlier decreases with time. As in Online Aggregation [19], we leave it to the user to decide, based on the error probabilities, when to stop checking for discrepancies.

Our framework also handles batch algorithms that produce all the discrepancies at the very end, since these may be the only (or a significantly more efficient) way of finding some kinds of discrepancies. The error probability for such an algorithm remains 100% until it finishes, and drops sharply to 0 at the end.

4.1 Types of Discrepancies

There are three main forms of discrepancies: structural, schematic, and constraint violations.

Structural discrepancies: These are discrepancies in the structure of individual fields in the data. For instance, dates may be stored as "Month date, year", "MM/DD/YY", and "MMDDYYYY" in different records.

Schematic discrepancies: Schematic discrepancies arise due to poor schema mapping when merging data from multiple sources. They typically cause the value of a field to be null in some rows and non-null in others. For instance, Name could be stored as two fields in some rows and as one field in others, as shown in Figures 3 and 4. Higher-order variations in the data (as in Figure 5) also result in schematic discrepancies.

Domain constraint violations: Domain constraint violations can be of two types: single-row or multi-row. Single-row violations are those where the value of a field in a tuple directly violates the constraints of the field's domain. Multi-row violations are those where the values of field(s) in two or more rows together violate a constraint, even though each row is individually correct. Examples of these include violations of range constraints on attributes like age, and violations of functional dependencies, respectively.

Of the above three categories, structural and schematic differences are typically easy to detect incrementally. We simply extract the structure of one value in a column, and compare the structures of the other values (in that column) to it. We explain more about structure extraction in Section 4.3. In the next two sections we describe our framework for detecting domain constraint violations.

4.2 Support for Probabilistic Incremental Discrepancy Detection

Single-row violations like spelling errors are simple to detect incrementally, since we need only a constant amount of memory to handle the current row. However, many important discrepancies, like functional dependency violations and uniqueness violations, are multi-row and are hard to find incrementally. Deterministic algorithms for detecting multi-row discrepancies typically take time and space super-linear in the data size before producing any results (e.g. see [27]).

However, these discrepancies can typically be detected probabilistically in an incremental fashion, with the probability of a missed outlier decreasing with time (e.g. [22]). To aid such probabilistic methods, a discrepancy detection algorithm is fed a continuous random stream of tuples. To detect multi-row constraint violations, an algorithm must incrementally build state about previously seen tuples to approximate properties of the domain, and compare new tuples against this state. Since data sizes are usually much larger than memory sizes, the stored state is only approximate and can often reflect the properties of only a window of recently seen tuples. For example, to find duplicates, an algorithm could maintain a hash table (or Bloom filter) of recently seen tuples, as in the sketch below.
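As an illustration of such bounded, approximate state, here is a small Python sketch of a Bloom-filter duplicate check (our illustration, not the system's code): memory stays fixed, a repeated tuple is always flagged, and occasional false alarms are the price of the approximation.

```python
import hashlib

class BloomFilter:
    """Approximate set membership in bounded memory."""
    def __init__(self, bits=1 << 20, hashes=4):
        self.bits, self.hashes = bits, hashes
        self.arr = bytearray(bits // 8)

    def _positions(self, item):
        for i in range(self.hashes):
            h = hashlib.blake2b(item.encode(), salt=bytes([i])).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def add(self, item):
        for p in self._positions(item):
            self.arr[p // 8] |= 1 << (p % 8)

    def seen(self, item):
        return all(self.arr[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

bf = BloomFilter()
for tup in ["George|Math|65", "Anna|French|78", "George|Math|65"]:
    if bf.seen(tup):
        print("possible duplicate:", tup)   # flags the repeated tuple
    bf.add(tup)
```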

4.3 Extensible Discrepancy Detection

We believe that domain constraint violations are best detected using domain-specific algorithms. Therefore we provide only some elementary algorithms as defaults, and rely on powerful extensibility to allow users to plug in custom algorithms. An extensible framework must handle several issues.

• We want to allow users to check discrepancies on logical domains rather than particular fields. For instance, an application may have a particular way of specifying month names, and this algorithm must be automatically applied to detect discrepancies on any field that contains month names.

• Values falling in such domains may exist as sub-components of a field. For instance, a date column can have the format Month Date, Year. We want to allow users to break down a field into sub-components in custom ways, but also have a default that will handle many common structures.

• In contrast to the previous point, some discrepancies may exist only in the context of a complete field; e.g., although a date field is made up of three sub-components, some dates (such as February 29) may be invalid except in a leap year.

• Extending the third point, discrepancies such as functional dependency violations arise only in the context of multiple fields.

In order to handle these issues, we need two mechanisms for discrepancy detection. First, users must be allowed to specify algorithms that apply on a particular field or a set of fields. Second, users must be able to specify an algorithm on a domain, and have the system automatically decompose a value, infer a domain for each sub-component, and apply an algorithm appropriate for that domain. We briefly discuss structure extraction before describing these two mechanisms.

Structure Extraction: Different ways have been proposed to extract structure from data [29, 2]. In our system we use the simple mechanism of splitting a value into sub-components separated by delimiters, where the delimiters consist of punctuation marks and white space. In the future we plan to extend our system to use structure descriptions that accompany values (such as XML DTDs).
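A minimal Python sketch of this delimiter-based structure extraction (the delimiter set and type names are illustrative assumptions):

```python
import re

def extract_structure(value):
    """Split a value at punctuation/whitespace delimiters and tag each
    sub-component with an inferred elementary type."""
    parts = [p for p in re.split(r"[\s,:;/.-]+", value) if p]
    return [(p, "Number" if p.isdigit() else "String") for p in parts]

print(extract_structure("09/10/1980"))
# [('09', 'Number'), ('10', 'Number'), ('1980', 'Number')]
print(extract_structure("May 24, 1979"))   # a structural mismatch
# [('May', 'String'), ('24', 'Number'), ('1979', 'Number')]
```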

Potter's Wheel provides two mechanisms for discrepancy detection.

Type-based Discrepancy Detectors (TDDs)

A TDD is an algorithm specific to a particular type that detects discrepancies in values of this type. Potter's Wheel provides default TDDs for simple types and allows users to register TDDs for special types, like Dates or Chemical Formulae. To check the values in a certain sub-component of a certain field, the system uses a TDD registered for the type that matches these values (if any is registered). The system flags as a structural discrepancy any variation among different rows in the number, or types, of sub-components for a particular field.

The system chooses one row as a chosen row. For each field in this chosen row, the system extracts the structure and infers the types of its sub-components. These types are used to choose the TDDs to be applied on each sub-component⁵. If a value falls into more than one type, the system first gives priority to user-registered TDDs over default TDDs, and if there still are multiple matching types, asks the user to resolve the ambiguity. In every subsequent row, the structure of each field and the types of its sub-components are extracted and compared against the structure of the corresponding field in the chosen row. Any inconsistency is flagged as a structural discrepancy. Next, the chosen TDDs are applied to the values of every sub-component of every field in this row.

A Type-based Discrepancy Detector must implement the following API (a sketch follows the list):

• boolean InferType(char* atomic_value): This function is run on all the sub-components of all fields in the chosen row, and determines whether a given value falls within the domain for this TDD. If the value of a particular sub-component in the chosen row matches the type for a registered TDD, the following functions are applied to the value of this sub-component in all rows.

• void UpdateStats(char* atomic_value): This updates the internal state of the TDD with a new value.

• boolean CheckDiscrepancy(char* atomic_value): This indicates whether the current value is an outlier based on the state accumulated so far⁶.

• float Confidence(int data_size): This returns the probability that all discrepancies have been found, and typically needs to know the total number of tuples in the data set (e.g. see [22]).
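To make the API concrete, here is a Python sketch of a numeric TDD in the spirit of the Number detector used in the example below. Method names mirror the C API above; the running mean/variance update (Welford's algorithm) and the confidence estimate are our choices, not necessarily the system's.

```python
import math

class NumberTDD:
    """Flags values more than k standard deviations from the running mean."""
    def __init__(self, k=10.0):
        self.k, self.n, self.mean, self.m2 = k, 0, 0.0, 0.0

    def infer_type(self, atomic_value):          # boolean InferType(...)
        try:
            float(atomic_value)
            return True
        except ValueError:
            return False

    def update_stats(self, atomic_value):        # void UpdateStats(...)
        x = float(atomic_value)                   # Welford's online update
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)

    def check_discrepancy(self, atomic_value):   # boolean CheckDiscrepancy(...)
        if self.n < 2:
            return False
        std = math.sqrt(self.m2 / (self.n - 1))
        return std > 0 and abs(float(atomic_value) - self.mean) > self.k * std

    def confidence(self, data_size):             # float Confidence(...)
        return min(1.0, self.n / data_size)      # crude placeholder estimate
```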

User-speci�ed Discrepancy Detectors (UDDs)

A UDD is a discrepancy detection algorithm that the user asks the system to apply on a speci�c set of

�elds. The API for a UDD is the same as for a TDD except that there is no need for an InferType function,

values in the chosen �elds are passed directly to the UDD.

Example of TDDs and UDDs

Figure 12: A student information table with discrepancies

Name     Dept. Name   Dept. Id   Date of Birth
George   Math         42         09/10/1980
Anna     Pottery      83         01/05/1983
Bob      English      78         May 24, 1979
Joan     Biology      70         07/11/1980
Jim      Biology      70         09/11/19978

Consider the table shown in Figure 12, containing student information. Suppose that the user has registered a Number TDD that maintains as internal state the mean and standard deviation of the values (numbers) seen so far, and flags any value that is more than 10 standard deviations from the mean as an anomaly. He has also registered a String TDD that matches strings of alphabetic characters but does not do any discrepancy checking. He has also specified a UDD on the Dept. Name column to check the validity of department names, and a UDD on Dept. Id and Dept. Name to check the functional dependency that one Dept. Id corresponds to exactly one Dept. Name.

Assume that the row chosen by the system to extract structure and infer types is George's record. The structures inferred are Name: String, Dept. Name: String, Dept. Id: Number, Date of Birth: Number/Number/Number. Potter's Wheel could use the Dept. Name UDD to find "Pottery" as an anomaly, since the UDD "knows" the set of valid departments. The date May 24, 1979 will be caught as a structural discrepancy, because its structure is extracted as String Number, Number. The system will separately invoke the Number TDD on the MM, DD, and YY sub-components of Date of Birth, and will thus catch the huge year 19978 as a discrepancy, because it is too many deviations away from the mean value of the YY sub-component.

⁵ Note that it does not matter if the chosen row itself is an anomaly, since we always show both the anomalous row and the row w.r.t. which it is inconsistent as discrepancies, thereby allowing the user to correct either one.
⁶ A batch discrepancy detection algorithm must provide an extra function that outputs the list of discrepancies at the end (when Confidence() becomes 100%).

Default Discrepancy Detectors in Potter's Wheel

We currently provide only simple TDDs for integers and strings. The integer TDD checks for values that are more than 10 standard deviations from the (currently estimated) mean. The string TDD first checks the spelling using the ispell [23] library. If the spelling check fails, it does a typo check: the TDD maintains a hash table of the last 100 distinct values seen, and flags a value as a possible error if it has a spelling error and is within a Hamming distance of 2 from any of the previous values (see the sketch below). Some of the parameter choices are quite ad hoc, but applications can customize them by adding other TDDs. We plan to add a default UDD for detecting gray functional dependencies [15] and their violations using algorithms similar to [22].
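A sketch of the string TDD's typo check in Python (the dictionary set stands in for the ispell call; the window size and distance threshold are the defaults described above):

```python
from collections import deque

def hamming(a, b):
    """Hamming distance for equal-length strings, else None."""
    return sum(x != y for x, y in zip(a, b)) if len(a) == len(b) else None

class StringTDD:
    def __init__(self, dictionary, window=100):
        self.dictionary = set(dictionary)     # stand-in for ispell [23]
        self.recent = deque(maxlen=window)    # last `window` distinct values

    def check_discrepancy(self, value):
        misspelled = value.lower() not in self.dictionary
        typo = False
        if misspelled:
            for r in self.recent:
                d = hamming(value, r)
                if d is not None and d <= 2:
                    typo = True
                    break
        if value not in self.recent:
            self.recent.append(value)
        return typo

tdd = StringTDD(["biology", "math", "english"])
print([tdd.check_discrepancy(v) for v in ["Biology", "Bialogy"]])
# [False, True] -- "Bialogy" is misspelled and close to a recent value
```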

5 Optimization of a Sequence of Transforms

The goal of Potter's Wheel is to let users specify transforms as they are needed, often only when discrepancies are found, in an order that is natural to the user. Hence the resultant transformation often has redundant or sub-optimal transforms. As we will discuss below, executing such a sequence of transforms exactly in the order specified by the user can be quite inefficient. Hence we want to convert the sequence of transforms specified by the user into one that is more efficient to execute, and compile it into an optimized program that performs the transformation on the database. We have begun investigating different ways of optimizing sequences of transforms, focusing first on simple optimizations that give good benefit. We present initial results in this section.

5.1 Optimization Setting

We consider only optimizations of transformations having one-to-one (vertical) transforms. Since such a transformation needs to be applied once for each row in the dataset, we want to execute it efficiently. Thus, our granularity of optimization is a row-to-row mapping.

Figure 13: An example of a transformation that can be optimized

Figure 14: Optimized execution of the transformation

Our main concern in this optimization is CPU time. Since I/O and tuple transformation are pipelined, and we do only large sequential I/Os, the I/O cost is masked by the CPU cost. CPU time is composed mainly of memory copies, buffer (de)allocations, and regular expression matches. Format and Split may involve regular expression matches. However, almost any transform, if implemented naively without looking at the transforms that come before and after it, will have to (a) allocate a suitably sized buffer for the output, (b) transform the data from an input buffer to the output buffer, and (c) deallocate the output buffer after it has been used. This buffer is often of variable size (because transforms such as Merge, Split, and Format change data sizes, and the data itself can be of variable size) and so must be allocated dynamically. With this approach, transforms like Merge and Split involve memory-to-memory copies as well. We use the term materialization to refer to the strategy of creating a buffer for the output of a transform.

Consider the example transformation shown in Figure 13, with materialization done between all successive transforms. The ovals represent transform operators and the rectangles represent materialization. The graph is executed bottom-up, each operator executing when all of its inputs are ready. Three materializations are needed if the graph is executed naively, but these can be completely avoided, as shown in Figure 14. The two Splits can be done with one pass over the buffer InputField1 to find the split positions, and pointers to these substrings can be given as input to Format and copied to OutputField1. Similarly, the third substring after the split can be directly placed at the beginning of OutputField3.

In this paper we aim to reduce the time spent on memory copies and (de)allocations by minimizing materialization; we do not optimize regular expression matches. In the rest of this section we generalize the approach of Figure 14: we decide how to minimize the materialization needed, and how to compile an optimized graph with minimal materialization into a program.

5.2 Determining a Minimal Set of Materialization Points

Some transforms impose constraints on the materialization of their inputs and/or outputs. Arbitrary UDFs

and Format need their inputs to be materialized7 and produce outputs that are materialized (because of

the requirements of our regular-expression library and our arithmetic expression parser). Split by regular

expression needs to have a materialized input but need not materialize its output; it can stop at determining

the split positions and directly pass on the bu�er to its outputs.

Materialization can also be done for performance. Materializing before a Copy can be used to avoid

reapplying the transforms below the copy to generate each of its outputs. This problem is akin to that

7We say that a value is materialized if it is present in a single contiguous memory bu�er.

15

Page 16: Potter's Wheel: An Interactive Framework for Data Cleaning and

Merge

Split (position 1, position 2, ...)

Transform

Format (to expr.,from expr.)

Split (regular expr. 1, regular expr. 2 ...)

LLBS A1

Input

singleton LLBS A1

singleton LLBS A1

LLBSs A1..An, onefrom each input

LLBS output1, LLBS output2, ...

Output

singleton LLBS having format(A1)

singleton LLBSs output1, output2 ...

LLBS flatten(A1,A2,..An)

k-way Copy

Output (output buffer)

Add (constant/serial/random)

Input

LLBS A1

LLBS A1

string A1

input buffer A1

LLBS A1, LLBS A1, ... k times

concatenate A1 in output buffer

A1 as a singleton LLBS

A1 as a singleton LLBS

Divide (predicate) singleton LLBS A1 A1 if predicate satisfied, else null

Materialize LLBS A1concatenate A1 in new o/p bufferand return this as a singleton LLBS

Figure 15: Operations performed by transforms. A singleton LLBS contains exactly one LBS.

Ideally, we want to materialize only if the cost of materialization is less than the (optimal) cost of performing the transforms before the Copy to produce all the outputs. However, determining this optimal cost is very difficult, since it depends on the (optimal) locations of other materialization points both before and after the Copy [34, 30]. Currently we adopt the heuristic of always inserting a materialization point before a Copy, except when there are no transforms before it.

We add to the transformation graph only those materializations that are needed to meet these constraints.
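A rough sketch of such a pass follows; the node representation and names are our own illustration, not the actual implementation, and a real pass over a DAG would memoize visited nodes:

    #include <stdlib.h>

    /* Hypothetical representation of a transformation-graph node. */
    typedef enum { T_INPUT, T_SPLIT, T_MERGE, T_FORMAT,
                   T_COPY, T_MATERIALIZE, T_OUTPUT } Kind;

    typedef struct Node {
        Kind kind;
        struct Node *inputs[8];   /* nodes whose outputs this node consumes */
        int n_inputs;
    } Node;

    /* Insert a Materialize node before every Copy, except when the Copy
     * reads directly from an Input node (no transforms before it). */
    static void add_materialization_points(Node *n) {
        for (int i = 0; i < n->n_inputs; i++) {
            Node *child = n->inputs[i];
            add_materialization_points(child);
            if (n->kind == T_COPY && child->kind != T_INPUT
                                  && child->kind != T_MATERIALIZE) {
                Node *mat = calloc(1, sizeof *mat);
                mat->kind = T_MATERIALIZE;
                mat->inputs[0] = child;
                mat->n_inputs = 1;
                n->inputs[i] = mat;
            }
        }
    }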

5.3 Restructuring the Transformation Graph

After inserting the minimum materializations needed, we simplify the transformation graph using some simple restructuring operations. We coalesce successive Merges into a single Merge, and coalesce successive Splits into a single multi-output Split; the resulting Split is parameterized by the ordered combination of the splitting conditions of the constituent Splits. If a regular-expression-based Split comes immediately after a position-based Split, we do not coalesce them, because doing so would force the materialization of a bigger string before the coalesced Split, compared to materializing after the position-based Split only. We also remove from the transformation graph all nodes whose only eventual ancestors are Drops.
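For the special case of two successive position-based Splits, coalescing amounts to translating the inner Split's relative positions by the offset of the fragment it applies to, and taking the sorted union of the positions. A small sketch under our own illustrative names:

    #include <stdlib.h>

    static int cmp_size(const void *a, const void *b) {
        size_t x = *(const size_t *)a, y = *(const size_t *)b;
        return (x > y) - (x < y);
    }

    /* Coalesce an outer Split at absolute positions p[0..np) with an
     * inner Split of one fragment (starting at absolute offset `base`)
     * at relative positions q[0..nq): the combined multi-output Split
     * uses the sorted union of p[i] and base + q[j]. */
    size_t *coalesce_positions(const size_t *p, size_t np,
                               const size_t *q, size_t nq,
                               size_t base, size_t *n_out) {
        size_t *all = malloc((np + nq) * sizeof *all);
        for (size_t i = 0; i < np; i++) all[i] = p[i];
        for (size_t j = 0; j < nq; j++) all[np + j] = base + q[j];
        qsort(all, np + nq, sizeof *all, cmp_size);
        *n_out = np + nq;
        return all;
    }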

5.4 Generating Optimized Code for Transforms

We aim to generate code for a transformation graph that never allocates or copies a buffer except at materialization points. We perform a bottom-up traversal of the transformation graph. Each node is a task that needs to be performed; performing the task corresponds to looking at its input, applying the transform (or materializing, if it is a materialize node), and then "passing on" the results to the nodes above it. A node can fire only when all its inputs are available, so we maintain a queue of nodes that are ready to fire, and repeatedly pick a node from it to fire. A node enters the queue when all its inputs are available.
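A minimal sketch of this firing loop, assuming a hypothetical Node type that tracks how many inputs are still pending and which nodes consume its output (the generated programs specialize this logic, so this is only illustrative):

    typedef struct Node {
        int pending;                      /* inputs not yet produced          */
        struct Node **consumers;          /* nodes reading this node's output */
        int n_consumers;
        void (*fire)(struct Node *self);  /* apply the transform/materialize  */
    } Node;

    void run_graph(Node **ready, int n_ready, int capacity) {
        /* `ready` initially holds the Input nodes, whose pending count is 0. */
        int head = 0, tail = n_ready;
        while (head < tail) {
            Node *n = ready[head++];
            n->fire(n);                   /* pass results on to consumers */
            for (int i = 0; i < n->n_consumers; i++) {
                Node *c = n->consumers[i];
                if (--c->pending == 0 && tail < capacity)
                    ready[tail++] = c;    /* all inputs are now available */
            }
        }
    }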

We need a way for nodes to "pass on" their transformed results to the nodes above them without copies. Passing a pointer to a single buffer fails because operations like Merge combine multiple buffers. Hence the mechanism we use for "passing data" is an ordered list of length-based strings (LLBS). Each length-based string (LBS) consists of a pointer to a buffer along with the length of the buffer, and is typically a window into a larger null-terminated string.
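In C, these structures could be rendered roughly as follows; this is a sketch, and the type and field names are our own illustration rather than the actual implementation:

    #include <stddef.h>

    typedef struct {
        const char *ptr;  /* points into a larger null-terminated buffer */
        size_t      len;  /* explicit length, so no scanning or copying  */
    } LBS;

    typedef struct {
        LBS   *items;     /* ordered list of length-based strings */
        size_t count;
    } LLBS;

    /* A "singleton" LLBS, as used in Figure 15, is one with count == 1,
     * i.e., it denotes a single contiguous window. */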


[Figure 16 is a bar chart of the average time to transform a record (in usecs, on a 0-40 scale) for each of the transformations T1-T8, comparing Perl, C, and Optimized C programs; the Perl bars run off the scale and are labeled 656, 683, 612, 681, 604, 673, 750, and 737 usecs for T1 through T8 respectively.]

Transformation | Constituent Sequence of Transforms
T1             | Split Date at position 5
T2             | Split Source by 'to'
T3             | Merge Source, Destination
T4             | Split Source by 'to', Merge the right part with Destination
T5             | Add constant 'Foo Bar', Merge result with Destination
T6             | Merge Source, Destination, Split result at position 4
T7             | Split Source by 'to', Merge right part with Destination, Split Date by '/',
               | Format resulting years with '19998' to '1998'
T8             | Split Source by 'to', Merge right part with Destination, Format result with 'to ' to '',
               | Split Date at position 5, Copy Delay

Figure 16: Time taken for different transformations. The gain due to optimization increases with the number of transforms.

Figure 15 gives the operations performed by (the code generated for) each transform, along with its input and output. By operating only on LLBSs, our transforms never have to materialize except at the points described in Section 5.2. Due to lack of space we explain only the interesting ones. Merge accepts an LLBS from each input and outputs a "flattened" LLBS: e.g., if the inputs are (lbs1, lbs2, lbs3) and (lbs4, lbs5), Merge outputs (lbs1, lbs2, lbs3, lbs4, lbs5). A Split based on position takes in an LLBS and further refines it based on the new split positions: e.g., a Split at position 5 of the LLBS ("abcdef", "ghijklmn") produces the two output LLBSs ("abcde") and ("f", "ghijklmn").
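A sketch of such a position-based Split over an LLBS, using the hypothetical LBS/LLBS types from the earlier sketch; the function name is ours, and note that no character data is copied:

    #include <stdlib.h>

    typedef struct { const char *ptr; size_t len; } LBS;   /* as sketched earlier */
    typedef struct { LBS *items; size_t count; } LLBS;

    /* Walk the list, cutting one LBS in two if the split position falls
     * inside it. E.g., splitting ("abcdef", "ghijklmn") at position 5
     * yields ("abcde") and ("f", "ghijklmn"). */
    void llbs_split_at(const LLBS *in, size_t pos, LLBS *left, LLBS *right) {
        left->items  = malloc(in->count * sizeof(LBS));
        right->items = malloc(in->count * sizeof(LBS));
        left->count  = right->count = 0;
        for (size_t i = 0; i < in->count; i++) {
            LBS w = in->items[i];
            if (pos >= w.len) {                    /* wholly on the left  */
                left->items[left->count++] = w;
                pos -= w.len;
            } else if (pos > 0) {                  /* cut inside this LBS */
                left->items[left->count++]   = (LBS){ w.ptr, pos };
                right->items[right->count++] = (LBS){ w.ptr + pos, w.len - pos };
                pos = 0;
            } else {                               /* wholly on the right */
                right->items[right->count++] = w;
            }
        }
    }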

5.5 Advantage of Compilation into Programs with Minimum Materialization

Figure 16 compares the average time taken to transform a row using the programs generated by Potter's Wheel for 8 transformations. We study the times taken by Perl and C programs generated without the optimizations of this section, and by C programs generated with these optimizations. The programs correspond to different transformations of a flight statistics data set^8. All the generated C programs were compiled with the highest optimization settings of the Visual C++ 6.0 compiler.

We ran 8 different transformations, ranging from single transforms to long sequences. We see that the optimizations described above give a speedup that varies from about 15% for single transforms like T1, T2, and T3, to about 75% for sequences of many transforms like T7 and T8. The performance gain is diluted by the time needed to parse the input and to copy the fields not involved in the transformation from the input to the output. The C programs are about an order of magnitude faster than the Perl programs; still, if transformations require programming by the user, scripting languages like Perl are likely to be chosen for their ease of rapid programming.

^8 We used data downloaded from FEDSTATS [14] for flights originating from Chicago O'Hare, San Francisco, and New York JFK airports. The columns in this dataset are shown in Figure 1; it has 952,771 records, with a total size of 73.186 MB. The schema for this dataset (shown in the user interface of Figure 1) is Delay:Integer, Carrier:Varchar(30), Number:Char(5), Source:Varchar(10), Destination:Varchar(5), Date:Char(13), Day:Char(3), Dept Sch:Char(5), Dept Act:Char(5), Arr Sch:Char(5), Arr Act:Char(5), Status:Char(10), Random:Integer. We ran our experiments on a 400 MHz Intel Pentium II processor with 128 MB of memory, running Windows NT 4.0. We accessed the data from an ASCII file data source.

6 Related Work

A nice description of the commercial data cleaning process is given in [4]. There are many commercial ETL (Extraction/Transformation/Loading) tools (also known as "migration" or "mapping" tools) that support transformations to different degrees, ranging from tools that do default conversions between a small set of common formats (e.g., WhiteCrane, Cognos [37, 6]) to fairly general tools that support a wide variety of transformations (e.g., Data Junction, DataBuilder [10, 9]). Many of these tools provide a visual interface for specifying some common transformations, but typically require users to program other transformations using conversion libraries (e.g., Data Junction's CDI SDK and DJXL [10]). Moreover, these tools typically perform transformations as long-running batch processes, so users do not get early feedback on the effectiveness of a transformation.

Complementing the ETL tools are data quality analysis tools (a.k.a. "auditing" tools) that analyze the data to find discrepancies. This operation is again long-running and causes many delays in the data cleaning process. Typically these tools are provided by different vendors from those who provide ETL tools, e.g., Migration Architect, Data Stage, ACR/Data [12, 11, 1]. Some vendors like Ardent and Evoke provide both ETL and analysis tools, but as different components of a software suite, leading to the problems described in Section 1.1.

Many scripting languages such as Perl and Python allow extremely powerful (essentially Turing-complete) transformations. The problem with these languages is that, even though they are not too complicated in themselves, writing scripts is difficult and error-prone, because of poor integration with exploration or discrepancy detection mechanisms. Often, only after executing a script does one find that it does not handle all the cases that arise in the data.

In recent years there has been much effort on integrating data from heterogeneous data sources via middleware (e.g., Garlic [33]). These efforts do not typically address errors or inconsistencies in the data, or data transformations. Recently, Haas et al. have been working on a tool to help users match the schemas of these heterogeneous databases and construct a unified view [17]. We intend to extend Potter's Wheel in a similar manner, with an interactive way of specifying rules for merging data from multiple sources.

There has been much work on languages and implementation techniques for performing higher-order

operations on relational data [5, 24, 25, 28]. Our horizontal transforms are very similar to the restructuring

operators of SchemaSQL [25].

There has been some algorithmic work on detecting deviations in data [3], on finding approximate duplicates in data merged from multiple sources [20], and on finding hidden dependencies and their violations [21, 27]. Many of these algorithms are inherently "batch" algorithms, optimized to complete as early as possible without giving any intermediate results. There are a few sampling-based approaches that are incremental [22]; however, these are not integrated with any mechanism for fixing the discrepancies.

NoDoSE [2] is a system for extracting structure from semi-structured data like web pages. Currently our Incremental Discrepancy Detector decomposes a value into components based on punctuation; we plan to make this decomposition more sophisticated using similar techniques.

7 Conclusions and Future Work

Data cleaning and transformation are important tasks in many contexts such as data warehousing and data

integration. The current approaches to data cleaning are time-consuming and frustrating due to long-running

noninteractive operations, poor coupling between analysis and transformation, and complex transformation

interfaces that often require user programming.

We have developed a simple yet powerful framework for data transformation and cleaning. By combining discrepancy detection and transformation in an interactive fashion, we let users gradually build a transformation to clean the data, adding transforms as discrepancies are detected. Users specify transforms graphically, and the effects of adding or undoing transforms are shown instantaneously, allowing easy experimentation with different transforms.

We have developed a set of transforms that are easy to specify graphically and yet quite powerful, handling all one-to-one and one-to-many mappings of rows as well as higher-order transformations. Since a transformation is broken down into a sequence of simple transforms, we are able to perform detailed optimizations when compiling it into a program. Our mechanism for discrepancy detection is general and easily extensible: it can handle both algorithms that are to be applied to any value falling in a domain, and algorithms that are to be applied to specific fields. We provide some simple default algorithms for common domains, which are fairly effective in catching common discrepancies.

An important direction for future work is an interactive way of resolving approximate duplicates in data from multiple sources, where the values of identifying fields (such as Name) do not match exactly [7, 20]. Resolution of such duplicates can often be done only by the user, but we need a way of detecting these approximate duplicates incrementally and flagging them to the user. We also need an interactive way of specifying the mapping between schemas from different sources, along the lines of [17].

Currently we assume that each attribute of a tuple is of an atomic type. An interesting extension is to handle nested and semi-structured data formats, which are increasingly likely with XML becoming popular. We have seen that our transforms can handle some transformations involving nested data, but their exact power in this regard needs further study. We also want to explore ways of detecting structural and semantic discrepancies in semi-structured data. An additional avenue for future work is more detailed optimization of transforms, such as coalescing successive regular-expression-based Formats together.

Acknowledgment

The scalable spreadsheet interface that we used was developed along with Andy Chou. Ron Avnur, Mike Carey, H.V. Jagadish, Laura Haas, Peter Haas, Marti Hearst, Renee Miller, Mary Tork Roth, and Peter Schwarz made many useful suggestions for the design of the transforms and the user interface. Renee Miller pointed us to work on handling schematic heterogeneities and suggested ways of handling them. Subbu Subramanian gave us pointers to related work in transformations. We used (source-code) modified versions of the PCRE-2.01, ispell-3.1.20, and calc-1.00 [18, 23, 26] libraries to support Perl-compatible regular expressions, perform spelling checks, and perform arithmetic operations, respectively. Computing and network resources were provided through NSF RI grant CDA-9401156. This work was supported by a grant from Informix Corporation, a California MICRO grant, NSF grant IIS-9802051, a Microsoft Fellowship, and a Sloan Foundation Fellowship.


References

[1] ACR/Data. http://www.unitechsys.com/products/ACRData.html.

[2] B. Adelberg. NoDoSE: a tool for semi-automatically extracting structured and semistructured data from text documents. In SIGMOD, 1998.

[3] A. Arning, R. Agrawal, and P. Raghavan. A linear method for deviation detection in large databases. In KDD, 1996.

[4] S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. In SIGMOD Record, 1997.

[5] W. Chen, M. Kifer, and D. S. Warren. HiLog: A foundation for higher-order logic programming. Journal of Logic Programming, pages 187-230, 1993.

[6] COGNOS Accelerator. http://www.cognos.com/accelerator/index.html.

[7] W. Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. In SIGMOD, 1998.

[8] CoSORT. http://www.iri.com/external/dbtrends.htm.

[9] DataBuilder. http://www.iti-oh.com/pdi/builder1.htm.

[10] Data Junction. http://www.datajunction.com/products/datajunction.html.

[11] Data Stage. http://www.ardentsoftware.com/datawarehouse/datastage/.

[12] Migration Architect. http://www.evokesoft.com/products/ProdDSMA.html.

[13] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11), November 1996.

[14] FEDSTATS. http://www.fedstats.gov.

[15] J. Gray and A. Reuter. Transaction Processing: Concepts and Techniques. Morgan Kaufmann, 1993.

[16] M. Gyssens, L. Lakshmanan, and S. Subramanian. Tables as a paradigm for querying and restructuring. In PODS, 1996.

[17] L. Haas et al. Transforming heterogeneous data with database middleware: Beyond integration. IEEE Data Engg. Bulletin, 1999.

[18] P. Hazel. PCRE 2.03. ftp://ftp.cus.cam.ac.uk/pub/software/programs/pcre/.

[19] J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In SIGMOD, 1997.

[20] M. Hernandez and S. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. DMKD Journal, 1997.

[21] Y. Huhtala, J. Karkkainen, P. Porkka, and H. Toivonen. Efficient discovery of functional and approximate dependencies using partitions. In ICDE, 1998.

[22] J. Kivinen and H. Mannila. Approximate dependency inference from relations. Theoretical Computer Science, 149(1):129-149, 1995.

[23] G. Kuennig. International ispell version 3.1.20. ftp.cs.ucla.edu.

[24] L. Lakshmanan, F. Sadri, and I. Subramanian. SchemaSQL: A language for interoperability in relational multi-database systems. In VLDB, 1996.

[25] L. Lakshmanan, F. Sadri, and S. Subramanian. On efficiently implementing SchemaSQL on a SQL database system. In VLDB, 1999.

[26] R. K. Lloyd. calc-1.00. http://hpux.cae.wisc.edu.

[27] H. Mannila and K. Raiha. Algorithms for inferring functional dependencies. Data and Knowledge Engineering, pages 83-99, 1994.

[28] R. J. Miller. Using schematically heterogeneous structures. In SIGMOD, 1998.

[29] S. Nestorov, S. Abiteboul, and R. Motwani. Inferring structure in semistructured data. In Workshop on Management of Semistructured Data, 1997.

[30] J. Park and A. Segev. Using common subexpressions to optimize multiple queries. In ICDE, 1988.

[31] V. Raman, A. Chou, and J. M. Hellerstein. Scalable spreadsheets for interactive data analysis. In DMKD Workshop, 1999.

[32] V. Raman, B. Raman, and J. M. Hellerstein. Online dynamic reordering for interactive data processing. In VLDB, 1999.

[33] M. T. Roth and P. M. Schwartz. Don't scrap it, wrap it! A wrapper architecture for legacy data sources. In VLDB, 1997.

[34] T. Sellis. Multiple-query optimization. TODS, 1988.

[35] B. Shneiderman. The future of interactive systems and the emergence of direct manipulation. Behavior and Information Technology, 1(3):237-256, 1982.

[36] S. Vandenberg and D. DeWitt. Algebraic support for complex objects with arrays, identity, and inheritance. In SIGMOD, 1991.

[37] White Crane's Auto Import. http://www.white-crane.com/ai1.htm.

A Power of Transforms

In this section we analyze the power of the vertical transforms and horizontal transforms. For simplicity we use $n$-way versions of the Split and Merge transforms; these can be implemented by $n-1$ applications of regular Splits and Merges, respectively.

A.1 Power of Vertical Transforms

Theorem 1: Vertical Transforms, along with Format, can be used to perform all one-to-one mappings of

rows.

Proof:

Suppose that we want to map a row $(a_1, a_2, \ldots, a_n)$ to $(b_1, b_2, \ldots, b_m)$. Let each $b_i$ be defined as $b_i = g_i(a_{i_1}, a_{i_2}, \ldots, a_{i_l})$. Assume that $|$ is a character not present in the alphabet from which the values are chosen. An obvious way of performing this transformation is as follows:

$$\mathit{split}(\mathit{format}(\mathit{merge}(a_1, a_2, \ldots, a_n, |), \mathit{udf}), |)$$

where $\mathit{split}$ and $\mathit{merge}$ are $m$-ary and $n$-ary versions of the Split and Merge transforms defined in Section 3, respectively, and $\mathit{udf}$ is a UDF that converts $a_1 | a_2 | \cdots | a_n$ into $b_1 | b_2 | \cdots | b_m$. While this approach allows us to use only a few transforms, it forces us to write unnecessary UDFs.

However, the use of the Drop and Copy transforms allows one to do the transformation using only the UDFs $g_1, \ldots, g_m$; these UDFs are essential because they are explicitly used in the definitions of $b_1, \ldots, b_m$^9. Since a given $a_j$ may be used in multiple conversion functions $g_i$, and a Format automatically drops the old value (Table 1), we need to make an explicit copy of it. Hence, to form $b_i$, we first make a Copy of $a_{i_1}, \ldots, a_{i_l}$, Merge these to form $a_{i_1} | a_{i_2} | \cdots | a_{i_l}$, and apply $g_i$ on this merged value to form $b_i$. After applying this process to form $b_1, \ldots, b_m$, we must Drop $a_1, \ldots, a_n$.
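For concreteness, a small worked instance of this construction (our own example, following the recipe above): to map $(a_1, a_2)$ to $(b_1, b_2)$ with $b_1 = g_1(a_1, a_2)$ and $b_2 = g_2(a_2)$, we Copy $a_1$ and $a_2$, Merge the copies to form $a_1 | a_2$, and Format the merged value with $g_1$ to obtain $b_1$; then we Copy $a_2$ again and Format that copy with $g_2$ to obtain $b_2$; finally, we Drop the original $a_1$ and $a_2$.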

A.2 Power of Horizontal Transforms

We prove that by combining horizontal transforms, vertical transforms, and Format, we can perform one-to-many transformations of rows in a table. In addition, by using Format for demoting, together with Fold and Unfold, we can move information between the schema and the data, and thereby flatten and unflatten tables. The Fold and Unfold transforms are essentially the same as the restructuring operators of SchemaSQL; the only restructuring operators of SchemaSQL that we miss are Unite and Split, which are used for manipulating multiple tables. For a more detailed analysis of the power of these restructuring operators for flattening tables, see [25, 16].

Theorem 2: Horizontal Transforms, when combined with Vertical Transforms and Format, can perform all one-to-many mappings of rows.

Proof:

Suppose that we want to map a row $(a_1, a_2, \ldots, a_n)$ to a set of rows $\{(b_{1,1}, \ldots, b_{1,m}), (b_{2,1}, \ldots, b_{2,m}), \ldots, (b_{k,1}, \ldots, b_{k,m})\}$. Note that the number of output rows $k$ can itself vary as a function of $(a_1, \ldots, a_n)$. Let $K$ be the maximum value of $k$ over all rows in the domain of the desired mapping. Assume that $|$ is a character not present in the alphabet from which the values are chosen.

We first perform a one-to-one mapping on $(a_1, \ldots, a_n)$ to form $(b_{1,1} | b_{1,2} | \cdots | b_{1,m},\ b_{2,1} | b_{2,2} | \cdots | b_{2,m},\ \ldots,\ b_{k,1} | b_{k,2} | \cdots | b_{k,m},\ \mathrm{NULL}, \mathrm{NULL}, \ldots)$, with $K - k$ NULLs at the end. We then perform a $K$-way Fold, and a Filter to remove all the resulting NULLs. Finally, we perform an $m$-way Split by $|$ to get the desired mapping.
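For concreteness, a small worked instance (our own example): with $n = 1$, $m = 1$, and $K = 2$, mapping the row $(\text{"x y"})$ to the two rows $(x)$ and $(y)$ proceeds by a one-to-one mapping of $(\text{"x y"})$ to $(x, y)$, followed by a 2-way Fold producing the rows $(x)$ and $(y)$; the Filter removes nothing here since $k = K$, and the final 1-way Split is trivial.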

^9 In many cases it will be possible to express these functions $g_1, \ldots, g_m$ in terms of regular-expression and arithmetic-expression based Formats, thus avoiding any user programming.


Family    members                       Family
Smith     Joan                          Smith      Joan     Mary     NULL
Smith     Mary          UnfoldSet       Latimer    Bob      George   Anna
Latimer   Bob           =========>
Latimer   George
Latimer   Anna

Figure 17: Unfolding a set of values, without an explicit column name to align.

B Unfolding Sets of Values

As mentioned in Section 3.4, we often need to unfold sets of values, where there is no column name with which to align the unfolded values. Figure 17 shows an example. In this case any alignment suffices. Our main Unfold operator needs a column from which it can promote column names for the unfolded values, and it uses these names for alignment. Hence we use the following variant of Unfold for sets.

$\mathit{UnfoldSet}(T, i)$ on the $i$'th column of a table $T$ with $n$ columns named $c_1 \ldots c_n$ (a column with no name is assumed to have a NULL name) produces a new table with $n + m - 1$ columns named $c_1 \ldots c_{i-1}, c_{i+1} \ldots c_n, \mathrm{NULL}, \mathrm{NULL}, \ldots$ ($m$ NULLs), where $m$ is the size of the largest set of rows in $T$ with identical values in all columns except the $i$'th column.

For any maximal set of $k$ tuples $S = \{(a_1, a_2, \ldots, a_{i-1}, v_l, a_{i+1}, \ldots, a_n) \mid l = 1, 2, \ldots, k\}$ with identical values in all columns except the $i$'th, UnfoldSet generates one tuple $(a_1, \ldots, a_{i-1}, a_{i+1}, \ldots, a_n, v_1, v_2, \ldots, v_k, \mathrm{NULL}, \mathrm{NULL}, \ldots)$, with $m - k$ NULLs at the end. Note that the ordering of $v_1, \ldots, v_k$ is not specified by the definition, because UnfoldSet does not enforce an alignment: in Figure 17, the family members' names could be permuted in any way in the resulting table. In an implementation, a default ordering, such as lexicographic, must be used.
