potter’s wheel: an interactive data cleaning system vijayshankar raman joseph m. hellerstein

29
Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein

Upload: moses-flowers

Post on 16-Dec-2015

216 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein

Potter’s Wheel: An Interactive Data Cleaning

System

Vijayshankar Raman

Joseph M. Hellerstein

Page 2: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein

Outline

Background

Potter’s Wheel architecture

Discrepancy detection

Interactive transformation

Conclusions and Future Work

Page 3: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein

Motivation

Dirty data common E.g., in content integration, e-catalogs

Inter-organizational differences in data representation Home Depot: 60,000 suppliers!

Data often scraped off web pages, etc. E.g. in centralized systems

Data entry “errors”, poor integrity constraints

Cleansing a prereq for analysis, xactionsCleansing done by “content managers” Ease of use critical!

Standards can help a bit (e.g. UDDI) But graphical tools are the name of the game

Page 4: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein

Current solutions

Detect errors in data “eyeball” data in a spreadsheet data auditing tools domain-specific algorithms

Code up transforms to fix errors “ETL” (extract/transform/load) tools from warehousing world string together domain-specific cleansing rules scripting languages, custom code, etc.

Apply transforms on dataIterate special cases nested discrepancies, e.g. 19997/10/31

Code

Apply

Detect

Page 5: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein

Problems

Slow, batch tasks

Significant human effort! Specification of transforms

regular expressions, grammars, custom scripts, etc.

Discrepancy detection notion of discrepancy domain-dependent want a mix of custom and standard techniques want to apply on parts of the data values

Rebecca by Daphne du Maurier (Mass Market Paperback) $6.29 ****

Sonnet 19. Craig W.J., ed. 1914. The Oxford Shakespeare

The Big Four Agatha Christie, Mass market paperback 5.39 10%

(from bartleby.com, bn.com)

Page 6: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein

Outline

Background

Potter’s Wheel architecture

Discrepancy detection

Interactive Transformation

Conclusions and Future Work

Page 7: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein

Potter’s Wheel: Design Goals

Eliminate wait time during each step Even on big data! Use Online Reordering (VLDB ‘99), sampling Ensure transform results can be seen/undone instantly Compile/optimize sequence of transforms when happy

Eliminate programming, but keep user “in the loop” Semi-automatic, “direct manipulation” GUI Support & leverage “eyeball” detection, verification (human input) Point-and-click transformation “by example”

Unify detection and transformation Detection always runs online in the background Detection always runs on transformed “view” of data

Extensibility Domain experts (vendors) should be able to plug

in detectors/transforms

A mixed (“Systems!”) design challenge: Query Processing, HCI, Learning

Limited appreciationfor this kind of systems work

Page 8: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein

Potter’s Wheel UIData read so far

Page 9: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein

Dataflow in Potter’s Wheel

Transformationengine

Sp

rea

dsh

eet

dis

pla

y

Optimized program

Onlinereorderer

Data source

Discrepancy detector

compile

get page scrollbar pos.n

specify/undo transforms

scroll

check for errors

Page 10: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein

Outline

BackgroundPotter’s Wheel architectureDiscrepancy detection Domains in Potter’s Wheel Structure inference

Interactive TransformationConclusions and Future Work

Page 11: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein

Discrepancy Detection

Challenge: find discrepancies in a column Structure inference: Given:

A set of (possibly composite) data items, including discrepancies

A set of user-defined “domains” (atomic types) Choose a “structure” for the set

A string of domains (w/repetition) that best fits the data E.g. for “March 17, 2000”:

* alpha* digit*, digit* [Machr]* 17, int

Report rows that do not fit chosen domain

PS: Must be an online algorithm!

Page 12: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein

Extensible Domains

As in Object-Relational, keep domains opaque.

class Domain {// Required inclusion functionboolean match(char *value);

// Helps in structure extractionint cardinality(int length);

// For probabilistic discrepancy checkingfloat matchWithConfidence(char *value, int dataSetSize);void updateState(char *value);

// Helps in parsingboolean isRedundantAfter(Domain d);

}

e.g. integer, ispell word, money, standard part names

Page 13: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein

Evaluating Structure Fit

Three desired characteristics Recall

match as many values as possible

Precision flag as many real discrepancies as possible e.g. Month day, day over alpha* digit*, digit*

Conciseness avoid over-fitting examples, make use of the domains e.g. alpha* digit*, digit* over March 17, 2000

Page 14: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein

Evaluating Structure Fit, cont.

Given structure S = d1d2…dp, string vi, how good is S?Minimum Description Length (MDL) principle

Rissanen, ‘78, etc. DL(vi,S) = length of theory for S + length to encode string vi with S

Computing DL(v,S)1) Length of theory = p log (number of domains known)

2) If vi doesn’t match S, encode it explicitly

3) Else encode vi = wi,1 wi,2 …wi,p where wi,j dj

Encode length of each wi,j

Encode each wi,j among all dj’s of length j use cardinality function

DL = AVGi((1) + (2) + (3)) = AVGi (UnConciseness + UnPrecision + UnRecall)

Choose structure with minimum DL(v,S) Hard search problem; heuristics in paper

Page 15: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein

Potter’s Wheel UI

Page 16: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein

Outline

Background

Potter’s Wheel architecture

Discrepancy detection

Interactive Transformation transforms split-by-example

Conclusions and Future Work

Page 17: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein

Interactive transformation

Sequence of simple visual transforms rather than a single complex program

Each transform must be easy to specify immediately applicable on screen rows

Must be able to undo transforms compensatory transforms not always possible everything REDO-oriented at display-time

no need for UNDO!

Page 18: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein

Transforms in Potter’s Wheel

Value translation Format(value) – reg. expr. substitution, arithmetic ops,

One-to-one row mappings Add/Drop/Copy columns Merge,Split columns Divide column by predicate

One-to-many row mappings Fold columns

adapted from Fold of SchemaSQL[LSS’96] Resolve some higher-order differences

Page 19: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein

Example (1)

2 Merges

Format'(.*), (.*)' to '\2 \1'

Stewart,Bob

Dole,JerryDavis

Marsh

Anna

Joan

StewartAnna Davis

DoleJoan MarshJerry

Bob Bob

Jerry

Stewart

DoleAnna

Joan

Davis

Marsh

Split at ' '

Anna

Joan

Davis

Marsh

Bob Stewart

Jerry Dole

Page 20: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein

Example (2)

Divide (like ’.*,.*’)

Anna Davis

Joan Marsh

Stewart,Bob

Dole,Jerry

Stewart,BobAnna

Dole,JerryJoan

Davis

Marsh

Page 21: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein

Example (3)

Split

Fold

2 Formats(demotes) Ann

BobMath:43Math:96

Bio:78Bio:54

Name Math4396

AnnBob

7854

Bio

Ann

BobBob

AnnName

Math:96Bio:54

Math:43Bio:78

MathBio

MathBio

AnnAnnBobBob

Name43789654

Name

Page 22: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein

Power all one-to-{one,many} row mappings interactive many-to-{one,many} mappings hard to do interactively

must find/display companion rows for each row to transform higher-order transforms

Specification click on appropriate columns and choose transform but, Split is hard

important transform in screen-scraping/wrapping need to enter regular expressions not always unambiguous e.g.

want to leverage domains Taylor, Jane, $52,072

Tony Smith, 1,00,533

Transforms summary

Page 23: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein

Split by Example

User marks split positions on examplesSystem infers structure, then parses rest

Parsing must identify matching substrings for structures multiple alternate parses could work

search heuristics explored in paper DecreasingSpecificity seems good

Taylor, Jane|, $52,072

Tony Smith|, 1,00,533

infer structures < * >, <‘,’ Money>

Page 24: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein

Related Work

Transformation languages -- e.g. SchemaSQL, YATLData cleaning tools commercial -- ETL and auditing tools research -- e.g. AJAX, Lee/Lu/Ling/Ko ’99

Custom auditing algorithms de-duplication (e.g. Hernandez/Stolfo ’97) outlier detection (e.g. Ramaswamy/Rastogi/Shim ’00) dependency inference (e.g. Kivinen/Manilla ’95)

Structure extraction techniques e.g. XTRACT, DataMold, Brazma ‘94

Transformation tools text-processing tools – e.g. perl/awk/sed, LAPIS screen-scraping -- e.g. NoDoSE, XWRAP, OnDisplay, Cohera Connect,

Telegraph Screen Scraper (TeSS)

Middleware, schema mapping

Page 25: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein

Conclusions

Interactive data cleaning Couple transformation and discrepancy detection Perform both interactively

short, immediately applied steps specify visually, undo if needed contrast with declarative language

Parse values before discrepancy detection user-defined domains helpful

Software online (http://control.cs.berkeley.edu/abc)

Page 26: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein

Looking Ahead

Generalizing transform by exampleTransforming nested data (XML, HTML)More complex domain-expressionsExtend to generalized query processor client in Telegraph specify initial query refine by specifying transforms as results stream in dynamically choose transforms to be pushed into

server See Shankar’s upcoming thesis, Telegraph papers

Page 27: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein

Backup Slides

Page 28: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein

Optimization of Transform Sequences

In Potter’s Wheel system generates program at end hence opportunities for optimization

remove redundant operationsavoid expensive memory copies/allocations/deallocationsby careful pipeliningmaterialize intermediate strings only when necessaryup to 110% speedup for C programs C programs 10x faster than Perl scripts

Page 29: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein

Example

vs