unsupervised inference of data formats in human-readable notation christopher scaffidi carnegie...

Unsupervised Inference of Unsupervised Inference of Data Formats inData Formats in

Human-Readable NotationHuman-Readable Notation

Christopher Scaffidi

Carnegie Mellon University

22

Target audienceTarget audience

• In 2012, we project that there will be 90 million computer end users (“EUs”) in American workplaces.

• Of these, at least half will create spreadsheets, databases, and/or web applications. These are called end-user programmers (“EUPs”).

• For professional programmers, programs are a deliverable.

• For EUPs, programs are a means to an end.

motivation ● overview ● algorithm ● evaluation ● conclusion

33

Current practice:Current practice:Storing data as stringsStoring data as strings

• Typical tools of EUPs:– Excel “text” cells

– Access/MSSQL “varchar” fields

– FrontPage/Dreamweaver “textfield” inputs

• Validation involves…– Learning an exotic new notation (VBScript, regexps, etc)

– Writing cumbersome expressions in that notation

• Most EUPs do not know these notations and have no time, interest, or incentive to learn the notations.


44

Our Topes system to date…Our Topes system to date…

• Formats are presented in human-readable notationin our format editor– Format = sequence of parts with constraints on parts

– Constraints can be “often” true (rather than “always”)

• The format is automatically converted to a context-free grammar, with constraints attached to productions.

• At runtime, our parser checks values against formats, returning a confidence in the range [0,1] for each value.


55

Needed: Format inferenceNeeded: Format inference

• Problem: To date, the system offers limited support for helping users to get started.

• Although users do not need to learn specialized notation, there is still the cognitive work of…– examining data

– breaking it into parts

– representing parts in the format

• Solution: Infer a boilerplate format from examples


66

Talk OutlineTalk Outline

• Motivation / Problem

• Solution– Overview of Topei

– Inference algorithm

• Evaluation

• Conclusion


77

PrototypePrototypeTask flow diagramTask flow diagram

Algorithm infers a format from cell

values

User reviews and customizes

format

User creates a format from

scratch

User loads an existing format

from a file

Plug-in flags cells that don’t match format

User highlights spreadsheet

cells

[1][6]

or

or


88

Sample task: validating a spreadsheetSample task: validating a spreadsheetwith the prototype we have builtwith the prototype we have built

• The second column is “supposed” to contain first names, but some outlier values containing initials have snuck in.


99

Sample task: validating a spreadsheetSample task: validating a spreadsheetCustomizing an inferred formatCustomizing an inferred format

• Inferred format is presented in editor with sentence-like prompts to improve human-readability

• User can specify meaningful names for parts


1010

Sample task: validating a spreadsheetSample task: validating a spreadsheetCustomizing constraints in our prototypeCustomizing constraints in our prototype

• User can add/edit constraints


1111

Sample task: validating a spreadsheetSample task: validating a spreadsheetFlagging potential errorsFlagging potential errors

• A red flag (reviewer comment, actually) appears on cells that do not match the format; mouse over for message


1212

Our algorithm has 2 phasesOur algorithm has 2 phases

Input: An array of strings

Phase 1: Identify format parts

Phase 2: Identify constraints on each part of each format

Output: An array of formats– Sorted according to how many examples they match


1313

Phase 1: Identify format partsPhase 1: Identify format parts

• For each string, replace each character with its class, then collapse runs, generating a string “signature”– Supported character classes:

A uppercase letter

a lowercase letter

0 digit

• Example:[email protected] [email protected]

[email protected] [email protected]






1414


• Pack strings with identical signatures (often leads to significant performance improvement)

• Example:[email protected] [email protected]


[email protected] <<packed with 1st, above>>





1515


• Align signatures based on separators

• Example:[email protected] a @ a . a

[email protected] a0 @ a . a

[email protected] A @ A . a




1616


• Abstract to least general composite character class, yielding the parts of each format.

• Example (3 formats below):[email protected] a @ a . a

[email protected] a0 @ a . a

[email protected] A @ A . a

a0A aA a

[email protected] a @ a - a . a

[email protected] a . a @ a . a . a


1717

Phase 2: Identify constraints on each partPhase 2: Identify constraints on each part

• Constrain each part’s contents to the character classes

• Require indicated separators before/after parts

• Infer an additional content constraint that is “often” true:– Must be in a set of 3 or fewer literals?

– Must be in a numeric range?

– Must start with or end with certain characters?

• A content constraint is inferred if it covers at least 95% of the examples supporting that format’s signature.

• Afterward, the user can review/customize format.


1818

Evaluation as an outlier finderEvaluation as an outlier finder

• Outlier finding:– Infer a format from example values

– Use the inferred format to check the examples Reveals “outliers” that might contain typos or other errors

• Comparison algorithm: LapisLapis example@DayOfMonth is Number equal to /[12][0-9]|3[01]|0?[1-9]/ ignoring nothing@ShortMonth is Number equal to /1[012]|0?[1-9]/ ignoring nothing@ShortYear is Number equal to /\d\d/ ignoring nothingDate is flatten @ShortMonth then @DayOfMonth then @ShortYear ignoring either Spaces or Punctuation


1919

Evaluation dataEvaluation dataDrawn from EUSES spreadsheet corpusDrawn from EUSES spreadsheet corpus

• 6288 US phone numbers in 37 columns– First cell in column contains “phone”

– And at least 20 cells have exactly 10 digits

– And at least 2/3 of cells have exactly 10 digits

• 1124 country names in 7 columns– First cell in column contains “country”

– And there are at least 20 cells

– And at least one cell contains “Portugal”


2020

Run each algorithm (Topei & Lapis)Run each algorithm (Topei & Lapis)and compare their output to hand-labelingand compare their output to hand-labeling

• For determining “true outliers” in calculating accuracy:– Outlier phone numbers have an area code that is

not in service, or if they contain errant separators such as spaces not shared by most cells in the column.

– Outlier country names contain abbreviations, misspellings or a different name than the one usually used by English-speakers, except for a specific list of allowed exceptions that are very commonly used (e.g.: Brasil, US, UK) [note: allowing these exceptions hurts Topei’s accuracy]


2121

Results: Topei’s precision/recall Results: Topei’s precision/recall exceed Lapis’sexceed Lapis’s

• Standard machine learning measures for outlier finding– Precision = # outliers found / # outliers claimed– Recall = # outliers found / # true outliers

Task Algorithm Precision (%) Recall (%)

Country Topei 56.5 94.6

Country Lapis 46.7 7.6

Phone Topei 97.7 99.8

Phone Lapis 44.0 2.4


2222

Limitations & future workLimitations & future work

• Topei still makes mistakes:– Doesn’t infer constraints aggressively enough

– Doesn’t recognize non-ASCII chars in character classes

– Doesn’t handle formats with repeating parts

• Need deeper integration with EUPs’ tools

• Computational complexity:– Is intended to be O(# examples), seems to be true

– More careful verification needed

• Usability has not yet been evaluated in user study


2323

Thank You…Thank You…

• …to you for your interest and attention

• …to INSTICC for the opportunity to present

• …to NSF and EUSES for funding (ITR-0325273 and CCF-0438929)


2424

Another example:Another example:Carnegie Mellon University phone #: 8-1234Carnegie Mellon University phone #: 8-1234


2525

IntegrationIntegration

• Formats can be inferred from…– Spreadsheet cells

– Database queries (e.g.: Access/MSSQL)

– Arbitrary collection of text strings (via C#)

• Formats can then be used without modification in other venues, as well.– E.g.: infer a format from spreadsheet cells, then use it to

create a trigger for a database table


2626

Related WorkRelated Work

• Many algorithms train a recognizer to notice features – See (Mitchell, 1997) for a summary

– Such algorithms do not infer a human-editable format.

• Others generate formats in specialized notation. – (Miller, 2001) (Blackwell, 2001) (Lerman, 2000),

(Lieberman, 2001) (Nardi, 1998)

– Regular expressions and CFGs have limited readability.

• Several tools recognize or manipulate some of the same kinds of data as Topei.– (Hong, 2006) (Pandit, 1997) (Stylos 2004)

– Custom formats are unsupported (only hardcoded formats)


unsupervised inference of data formats in human-readable notation christopher scaffidi carnegie...

Documents