unsupervised inference of data formats in human-readable notation christopher scaffidi carnegie...
Post on 19-Dec-2015
226 views
TRANSCRIPT
Unsupervised Inference of Unsupervised Inference of Data Formats inData Formats in
Human-Readable NotationHuman-Readable Notation
Christopher Scaffidi
Carnegie Mellon University
22
Target audienceTarget audience
• In 2012, we project that there will be 90 million computer end users (“EUs”) in American workplaces.
• Of these, at least half will create spreadsheets, databases, and/or web applications. These are called end-user programmers (“EUPs”).
• For professional programmers, programs are a deliverable.
• For EUPs, programs are a means to an end.
motivation ● overview ● algorithm ● evaluation ● conclusion
33
Current practice:Current practice:Storing data as stringsStoring data as strings
• Typical tools of EUPs:– Excel “text” cells
– Access/MSSQL “varchar” fields
– FrontPage/Dreamweaver “textfield” inputs
• Validation involves…– Learning an exotic new notation (VBScript, regexps, etc)
– Writing cumbersome expressions in that notation
• Most EUPs do not know these notations and have no time, interest, or incentive to learn the notations.
motivation ● overview ● algorithm ● evaluation ● conclusion
44
Our Topes system to date…Our Topes system to date…
• Formats are presented in human-readable notationin our format editor– Format = sequence of parts with constraints on parts
– Constraints can be “often” true (rather than “always”)
• The format is automatically converted to a context-free grammar, with constraints attached to productions.
• At runtime, our parser checks values against formats, returning a confidence in the range [0,1] for each value.
motivation ● overview ● algorithm ● evaluation ● conclusion
55
Needed: Format inferenceNeeded: Format inference
• Problem: To date, the system offers limited support for helping users to get started.
• Although users do not need to learn specialized notation, there is still the cognitive work of…– examining data
– breaking it into parts
– representing parts in the format
• Solution: Infer a boilerplate format from examples
motivation ● overview ● algorithm ● evaluation ● conclusion
66
Talk OutlineTalk Outline
• Motivation / Problem
• Solution– Overview of Topei
– Inference algorithm
• Evaluation
• Conclusion
motivation ● overview ● algorithm ● evaluation ● conclusion
77
PrototypePrototypeTask flow diagramTask flow diagram
Algorithm infers a format from cell
values
User reviews and customizes
format
User creates a format from
scratch
User loads an existing format
from a file
Plug-in flags cells that don’t match format
User highlights spreadsheet
cells
[1][6]
or
or
motivation ● overview ● algorithm ● evaluation ● conclusion
88
Sample task: validating a spreadsheetSample task: validating a spreadsheetwith the prototype we have builtwith the prototype we have built
• The second column is “supposed” to contain first names, but some outlier values containing initials have snuck in.
motivation ● overview ● algorithm ● evaluation ● conclusion
99
Sample task: validating a spreadsheetSample task: validating a spreadsheetCustomizing an inferred formatCustomizing an inferred format
• Inferred format is presented in editor with sentence-like prompts to improve human-readability
• User can specify meaningful names for parts
motivation ● overview ● algorithm ● evaluation ● conclusion
1010
Sample task: validating a spreadsheetSample task: validating a spreadsheetCustomizing constraints in our prototypeCustomizing constraints in our prototype
• User can add/edit constraints
motivation ● overview ● algorithm ● evaluation ● conclusion
1111
Sample task: validating a spreadsheetSample task: validating a spreadsheetFlagging potential errorsFlagging potential errors
• A red flag (reviewer comment, actually) appears on cells that do not match the format; mouse over for message
motivation ● overview ● algorithm ● evaluation ● conclusion
1212
Our algorithm has 2 phasesOur algorithm has 2 phases
Input: An array of strings
Phase 1: Identify format parts
Phase 2: Identify constraints on each part of each format
Output: An array of formats– Sorted according to how many examples they match
motivation ● overview ● algorithm ● evaluation ● conclusion
1313
Phase 1: Identify format partsPhase 1: Identify format parts
• For each string, replace each character with its class, then collapse runs, generating a string “signature”– Supported character classes:
A uppercase letter
a lowercase letter
0 digit
• Example:[email protected] [email protected]
[email protected] [email protected]
[email protected] [email protected]
[email protected] [email protected]
[email protected] [email protected]
[email protected] [email protected]
motivation ● overview ● algorithm ● evaluation ● conclusion
1414
Phase 1: Identify format partsPhase 1: Identify format parts
• Pack strings with identical signatures (often leads to significant performance improvement)
• Example:[email protected] [email protected]
[email protected] [email protected]
[email protected] <<packed with 1st, above>>
[email protected] [email protected]
[email protected] [email protected]
[email protected] [email protected]
motivation ● overview ● algorithm ● evaluation ● conclusion
1515
Phase 1: Identify format partsPhase 1: Identify format parts
• Align signatures based on separators
• Example:[email protected] a @ a . a
[email protected] a0 @ a . a
[email protected] A @ A . a
[email protected] [email protected]
[email protected] [email protected]
motivation ● overview ● algorithm ● evaluation ● conclusion
1616
Phase 1: Identify format partsPhase 1: Identify format parts
• Abstract to least general composite character class, yielding the parts of each format.
• Example (3 formats below):[email protected] a @ a . a
[email protected] a0 @ a . a
[email protected] A @ A . a
a0A aA a
[email protected] a @ a - a . a
[email protected] a . a @ a . a . a
motivation ● overview ● algorithm ● evaluation ● conclusion
1717
Phase 2: Identify constraints on each partPhase 2: Identify constraints on each part
• Constrain each part’s contents to the character classes
• Require indicated separators before/after parts
• Infer an additional content constraint that is “often” true:– Must be in a set of 3 or fewer literals?
– Must be in a numeric range?
– Must start with or end with certain characters?
• A content constraint is inferred if it covers at least 95% of the examples supporting that format’s signature.
• Afterward, the user can review/customize format.
motivation ● overview ● algorithm ● evaluation ● conclusion
1818
Evaluation as an outlier finderEvaluation as an outlier finder
• Outlier finding:– Infer a format from example values
– Use the inferred format to check the examples Reveals “outliers” that might contain typos or other errors
• Comparison algorithm: LapisLapis example@DayOfMonth is Number equal to /[12][0-9]|3[01]|0?[1-9]/ ignoring nothing@ShortMonth is Number equal to /1[012]|0?[1-9]/ ignoring nothing@ShortYear is Number equal to /\d\d/ ignoring nothingDate is flatten @ShortMonth then @DayOfMonth then @ShortYear ignoring either Spaces or Punctuation
motivation ● overview ● algorithm ● evaluation ● conclusion
1919
Evaluation dataEvaluation dataDrawn from EUSES spreadsheet corpusDrawn from EUSES spreadsheet corpus
• 6288 US phone numbers in 37 columns– First cell in column contains “phone”
– And at least 20 cells have exactly 10 digits
– And at least 2/3 of cells have exactly 10 digits
• 1124 country names in 7 columns– First cell in column contains “country”
– And there are at least 20 cells
– And at least one cell contains “Portugal”
motivation ● overview ● algorithm ● evaluation ● conclusion
2020
Run each algorithm (Topei & Lapis)Run each algorithm (Topei & Lapis)and compare their output to hand-labelingand compare their output to hand-labeling
• For determining “true outliers” in calculating accuracy:– Outlier phone numbers have an area code that is
not in service, or if they contain errant separators such as spaces not shared by most cells in the column.
– Outlier country names contain abbreviations, misspellings or a different name than the one usually used by English-speakers, except for a specific list of allowed exceptions that are very commonly used (e.g.: Brasil, US, UK) [note: allowing these exceptions hurts Topei’s accuracy]
motivation ● overview ● algorithm ● evaluation ● conclusion
2121
Results: Topei’s precision/recall Results: Topei’s precision/recall exceed Lapis’sexceed Lapis’s
• Standard machine learning measures for outlier finding– Precision = # outliers found / # outliers claimed– Recall = # outliers found / # true outliers
Task Algorithm Precision (%) Recall (%)
Country Topei 56.5 94.6
Country Lapis 46.7 7.6
Phone Topei 97.7 99.8
Phone Lapis 44.0 2.4
motivation ● overview ● algorithm ● evaluation ● conclusion
2222
Limitations & future workLimitations & future work
• Topei still makes mistakes:– Doesn’t infer constraints aggressively enough
– Doesn’t recognize non-ASCII chars in character classes
– Doesn’t handle formats with repeating parts
• Need deeper integration with EUPs’ tools
• Computational complexity:– Is intended to be O(# examples), seems to be true
– More careful verification needed
• Usability has not yet been evaluated in user study
motivation ● overview ● algorithm ● evaluation ● conclusion
2323
Thank You…Thank You…
• …to you for your interest and attention
• …to INSTICC for the opportunity to present
• …to NSF and EUSES for funding (ITR-0325273 and CCF-0438929)
motivation ● overview ● algorithm ● evaluation ● conclusion
2424
Another example:Another example:Carnegie Mellon University phone #: 8-1234Carnegie Mellon University phone #: 8-1234
motivation ● overview ● algorithm ● evaluation ● conclusion
2525
IntegrationIntegration
• Formats can be inferred from…– Spreadsheet cells
– Database queries (e.g.: Access/MSSQL)
– Arbitrary collection of text strings (via C#)
• Formats can then be used without modification in other venues, as well.– E.g.: infer a format from spreadsheet cells, then use it to
create a trigger for a database table
motivation ● overview ● algorithm ● evaluation ● conclusion
2626
Related WorkRelated Work
• Many algorithms train a recognizer to notice features – See (Mitchell, 1997) for a summary
– Such algorithms do not infer a human-editable format.
• Others generate formats in specialized notation. – (Miller, 2001) (Blackwell, 2001) (Lerman, 2000),
(Lieberman, 2001) (Nardi, 1998)
– Regular expressions and CFGs have limited readability.
• Several tools recognize or manipulate some of the same kinds of data as Topei.– (Hong, 2006) (Pandit, 1997) (Stylos 2004)
– Custom formats are unsupported (only hardcoded formats)
motivation ● overview ● algorithm ● evaluation ● conclusion