Learning Semantic String Transformations from Examples
Rishabh Singh and Sumit Gulwani
FlashFill
Transformations
• Syntactic Transformations
  – Concatenation of regular-expression-based substrings
  – “VLDB2012” → “VLDB”
• Semantic Transformations
  – More than just characters
  – “1/5/2010” → “May 1st 2010”
Semantic Transformations
• Semantic information as relational tables
  – 1 → January, 2 → February
• Learn table lookup queries
  – VLOOKUP macro 2nd most problematic
Outline
• Lookup Transformations
• Lookup + Syntactic Transformations
• Case Studies
Table Lookup Transformations
Demo
Learning Framework
Input Strings → F → Output String (candidates F1, …, Fn)
1. Domain-specific Language L
2. Algorithm to learn all Fs from (i,o)
Lookup Transformation Language
Emp Record
SSN          EmpId  Name
027-36-4557  1254   John Henry
034-83-7683  2412   William Johnson
044-58-3429  1125   Steve Russell
018-45-8949  4257   Ian Jordan
023-34-3254  6418   Mary Dina
Input v1     Output
044-58-3429  Steve Russell
Select(Name, EmpRecord, (SSN = v1))
Example - Lookup
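The lookup expression above can be read as a keyed row selection. The following is a minimal sketch of how such a Select could be evaluated, assuming a hypothetical in-memory encoding of the table as a list of dictionaries; the helper names `select` and `lookup_name` are illustrative and not part of the original slides.

```python
# Hypothetical in-memory encoding of the EmpRecord table from the slide.
EMP_RECORD = [
    {"SSN": "027-36-4557", "EmpId": "1254", "Name": "John Henry"},
    {"SSN": "034-83-7683", "EmpId": "2412", "Name": "William Johnson"},
    {"SSN": "044-58-3429", "EmpId": "1125", "Name": "Steve Russell"},
    {"SSN": "018-45-8949", "EmpId": "4257", "Name": "Ian Jordan"},
    {"SSN": "023-34-3254", "EmpId": "6418", "Name": "Mary Dina"},
]


def select(column, table, condition):
    """Return table[column] for the unique row matching `condition`.

    `condition` maps column names to required values (a conjunction of
    equality predicates). Returns None when zero or several rows match.
    """
    matches = [row for row in table
               if all(row[c] == v for c, v in condition.items())]
    return matches[0][column] if len(matches) == 1 else None


def lookup_name(v1):
    # Mirrors Select(Name, EmpRecord, (SSN = v1)).
    return select("Name", EMP_RECORD, {"SSN": v1})


print(lookup_name("044-58-3429"))  # Steve Russell
```

Returning None for a missing or ambiguous key is a design choice of this sketch; the slides do not say how such cases are handled.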
ItemRec
ItemId  Item
ST-340  Stroller
BI-567  Bib
DI-328  Diapers
WI-989  Wipes
AS-469  Aspirator
PriceRec
ItemId  Price
ST-340  $145.67
BI-567  $3.56
DI-328  $21.45
WI-989  $5.12
AS-469  $2.56
Input v1  Output
Stroller  $145.67
Select(Price, PriceRec, (ItemId = Select(ItemId, ItemRec, (Item = v1))))
Example – Transitive Lookup
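A transitive lookup feeds the result of one Select into the condition of another. The sketch below illustrates this for the ItemRec and PriceRec tables, under the assumption that tables are encoded as lists of dictionaries; `select` and `price_of` are hypothetical helper names introduced only for illustration.

```python
# Hypothetical encodings of the two tables shown on the slide.
ITEM_REC = [
    {"ItemId": "ST-340", "Item": "Stroller"},
    {"ItemId": "BI-567", "Item": "Bib"},
    {"ItemId": "DI-328", "Item": "Diapers"},
    {"ItemId": "WI-989", "Item": "Wipes"},
    {"ItemId": "AS-469", "Item": "Aspirator"},
]
PRICE_REC = [
    {"ItemId": "ST-340", "Price": "$145.67"},
    {"ItemId": "BI-567", "Price": "$3.56"},
    {"ItemId": "DI-328", "Price": "$21.45"},
    {"ItemId": "WI-989", "Price": "$5.12"},
    {"ItemId": "AS-469", "Price": "$2.56"},
]


def select(column, table, condition):
    """Value of `column` in the unique row satisfying `condition`, else None."""
    matches = [row for row in table
               if all(row[c] == v for c, v in condition.items())]
    return matches[0][column] if len(matches) == 1 else None


def price_of(v1):
    # Inner lookup: Select(ItemId, ItemRec, Item = v1)
    item_id = select("ItemId", ITEM_REC, {"Item": v1})
    # Outer lookup: Select(Price, PriceRec, ItemId = <inner result>)
    return select("Price", PRICE_REC, {"ItemId": item_id})


print(price_of("Stroller"))  # $145.67
```

If the inner lookup fails, `item_id` is None, no PriceRec row matches, and the outer lookup also yields None.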
Learn Query
Synthesis Algorithm: GenerateStr_t
• Input: (input state, output string)
• Output: all conforming expressions
• Reachability algorithm from input strings
Strings reachable from input row: 044-58-3429
Emp Record
SSN          EmpId  Name
027-36-4557  1254   John Henry
034-83-7683  2412   William Johnson
044-58-3429  1125   Steve Russell
018-45-8949  4257   Ian Jordan
Nodes: η1, η2, η3
Progs[η1] = {v1}
GenerateStr_t
Strings in table rows of visited nodes: 044-58-3429, 1125, Steve Russell
$B \equiv \{ \bigwedge_i C_i = \{ val^{-1}(T[C_i, r]) \} \}_j$
GenerateStr_t
… Repeat until k steps or fixpoint
GenerateStr_t
…
Steve Russell
η, Progs[η]
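The reachability idea behind GenerateStr_t can be sketched as a bounded breadth-first search over table rows. This is a simplified illustration rather than the authors' algorithm: it assumes single-column candidate keys only, identifies each node with its string value, and uses hypothetical names (`generate_str_t`, `is_key`, `TABLES`). Each stored expression refers to the string of the node it indexes on, so sub-expressions stay shared instead of being expanded.

```python
TABLES = {
    "EmpRecord": (
        ("SSN", "EmpId", "Name"),
        [
            ("027-36-4557", "1254", "John Henry"),
            ("034-83-7683", "2412", "William Johnson"),
            ("044-58-3429", "1125", "Steve Russell"),
            ("018-45-8949", "4257", "Ian Jordan"),
        ],
    ),
}


def is_key(rows, ci):
    """True if column ci has no repeated values (single-column candidate key)."""
    values = [row[ci] for row in rows]
    return len(set(values)) == len(values)


def generate_str_t(inputs, tables, k=3):
    """Map each reachable string to the set of one-step expressions producing it.

    An expression is either an input variable name or a tuple
    ("Select", target_column, table, key_column, key_node_string).
    """
    progs = {}
    for var, value in inputs.items():
        progs.setdefault(value, set()).add(var)
    frontier = set(progs)
    for _ in range(k):  # repeat until k steps or fixpoint
        new_nodes = set()
        for name, (cols, rows) in tables.items():
            for row in rows:
                for ci, cell in enumerate(row):
                    if cell not in frontier or not is_key(rows, ci):
                        continue
                    for cj, target in enumerate(row):
                        if cj == ci:
                            continue
                        if target not in progs:
                            new_nodes.add(target)
                        progs.setdefault(target, set()).add(
                            ("Select", cols[cj], name, cols[ci], cell))
        if not new_nodes:
            break
        frontier = new_nodes
    return progs


progs = generate_str_t({"v1": "044-58-3429"}, TABLES)
for expr in sorted(progs["Steve Russell"]):
    print(expr)
```

For the output string "Steve Russell" this sketch records two ways to reach the node: indexing the Name column by the SSN node, or by the EmpId node (which is itself reachable from the SSN).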
GenerateStr_t
• Sound and k-complete
  – t: number of reachable strings
  – p: number of candidate keys
  – m: maximum size of a candidate key
Data structure
• Maintains tree structure
  – share common sub-expressions
• CNF of Boolean Conditionals
  – independent column predicates
Intersect_t: $D_{t_1} \wedge D_{t_2}$
Synthesize Procedure
Synthesize((i1,o1), …, (in,on))
  P = GenerateStr_t(i1, o1)
  for j = 2 to n:
    P' = GenerateStr_t(ij, oj)
    P = Intersect_t(P', P)
  return P
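The Synthesize loop above can be illustrated with a naive stand-in in which sets of programs are stored explicitly as Python sets of expression strings, so that intersection is plain set intersection. This is only a sketch under stated assumptions: the slides describe a succinct shared data structure instead of explicit enumeration, and the table `T` and the names `generate` and `synthesize` are hypothetical.

```python
# Hypothetical table in which the first example alone is ambiguous:
# for input "x", columns B and C both contain the output "p".
TABLES = {
    "T": (("A", "B", "C"), [("x", "p", "p"), ("y", "q", "r")]),
}


def generate(tables, v1, out, depth=1):
    """All Select expressions of nesting depth <= depth that map v1 to out."""
    reach = {v1: {"v1"}}  # string value -> expressions computing it
    for _ in range(depth):
        nxt = {value: set(exprs) for value, exprs in reach.items()}
        for name, (cols, rows) in tables.items():
            for row in rows:
                for ci, cell in enumerate(row):
                    for expr in reach.get(cell, ()):
                        for cj, target in enumerate(row):
                            if cj != ci:
                                nxt.setdefault(target, set()).add(
                                    f"Select({cols[cj]}, {name}, {cols[ci]} = {expr})")
        reach = nxt
    return reach.get(out, set())


def synthesize(tables, examples, depth=1):
    """Intersect the program sets of all (input, output) examples."""
    (i1, o1), rest = examples[0], examples[1:]
    programs = generate(tables, i1, o1, depth)
    for i, o in rest:
        programs = generate(tables, i, o, depth) & programs
    return programs


print(sorted(synthesize(TABLES, [("x", "p")])))
print(sorted(synthesize(TABLES, [("x", "p"), ("y", "q")])))
```

With one example both `Select(B, T, A = v1)` and `Select(C, T, A = v1)` remain; the second example removes the expression over column C.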
Semantic String Transformations
Demo
Syntactic String Language [Gulwani POPL11]
Combined Language
Syntactic manipulations over lookup outputs
Syntactic manipulations before indexing
Synthesis Algorithm:
  – Reachability based on syntactic string matches
  – Boolean conditionals
GenerateStr_u
Input: SSN: 044-58-3429
Emp Record
SSN          EmpId  Name
027-36-4557  1254   John Henry
034-83-7683  2412   William Johnson
044-58-3429  1125   Steve Russell
018-45-8949  4257   Ian Jordan
Output: Mr. Steve Russell
GenerateStr_u
Input: SSN: 044-58-3429
Emp Record
SSN          EmpId  Name
027-36-4557  1254   John Henry
034-83-7683  2412   William Johnson
044-58-3429  1125   Steve Russell
018-45-8949  4257   Ian Jordan
GenerateStr'_t
GenerateStr_u
Set of reachable strings: { “SSN: 044-58-3429”, “044-58-3429”, “1125”, “Steve Russell” }
GenerateStr_u
GenerateStr_s: { “SSN: 044-58-3429”, “044-58-3429”, “1125”, “Steve Russell” } → Mr. Steve Russell
and in paper
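The two stages of the combined algorithm can be sketched as follows: first collect the strings reachable from the input when a table row is entered through a substring match, then explain the output as constant text around one of those strings. Both functions are heavily simplified assumptions made for illustration; in particular `explain_output` is only a stand-in for the syntactic learner GenerateStr_s, which handles regular-expression-based substrings and general concatenations, and all names here are hypothetical.

```python
EMP_ROWS = [
    ("027-36-4557", "1254", "John Henry"),
    ("034-83-7683", "2412", "William Johnson"),
    ("044-58-3429", "1125", "Steve Russell"),
    ("018-45-8949", "4257", "Ian Jordan"),
]


def reachable_strings(inp, rows):
    """Strings reachable from `inp`, entering a row when a cell occurs in a reached string."""
    reached = [inp]
    frontier = [inp]
    while frontier:
        current = frontier.pop()
        for row in rows:
            # Syntactic manipulation before indexing: a substring match suffices.
            if any(cell in current for cell in row):
                for cell in row:
                    if cell not in reached:
                        reached.append(cell)
                        frontier.append(cell)
    return reached


def explain_output(output, reached):
    """Decompose output as (constant prefix, reached string, constant suffix)."""
    explanations = []
    for s in reached:
        pos = output.find(s)
        if pos >= 0:
            explanations.append((output[:pos], s, output[pos + len(s):]))
    return explanations


reached = reachable_strings("SSN: 044-58-3429", EMP_ROWS)
print(reached)
print(explain_output("Mr. Steve Russell", reached))
```

On the slide's example the reachable set matches the four strings listed above, and the output is explained as the constant "Mr. " followed by the looked-up name.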
Experiments
• 50 benchmark problems
  – 12 , 38
• ~10^20 consistent expressions
  – Size of data structure: ~2000
• Performance: 96% less than 1 second
• Ranking: at most 3 examples (95% with 2 examples)
Related Work
• Matching strings for table joins
  – Record Matching [Elmagarmid et al. 07, Koudas et al. SIGMOD06]
  – Schema Matching [Dhamankar et al. SIGMOD04, Warren & Tompa VLDB06]
• Query Synthesis
  – from representative view [Das Sarma et al. ICDT10, Tran et al. SIGMOD09]
• Text-editing by example
  – QuickCode [Gulwani POPL11]
  – SMARTedit [Lau et al. ML03], Simultaneous Editing [Miller et al. USENIX01]
Thanks!
End-Users
Algorithm Designers
Software Developers
Large potential
Backup slides
Semantic String Transformations
Time (24 Hr) Time (12 Hr)
0930 9:30 AM
1520 3:20 PM
1648
0830
1015
2010
1012
1425
=TEXT(C,"00 00")+0
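In the table-lookup view taken by this talk, the time conversion can be sketched as a lookup on the hour followed by a syntactic rearrangement. The lookup table `HOUR_TABLE` and the function `to_12_hour` are hypothetical, and the table is generated programmatically here only to keep the sketch short.

```python
# Hypothetical lookup table: 24-hour value -> (12-hour value, AM/PM marker).
HOUR_TABLE = {
    f"{h:02d}": (str(h % 12 or 12), "AM" if h < 12 else "PM")
    for h in range(24)
}


def to_12_hour(hhmm):
    """Convert a four-digit 24-hour time such as "0930" to "9:30 AM"."""
    hour12, marker = HOUR_TABLE[hhmm[:2]]   # semantic step: table lookup
    return f"{hour12}:{hhmm[2:]} {marker}"  # syntactic step: concatenation


print(to_12_hour("0930"))  # 9:30 AM
print(to_12_hour("1520"))  # 3:20 PM
```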
Semantic String Transformations
Date Formatted Date
06-03-2008 Jun 3rd, 2008
03-26-2010
08-01-2009
09-24-2007
05-14-2010
07-20-1998
10-24-2004
08-24-1972
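The date example can be sketched the same way: a month-name lookup and a day-suffix rule, combined by concatenation. The sketch assumes the input is in MM-DD-YYYY order (consistent with the single filled-in row) and three-letter month abbreviations; `MONTH`, `ordinal` and `format_date` are hypothetical names, and only the first output below is given on the slide.

```python
# Hypothetical month table: two-digit month -> three-letter abbreviation.
MONTH = {
    "01": "Jan", "02": "Feb", "03": "Mar", "04": "Apr",
    "05": "May", "06": "Jun", "07": "Jul", "08": "Aug",
    "09": "Sep", "10": "Oct", "11": "Nov", "12": "Dec",
}


def ordinal(day):
    """Attach the English ordinal suffix to a day number (1 -> 1st, 11 -> 11th)."""
    if 11 <= day % 100 <= 13:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(day % 10, "th")
    return f"{day}{suffix}"


def format_date(text):
    """Convert "MM-DD-YYYY" into a form like "Jun 3rd, 2008"."""
    mm, dd, yyyy = text.split("-")
    return f"{MONTH[mm]} {ordinal(int(dd))}, {yyyy}"


print(format_date("06-03-2008"))  # Jun 3rd, 2008
```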
Idea 1: Share sub-expressions
T3
C1 C2 C3
s3 s4 s5
T1
C1 C2 C3
s1 s2 s3
T2
C1 C2 C3
s2 s3 s4
e ≡ Select(C2, T1, C1=v1)   (evaluates to s2)
Select(C3, T2, C1=e)
Select(C2, T3, C1=Select(C2, T2, C1=e))
Youtube Videos
French, Polish, Urdu, German, Serbian, Russian
http://bit.ly/flashfill
Idea 2: CNF conditionals
T
C1 C2 C3 … Cn Cn+1
s s s s t
v1 v2 … vm Out
s s s t
No. of Consistent Expressions
[Figure: Large number of consistent expressions. x-axis: Benchmarks; y-axis: Number of expressions (log scale, up to 1E+036)]
Succinct Representation
[Figure: Succinct Representation. x-axis: Benchmarks; y-axis: Size of Data Structure (up to 2,000)]
Performance
[Figure: Running Time. x-axis: Benchmarks; y-axis: Running Time (in seconds, up to 12.00)]
Ranking
[Figure: Ranking Measure. x-axis: Number of I/O examples (1 to 3); y-axis: Number of Benchmarks (up to 40)]
Idea 2: CNF conditionals
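$\{\{\eta_1, \eta_2\}, \eta_2, Progs\}$
$Progs[\eta_1] \equiv \{v_1, v_2, \dots, v_m\}$
$Progs[\eta_2] = \{ Select(C_{n+1}, T, \bigwedge_i C_i = \{s, \eta_1\}) \}$
$m+1$ alternatives per column predicate, $\Theta((m+1)^n)$ expressions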
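The saving from keeping the conditional as a conjunction of independent column predicates can be checked on a small instance. In the sketch below each of the n predicates stores its m+1 alternatives (the constant s or one of the inputs v1..vm) once, while expanding the conjunction enumerates every combination. The function names and the textual encoding are assumptions made for illustration.

```python
from itertools import product


def cnf_predicates(n, m):
    """One entry per column C1..Cn, each listing its m+1 alternatives."""
    options = ["s"] + [f"v{j}" for j in range(1, m + 1)]
    return {f"C{i}": list(options) for i in range(1, n + 1)}


def stored_size(preds):
    """Number of alternatives stored in the CNF form: n * (m + 1)."""
    return sum(len(options) for options in preds.values())


def expand(preds):
    """Enumerate every concrete conjunction represented by the CNF form."""
    cols = list(preds)
    for choice in product(*(preds[c] for c in cols)):
        yield " ∧ ".join(f"{c} = {v}" for c, v in zip(cols, choice))


preds = cnf_predicates(n=3, m=2)
print(stored_size(preds))        # 9
print(len(list(expand(preds))))  # 27
```

For n = 3 and m = 2 the stored form holds 9 alternatives while it stands for 27 = (m+1)^n distinct conditionals.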
GenerateStr_t
val(η): string value of node η
Progs[η]: set of lookup programs to generate val(η)
val⁻¹(s): node η such that val(η) = s
Related Work
• Record Matching
  – Similarity functions for matching [Elmagarmid et al. 07, Koudas et al. SIGMOD06]
  – Customizable similarity function [Arasu et al. VLDB09]
• Learning Schema Matches
  – iMAP [Dhamankar et al. SIGMOD04]: concatenation of column strings using domain-specific knowledge
  – [Warren & Tompa VLDB06]: concatenation of column substrings, single table
Related Work
• Query Synthesis [Das Sarma et al. ICDT10, Tran et al. SIGMOD09]
  – Infer relation from large representative example view
  – no joins or projections
• Text-editing using examples
  – QuickCode [Gulwani POPL11]: string transformations
  – SMARTedit [Lau et al. ML03], Simultaneous Editing [Miller et al. USENIX01]: programming by demonstration
General Framework
• A Domain-specific Transformation Language L
  – Expressive and succinct
• Efficient Data structures for set of expressions
  – Version-space algebra
• GenerateStr
  – All sets of expressions from I-O example
• Intersect
  – Intersect two sets of expressions
Emp Record
SSN          EmpId  Name
027-36-4557  1254   John Henry
034-83-7683  2412   William Johnson
044-58-3429  1125   Steve Russell
018-45-8949  4257   Ian Jordan
023-34-3254  6418   Mary Dina
Input v1     Output
044-58-3429  Steve Russell
023-34-3254
Select(Name, EmpRecord, (SSN = v1))
Example - Lookup
ItemRec
ItemId  Item
ST-340  Stroller
BI-567  Bib
DI-328  Diapers
WI-989  Wipes
AS-469  Aspirator
PriceRec
ItemId  Price
ST-340  $145.67
BI-567  $3.56
DI-328  $21.45
WI-989  $5.12
AS-469  $2.56
Input v1   Output
Stroller   $145.67
Bib
Aspirator
Wipes
Select(Price, PriceRec, (ItemId = Select(ItemId, ItemRec, (Item = v1))))
Example – Transitive Lookups
Data Structure
Data structure for expressions
T1
C1 C2 C3
s1 s2 s3
T2
C1 C2 C3
s2 s3 s4
Ti
C1 C2 C3
si si+1 si+2
Example
…
Tm
Input v1 Output
s1 sm
Ti-1
C1 C2 C3
si-1 si si+1
Ti-2
C1 C2 C3
si-2 si-1 si
Sub-expression Sharing
Strings: $s_{i-2}$, $s_{i-1}$, $s_i$
Nodes: $\eta_{i-2}$, $\eta_{i-1}$, $\eta_i$
$\{\{\eta_1, \eta_2, \dots, \eta_m\}, \eta_m, Progs\}$
$Progs[\eta_1] \equiv \{v_1\}$
$Progs[\eta_2] = \{ Select(C_2, T_1, C_1 = \{s_1, \eta_1\}) \}$
$N(i) = N(i-1) + N(i-2)$
$N(i) = \Theta(2^i)$
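The effect of sharing can be illustrated by storing, for each node, only its one-step Select entries with a reference to the node they index on, and counting the represented expressions by recursion. The sketch assumes tables Ti with columns (C1, C2, C3) holding (si, si+1, si+2) as in the example, ignores the constant-string alternative in each predicate, and takes N(1) = N(2) = 1 as base cases; under those assumptions the recurrence on its own yields Fibonacci numbers, which grow exponentially in i. The names `build_progs`, `stored_entries` and `count_expressions` are hypothetical.

```python
def build_progs(m):
    """Shared representation: node i -> list of (expression template, child node)."""
    progs = {1: [("v1", None)]}
    for i in range(2, m + 1):
        # s_i sits in column C2 of T_{i-1}, indexed by C1 = s_{i-1}.
        entries = [(f"Select(C2, T{i - 1}, C1 = <node {i - 1}>)", i - 1)]
        if i >= 3:
            # s_i also sits in column C3 of T_{i-2}, indexed by C1 = s_{i-2}.
            entries.append((f"Select(C3, T{i - 2}, C1 = <node {i - 2}>)", i - 2))
        progs[i] = entries
    return progs


def stored_entries(progs):
    """Size of the shared structure: total number of stored entries."""
    return sum(len(entries) for entries in progs.values())


def count_expressions(progs, i, memo=None):
    """Number of fully expanded expressions represented for node i."""
    memo = {} if memo is None else memo
    if i not in memo:
        memo[i] = sum(
            1 if child is None else count_expressions(progs, child, memo)
            for _, child in progs[i]
        )
    return memo[i]


progs = build_progs(30)
print(stored_entries(progs))         # 58
print(count_expressions(progs, 30))  # 832040
```

Thirty nodes need 58 stored entries in this sketch, yet they stand for 832040 distinct nested expressions for the last node.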
Intersect_t: $D_{t_1} \wedge D_{t_2}$
Current State of the Art: Help forums
Observations
• Semantic string transformations
• Input-output examples based interaction
  – New disambiguating inputs
• Add-in with the same interface