Determining the Relative Accuracy of Attributes
Yang Cao 1, 2 Wenfei Fan 1, 2 Wenyuan Yu 1
1University of Edinburgh 2Beihang University
FD: [FN, LN, team, height → date of birth]
CFD: [team = “Chicago Bulls” → arena = “United Center”]
FN      | LN     | height | date of birth | team          | arena
M       | J      | 7ft    | 1963          | Chicago Bulls | United Center
Michael | Jordan | 198cm  | 17/02/1963    | Chicago       | Chicago Stadium

Target: Michael | Jordan | 198cm | 17/02/1963 | Chicago Bulls | United Center
Instance may be consistent, but its values may still be inaccurate
Applications: Data fusion for big data, decision making, information systems, …
Data Accuracy: a central problem that has not been formally studied
Data Accuracy
Find the most accurate values for Jordan within D
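Using Accuracy Rules to capture data semantics

D:
    FN      | LN     | height | unit | date of birth | team          | arena
t1: M       | J      | 7      | ft   | 1963          | Chicago Bulls | United Center
t2: Michael | Jordan | 198    | cm   | 17/02/1963    | Chicago       | Chicago Stadium

Dm:
    FN      | LN     | team          | season
tm: Michael | Jordan | Chicago Bulls | 94-95
    …       | …      | …             | …

Target: Michael | Jordan | 17/02/1963 | Chicago Bulls | United Center

Accuracy Rules
Form (1):
$t_1[\text{FN}, \text{LN}, \text{birth}] < t_2[\text{FN}, \text{LN}, \text{birth}] \rightarrow t_1 \preceq_{\text{FN}, \text{LN}, \text{birth}} t_2$
$t_1 \prec_{\text{unit}} t_2 \rightarrow t_1 \preceq_{\text{height}} t_2$
$t_1 \prec_{\text{team}} t_2 \rightarrow t_1 \preceq_{\text{arena}} t_2$
Form (2):
$t_m[\text{FN}] = t_e[\text{FN}] \wedge t_m[\text{LN}] = t_e[\text{LN}] \rightarrow t_e[\text{team}] = t_m[\text{team}]$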
Inferring Relative Accuracy with ARs
A chase-like procedure with ARs to deduce relative accuracy
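Σ:
φ1: $t_1[\text{FN}, \text{LN}, \text{birth}] < t_2[\text{FN}, \text{LN}, \text{birth}] \rightarrow t_1 \preceq_{\text{FN}, \text{LN}, \text{birth}} t_2$
φ2: $t_1 \prec_{\text{unit}} t_2 \rightarrow t_1 \preceq_{\text{height}} t_2$
φ3: $t_1 \prec_{\text{team}} t_2 \rightarrow t_1 \preceq_{\text{arena}} t_2$
φ4: $t_m[\text{FN}] = t_e[\text{FN}] \wedge t_m[\text{LN}] = t_e[\text{LN}] \rightarrow t_e[\text{team}] = t_m[\text{team}]$

D0: $t_e[A_i] = \_$ for every attribute $A_i$
D1 (apply φ1 to D0): $t_1 \preceq_{\text{FN}, \text{LN}, \text{birth}} t_2$; $t_e[\text{FN}]$ = "Michael", $t_e[\text{LN}]$ = "Jordan", $t_e[\text{birth}]$ = "17 / .."
D2 (apply φ4 to D1): adds $t_e[\text{team}]$ = "Chi.. B.." and $t_2 \preceq_{\text{team}} t_1$
D3 (apply φ3 to D2): adds $t_2 \preceq_{\text{arena}} t_1$ and $t_e[\text{arena}]$ = "Unit.. C.."

Deduced te: Michael | Jordan | 17/02/1963 | Chicago Bulls | United Center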
A chasing sequence
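The chasing sequence above can be imitated with a small fixpoint loop. The following is a minimal sketch, not the authors' procedure: the function names, the rule encoding, and the reading of "<" as "shorter string" are assumptions made for illustration, and conflicting deductions are simply ignored.

```python
from typing import Dict, Set, Tuple

Row = Dict[str, str]
Order = Set[Tuple[str, int, int]]  # (A, i, j) encodes rows[i] ≼_A rows[j]


def strictly_below(rows, order, attr, i, j) -> bool:
    """rows[i] ≺_attr rows[j]: ordered on attr and the two values differ."""
    return (attr, i, j) in order and rows[i][attr] != rows[j][attr]


def phi1(rows, order, i, j):
    # ASSUMPTION: "<" is read as "shorter string" (an abbreviated value).
    attrs = ("FN", "LN", "birth")
    return attrs if all(len(rows[i][a]) < len(rows[j][a]) for a in attrs) else ()


def phi2(rows, order, i, j):
    return ("height",) if strictly_below(rows, order, "unit", i, j) else ()


def phi3(rows, order, i, j):
    return ("arena",) if strictly_below(rows, order, "team", i, j) else ()


def phi4(tm, target):
    # Form (2): copy team from master data once FN and LN of te match.
    if target.get("FN") == tm["FN"] and target.get("LN") == tm["LN"]:
        return {"team": tm["team"]}
    return {}


def chase(rows, master, pair_rules, master_rules):
    order: Order = set()
    target: Row = {}
    n = len(rows)
    changed = True
    while changed:  # repeat until no rule adds a new fact
        changed = False
        for rule in pair_rules:  # form (1) rules over pairs of tuples
            for i in range(n):
                for j in range(n):
                    if i == j:
                        continue
                    for attr in rule(rows, order, i, j):
                        if (attr, i, j) not in order:
                            order.add((attr, i, j))
                            changed = True
        for rule in master_rules:  # form (2) rules over master data
            for tm in master:
                for attr, value in rule(tm, target).items():
                    if attr not in target:  # conflicts are ignored in this sketch
                        target[attr] = value
                        changed = True
        for attr in rows[0]:
            for j in range(n):
                # A tuple that dominates all others on attr supplies te[attr].
                if attr not in target and all(
                    (attr, i, j) in order for i in range(n) if i != j
                ):
                    target[attr] = rows[j][attr]
                    changed = True
                # A tuple that agrees with te[attr] dominates those that do not.
                if rows[j][attr] == target.get(attr):
                    for i in range(n):
                        if rows[i][attr] != rows[j][attr] and (attr, i, j) not in order:
                            order.add((attr, i, j))
                            changed = True
    return target, order


t1 = {"FN": "M.", "LN": "J.", "height": "7", "unit": "ft",
      "birth": "1963", "team": "Chicago Bulls", "arena": "United Center"}
t2 = {"FN": "Michael", "LN": "Jordan", "height": "198", "unit": "cm",
      "birth": "17/02/1963", "team": "Chicago", "arena": "Chicago Stadium"}
tm = {"FN": "Michael", "LN": "Jordan", "team": "Chicago Bulls", "season": "94-95"}

target, order = chase([t1, t2], [tm], [phi1, phi2, phi3], [phi4])
print(target)
```

Under these assumptions the loop deduces FN, LN, birth, team and arena for te and leaves height and unit undetermined. Facts are only ever added, so the loop stops once a full pass adds nothing.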
Termination Problem: Every chasing sequence always terminates.
Is the chasing sequence (φ1 φ4 φ3 .. φj ..) always finite? YES
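With a further rule φ5, a different chasing sequence is possible:
apply φ5: $t_1 \preceq_{\text{team}} t_2$, $t_e[\text{team}]$ = "Chicago"
next step: $t_1 \preceq_{\text{arena}} t_2$, $t_e[\text{arena}]$ = "Ch.. St.."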
Do different chasing sequences always coincide? NOT ALWAYS
Church-Rosser Problem: The Church-Rosser property is not guaranteed, but it can be checked in cubic time.
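To make the property concrete, the sketch below explores every order of rule application on a toy encoding and reports whether all terminal results coincide. It is an exhaustive search for illustration only, not the cubic-time check mentioned above; the fact and rule encoding is an assumption, and the premise given to φ5 is invented because its formula is not shown.

```python
from typing import FrozenSet, List, Set, Tuple

Fact = Tuple[str, str]               # (attribute, value) deduced for te
Rule = Tuple[FrozenSet[Fact], Fact]  # premise facts -> concluded fact


def applicable(state: FrozenSet[Fact], rule: Rule) -> bool:
    """A step is valid if the premise holds and the attribute is still unset."""
    premise, (attr, _value) = rule
    return premise <= state and attr not in dict(state)


def terminal_states(start: FrozenSet[Fact], rules: List[Rule]) -> Set[FrozenSet[Fact]]:
    """Collect every state reachable from start in which no rule applies."""
    results, stack, seen = set(), [start], {start}
    while stack:
        state = stack.pop()
        steps = [r for r in rules if applicable(state, r)]
        if not steps:
            results.add(state)
        for _premise, fact in steps:
            nxt = state | {fact}
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return results


def is_church_rosser(start: FrozenSet[Fact], rules: List[Rule]) -> bool:
    return len(terminal_states(start, rules)) == 1


start = frozenset({("FN", "Michael"), ("LN", "Jordan")})
phi4 = (start, ("team", "Chicago Bulls"))
phi3 = (frozenset({("team", "Chicago Bulls")}), ("arena", "United Center"))
# HYPOTHETICAL premise for phi5; only its conclusion is visible on the slide.
phi5 = (frozenset({("FN", "Michael")}), ("team", "Chicago"))

print(is_church_rosser(start, [phi4, phi3]))        # one terminal result
print(is_church_rosser(start, [phi4, phi3, phi5]))  # two terminal results
```

In this toy setting, adding φ5 lets one sequence end with team "Chicago Bulls" and another with team "Chicago", so the rule set is no longer Church-Rosser.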
Fundamental Problems: Deducing candidate targets
te:   Michael | Jordan | ?   | ?  | 17/02/1963 | Chicago Bulls | United Center
Target tuple may be incomplete: the need to find candidate targets
te'1: Michael | Jordan | 198 | cm | 17/02/1963 | Chicago       | Chicago Stadium
te'2: Michael | Jordan | 7   | ft | 17/02/1963 | Chicago Bulls | United Center
Do candidate targets always exist? NOT ALWAYS
Fundamental Problems: Deducing candidate targets
[Figure: the running example (D with t1 and t2, master data Dm with tm, rules φ1 to φ4) extended with a rule φ6; te with height and unit unknown; candidate targets te'1 and te'2, annotated (φ2, φ6)]
It is NP-complete to determine whether there exist candidate targets
There can be exponentially or even infinitely many candidate targets
Fundamental Problems: Top-k candidate targets
Preference model: (k, p(.)): p(.) is any monotone scoring function (e.g., occurrences)
Top-k candidate targets problem: whether there exists a k-set Te such that p(Te) > C
The Top-k candidate targets problem is NP-complete
te'1: Michael | Jordan | 198 | cm | 17/02/1963 | Chicago       | Chicago Stadium
te'2: Michael | Jordan | 7   | ft | 17/02/1963 | Chicago Bulls | United Center
k = 2, p(Te) = 14
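One way to read the occurrence-based preference model is sketched below. It assumes that a tuple's score is the number of times each of its values occurs in the same column of D, and that p(Te) sums the scores of the tuples in Te; both choices, and all function names, are assumptions rather than the definition used by the authors.

```python
from collections import Counter
from typing import Dict, List, Sequence

ATTRS = ["FN", "LN", "height", "unit", "birth", "team", "arena"]


def as_row(values: Sequence[str]) -> Dict[str, str]:
    return dict(zip(ATTRS, values))


def occurrence_counts(rows: List[Dict[str, str]]) -> Dict[str, Counter]:
    """Per attribute, count how often each value occurs in D."""
    return {a: Counter(r[a] for r in rows) for a in ATTRS}


def tuple_score(t: Dict[str, str], counts: Dict[str, Counter]) -> int:
    # Values that never occur in D contribute 0 (Counter default).
    return sum(counts[a][t[a]] for a in ATTRS)


def top_k(candidates, k: int, counts):
    """Keep the k highest-scored candidates (ties keep input order)."""
    return sorted(candidates, key=lambda t: tuple_score(t, counts), reverse=True)[:k]


def set_score(tuples, counts) -> int:
    return sum(tuple_score(t, counts) for t in tuples)


D = [
    as_row(("M.", "J.", "7", "ft", "1963", "Chicago Bulls", "United Center")),
    as_row(("Michael", "Jordan", "198", "cm", "17/02/1963", "Chicago", "Chicago Stadium")),
]
candidates = [
    as_row(("Michael", "Jordan", "198", "cm", "17/02/1963", "Chicago", "Chicago Stadium")),
    as_row(("Michael", "Jordan", "7", "ft", "17/02/1963", "Chicago Bulls", "United Center")),
]

counts = occurrence_counts(D)
Te = top_k(candidates, 2, counts)
print(set_score(Te, counts))
```

With these assumptions each candidate scores 7 (one occurrence per attribute), so the 2-set scores 14, which agrees with the value shown on the slide.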
A framework for deducing target tuples
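Input: a specification $S = (D_0, \Sigma, I_m, t_e^{D_0})$ and a preference model (k, p(.))
1. Is S Church-Rosser? (IsCR)
   • No: feedback
   • Yes: go to step 2
2. Is a complete te derived?
   • Yes: return te
   • No: go to step 3
3. Compute top-k candidate targets Te w.r.t. the preference model (k, p(.)) (RankJoinCT, TopKCT, TopKCTh)
4. Feedback on Te yields t'e, and the process repeats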
Algorithms
Checking Church-Rosser property
• IsCR: cubic time
Top-k candidate targets
• RankJoinCT: rank join based top-k algorithm
• TopKCT: priority queue based
• TopKCTh: heuristic
TopKCT: Brodal queue based top-k algorithm
Input:
• A Church-Rosser specification S
• Preference model (k, p(.))
• A heap for each attribute A with a null value in te
Output:
• The set Te of top-k scored candidate targets w.r.t. (k, p(.))
D:
    FN      | LN     | height | unit | date of birth | team          | arena
t1: M.      | J.     | 7      | ft   | 1963          | Chicago Bulls | United Center
t2: Michael | Jordan | 198    | cm   | 17/02/1963    | Chicago       | Chicago Stadium
t3: Michael | Jordan | 200    | cm   | 17/02/1963    | Chicago       | Chicago Stadium
t4: Michael | Jordan | 198    | m    | 17/02/1963    | Chicago Bulls | United Center

te: Michael | Jordan | ?      | ?    | 17/02/1963    | Chicago Bulls | United Center

Hheight: 7, 198, 200
Hunit: ft, cm, m
Top-1: te[height, unit] = (198, cm)
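The heap-per-attribute idea can be sketched as a best-first search over value combinations. The code below is an illustration under stated assumptions, not TopKCT itself: it uses Python's binary heap (`heapq`) instead of a Brodal queue, scores a value by its number of occurrences in D, adds the per-attribute scores, and does not check combinations against the accuracy rules. All function names are hypothetical.

```python
import heapq
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def ranked(values: Sequence[str]) -> List[Tuple[str, int]]:
    """Distinct values with occurrence counts, best first (ties by value)."""
    return sorted(Counter(values).items(), key=lambda kv: (-kv[1], kv[0]))


def top_k_combinations(columns: Dict[str, Sequence[str]], k: int):
    """Return up to k (assignment, score) pairs in non-increasing score order."""
    attrs = list(columns)
    lists = [ranked(columns[a]) for a in attrs]

    def score(idx: Tuple[int, ...]) -> int:
        return sum(lists[d][i][1] for d, i in enumerate(idx))

    start = tuple(0 for _ in attrs)      # best value of every attribute
    frontier = [(-score(start), start)]  # min-heap on negated score
    seen = {start}
    out = []
    while frontier and len(out) < k:     # early termination after k results
        neg, idx = heapq.heappop(frontier)
        assignment = {a: lists[d][i][0] for d, (a, i) in enumerate(zip(attrs, idx))}
        out.append((assignment, -neg))
        for d in range(len(idx)):        # move one attribute to its next value
            nxt = idx[:d] + (idx[d] + 1,) + idx[d + 1:]
            if nxt[d] < len(lists[d]) and nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-score(nxt), nxt))
    return out


# height and unit columns of t1..t4 from the example above
columns = {
    "height": ["7", "198", "200", "198"],
    "unit": ["ft", "cm", "cm", "m"],
}
results = top_k_combinations(columns, 3)
print(results[0])
```

Because each per-attribute list is sorted by score, every successor of a popped combination scores no higher than it, so the first k pops are the k best combinations. On the example data the first result is height 198 with unit cm, matching the Top-1 shown on the slide.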
Early termination: Stops as soon as top-k candidate targets are found.
Instance Optimal: w.r.t. the number of visits of each heap with optimality ratio 1.
TopKCT has early termination property and is Instance Optimal.
An algorithm A is said to be instance optimal if there exist constants c1 and c2 such that
$\mathrm{cost}(A, I) \le c_1 \cdot \mathrm{cost}(A', I) + c_2$
for all instances I and all algorithms A' in the same setting as A. The constant c1 is the optimality ratio.
Experimental Study: Settings
Datasets
o Med: sale records of medicines from various stores
   o 10K tuples for 2.7K entries; 2.4K tuples as master data; 105 ARs
o CFP: call for papers/participation found by Google
   o 503 tuples for 100 entries; 55 tuples as master data; 43 ARs
o Rest: Restaurant data*
   o 246 tuples 5149 entries; 131 ARs
o Syn: Synthetic data generator
   o 20 attributes; ARs: 75% of form (1), 25% of form (2)
Implementation
o 64 bit Linux Amazon EC2 High-CPU Extra Large Instance
o 7GB of memory and 20 EC2 Compute Units
Experimental Study: IsCR
Effectiveness of IsCR
• Complete target tuples: could be deduced for over 2/3 of the entries without user interaction
• Non-null values: over 70% when ARs of both form (1) and form (2) are used
Experimental Study: candidate targets
Computing top-k candidates
• k doesn’t have to be large: k=15 suffices for over 85% of the entries
• Master data does help, but even when it is not available, TopKCT still works well
Experimental Study: user interaction
User interactions
• Few rounds of interaction are needed to deduce the targets for all the entries:
   • at most 3 for Med and 4 for CFP
Experimental Study: efficiency
Efficiency
• For Syn with ||Ie|| = 1500, ||Im|| = 300 and =50, TopKCTh, TopKCT and RankJoinCT took 159ms, 271ms and 1983 ms, respectively.
Conclusion
Summary
• A model for determining relative accuracy
• Fundamental problems
• A framework for deducing relative accuracy
• Algorithms underlying the framework
Outlook
1. Discovery of ARs
2. Improve the accuracy of data in a database