Determining the Relative Accuracy of Attributes
Yang Cao (1, 2), Wenfei Fan (1, 2), Wenyuan Yu (1)
(1) University of Edinburgh, (2) Beihang University

Data Accuracy

FD: [FN, LN, team, height → date of birth]
CFD: [team = “Chicago Bulls” → arena = “United Center”]

| FN | LN | height | date of birth | team | arena |
|---|---|---|---|---|---|
| M | J | 7ft | 1963 | Chicago Bulls | United Center |
| Michael | Jordan | 198cm | 17/02/1963 | Chicago | Chicago Stadium |

Instance may be consistent, but its values may still be inaccurate.

Find the most accurate values for Jordan within D:

| FN | LN | height | date of birth | team | arena |
|---|---|---|---|---|---|
| Michael | Jordan | 198cm | 17/02/1963 | Chicago Bulls | United Center |

Data Accuracy: a central problem that has not been formally studied
Applications: Data fusion for big data, decision making, information systems, …

Accuracy Rules

Using Accuracy Rules to capture data semantics

D:

| | FN | LN | height | unit | date of birth | team | arena |
|---|---|---|---|---|---|---|---|
| t1 | M | J | 7 | ft | 1963 | Chicago Bulls | United Center |
| t2 | Michael | Jordan | 198 | cm | 17/02/1963 | Chicago | Chicago Stadium |

Master data:

| | FN | LN | team | season |
|---|---|---|---|---|
| tm | Michael | Jordan | Chicago Bulls | 94-95 |
| … | … | … | … | … |

Form (1):
• $t_1[\text{FN}, \text{LN}, \text{birth}] < t_2[\text{FN}, \text{LN}, \text{birth}] \rightarrow t_1 \preceq_{\text{FN},\text{LN},\text{birth}} t_2$
• $t_1 \prec_{\text{unit}} t_2 \rightarrow t_1 \preceq_{\text{height}} t_2$
• $t_1 \prec_{\text{team}} t_2 \rightarrow t_1 \preceq_{\text{arena}} t_2$

Form (2):
• $t_m[\text{FN}] = t_e[\text{FN}] \wedge t_m[\text{LN}] = t_e[\text{LN}] \rightarrow t_e[\text{team}] = t_m[\text{team}]$

Target values: Michael Jordan, 17/02/1963, Chicago Bulls, United Center
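
Inferring Relative Accuracy with ARs

A chase-like procedure with ARs to deduce relative accuracy

D:

| | FN | LN | height | unit | date of birth | team | arena |
|---|---|---|---|---|---|---|---|
| t1 | M. | J. | 7 | ft | 1963 | Chicago Bulls | United Center |
| t2 | Michael | Jordan | 198 | cm | 17/02/1963 | Chicago | Chicago Stadium |

Dm:

| | FN | LN | team | season |
|---|---|---|---|---|
| tm | Michael | Jordan | Chicago Bulls | 94-95 |

ARs:
• φ1: $t_1[\text{FN}, \text{LN}, \text{birth}] < t_2[\text{FN}, \text{LN}, \text{birth}] \rightarrow t_1 \preceq_{\text{FN},\text{LN},\text{birth}} t_2$
• φ2: $t_1 \prec_{\text{unit}} t_2 \rightarrow t_1 \preceq_{\text{height}} t_2$
• φ3: $t_1 \prec_{\text{team}} t_2 \rightarrow t_1 \preceq_{\text{arena}} t_2$
• φ4: $t_m[\text{FN}] = t_e[\text{FN}] \wedge t_m[\text{LN}] = t_e[\text{LN}] \rightarrow t_e[\text{team}] = t_m[\text{team}]$

A chasing sequence:
• $D_0$: $t_e^{D_0}[A_i] = \_$ for each attribute $A_i$
• $D_1$ (by φ1): $t_1 \preceq_{\text{FN},\text{LN},\text{birth}} t_2$; $t_e^{D_0}[\text{FN}] =$ "Michael", $t_e^{D_0}[\text{LN}] =$ "Jordan", $t_e^{D_0}[\text{birth}] =$ "17/02/1963"
• $D_2$ (by φ4): in addition, $t_e^{D_0}[\text{team}] =$ "Chicago Bulls" and $t_2 \preceq_{\text{team}} t_1$
• $D_3$ (by φ3): in addition, $t_2 \preceq_{\text{arena}} t_1$ and $t_e^{D_0}[\text{arena}] =$ "United Center"

Deduced target values: Michael Jordan, 17/02/1963, Chicago Bulls, United Center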
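
The chasing sequence above can be mimicked with a small fixpoint loop. The following is only a sketch under stated assumptions, not the authors' procedure: the helper names (`less_detailed`, `strict`, `apply_fact`, `chase`) are hypothetical, the slide's "<" is approximated by string length, and a step that would contradict an earlier one is simply skipped.

```python
# Entity instance D, and one master tuple tm (values from the slide).
T = {
    "t1": {"FN": "M.", "LN": "J.", "height": "7", "unit": "ft",
           "birth": "1963", "team": "Chicago Bulls", "arena": "United Center"},
    "t2": {"FN": "Michael", "LN": "Jordan", "height": "198", "unit": "cm",
           "birth": "17/02/1963", "team": "Chicago", "arena": "Chicago Stadium"},
}
tm = {"FN": "Michael", "LN": "Jordan", "team": "Chicago Bulls", "season": "94-95"}

def less_detailed(a, b):
    # Assumption: "<" compares how detailed two values are; length is a crude proxy.
    return len(a) < len(b)

def strict(order, attr, lo, hi):
    # lo is strictly less accurate than hi on attr.
    pairs = order.get(attr, set())
    return (lo, hi) in pairs and (hi, lo) not in pairs

def phi1(T, tm, order, te):
    attrs = ("FN", "LN", "birth")
    for a in T:
        for b in T:
            if a != b and all(less_detailed(T[a][x], T[b][x]) for x in attrs):
                for x in attrs:
                    yield ("ord", x, a, b)

def ordered_implies(src, dst):
    # Builds a form (1) rule: a strict order on src implies an order on dst.
    def rule(T, tm, order, te):
        for a in T:
            for b in T:
                if strict(order, src, a, b):
                    yield ("ord", dst, a, b)
    return rule

phi2 = ordered_implies("unit", "height")
phi3 = ordered_implies("team", "arena")

def phi4(T, tm, order, te):
    # Form (2): matching master data fixes the target's team.
    if te.get("FN") == tm["FN"] and te.get("LN") == tm["LN"]:
        yield ("val", "team", tm["team"])

def apply_fact(fact, T, order, te):
    """Apply one fact if it is new and conflict-free; return True if state changed."""
    kind, attr, *rest = fact
    pairs = order.setdefault(attr, set())
    if kind == "val":
        (value,) = rest
        if attr in te:  # already fixed; a different value would conflict
            return False
        te[attr] = value
        # Tuples disagreeing with the target value are less accurate on attr.
        for lo in T:
            for hi in T:
                if T[lo][attr] != value and T[hi][attr] == value and (hi, lo) not in pairs:
                    pairs.add((lo, hi))
        return True
    lo, hi = rest
    if (lo, hi) in pairs or (hi, lo) in pairs:  # known, or contradicts an earlier step
        return False
    pairs.add((lo, hi))
    # A tuple above all others on attr supplies the target value.
    if attr not in te and all((o, hi) in pairs for o in T if o != hi):
        te[attr] = T[hi][attr]
    return True

def chase(T, tm, rules):
    order, te = {}, {}
    changed = True
    while changed:
        changed = False
        for rule in rules:
            for fact in list(rule(T, tm, order, te)):
                changed = apply_fact(fact, T, order, te) or changed
    return order, te

order, te = chase(T, tm, [phi1, phi2, phi3, phi4])
print(te)
```

On this instance the loop fills FN, LN, birth, team and arena and leaves height and unit unset, since φ2 never fires without an order on unit.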

Termination Problem: every chasing sequence terminates.

Is every chasing sequence (φ1, φ4, φ3, …, φj, …) finite? YES

Church-Rosser Problem: the Church-Rosser property is not guaranteed, but it can be checked in cubic time.

Do different chasing sequences coincide? NOT ALWAYS

With an additional AR φ5, a chasing sequence can instead deduce:
• by φ5: $t_1 \preceq_{\text{team}} t_2$ and $t_e^{D_0}[\text{team}] =$ "Chicago"
• then: $t_1 \preceq_{\text{arena}} t_2$ and $t_e^{D_0}[\text{arena}] =$ "Chicago Stadium"

The sequence φ1, φ4, φ3 deduces team "Chicago Bulls" and arena "United Center" instead.
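
To see why the two sequences disagree, one can run a toy chase under every priority order of the rules and compare the outcomes. This is a brute-force illustration only, not the cubic-time IsCR check: `phi5` is a hypothetical stand-in for the slide's φ5 (whose definition is not in the transcript), the rules are collapsed to functions over the target tuple, and agreement across priority orders is merely evidence, not proof, of the Church-Rosser property.

```python
from itertools import permutations

# Arena recorded with each team value in D (t1 and t2 on the slide).
ARENA_OF = {"Chicago Bulls": "United Center", "Chicago": "Chicago Stadium"}

def phi3(te):
    # The tuple preferred on team is also preferred on arena.
    if "team" in te:
        return ("arena", ARENA_OF[te["team"]])
    return None

def phi4(te):
    # Master data: matching FN and LN fixes the team.
    if te.get("FN") == "Michael" and te.get("LN") == "Jordan":
        return ("team", "Chicago Bulls")
    return None

def phi5(te):
    # Hypothetical rule that prefers t2 on team.
    return ("team", "Chicago")

def chase(rules, start):
    """Repeatedly apply the first rule (in priority order) that adds a new value."""
    te = dict(start)
    progress = True
    while progress:
        progress = False
        for rule in rules:
            fact = rule(te)
            if fact is not None and fact[0] not in te:  # skip conflicting steps
                te[fact[0]] = fact[1]
                progress = True
                break
    return te

def is_church_rosser(rules, start):
    outcomes = {frozenset(chase(order, start).items())
                for order in permutations(rules)}
    return len(outcomes) == 1

START = {"FN": "Michael", "LN": "Jordan"}
print(is_church_rosser([phi3, phi4], START))
print(is_church_rosser([phi3, phi4, phi5], START))
```

The enumeration is factorial in the number of rules, so it is only useful for tiny rule sets such as this one.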

Fundamental Problems: Deducing candidate targets

Target tuple may be incomplete: the need to find candidate targets

| | FN | LN | height | unit | date of birth | team | arena |
|---|---|---|---|---|---|---|---|
| te | Michael | Jordan | ? | ? | 17/02/1963 | Chicago Bulls | United Center |
| te'1 | Michael | Jordan | 198 | cm | 17/02/1963 | Chicago | Chicago Stadium |
| te'2 | Michael | Jordan | 7 | ft | 17/02/1963 | Chicago Bulls | United Center |

Do candidate targets always exist? NOT ALWAYS

Fundamental Problems: Deducing candidate targets

Adding an AR φ6 to Σ: candidate target te'2 is obtained with (φ2, φ6).

• It is NP-complete to determine whether there exist candidate targets.
• There can be exponentially or even infinitely many candidate targets.

Fundamental Problems: Top-k candidate targets

• Preference model (k, p(.)): p(.) is any monotone scoring function (e.g., occurrences).
• Top-k candidate targets problem: decide whether there exists a k-set Te of candidate targets such that p(Te) > C.
• The top-k candidate targets problem is NP-complete.

Example: for Te = {te'1, te'2}, k = 2 and p(Te) = 14.
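
The decision problem can be illustrated by brute force on a tiny instance. The sketch below makes several assumptions that are not in the slides: the (height, unit) pairs are those of t1 to t4 shown with TopKCT, every combination of observed values is treated as a valid candidate (the real definition also requires consistency with the ARs), p scores a candidate by summing how often each of its values occurs, and p(Te) is taken as the sum over the set. It does not attempt to reproduce the slide's p(Te) = 14.

```python
import heapq
from collections import Counter
from itertools import product

# (height, unit) of t1..t4; the target tuple is null on exactly these attributes.
ROWS = [("7", "ft"), ("198", "cm"), ("200", "cm"), ("198", "m")]

def scored_candidates(rows):
    """Yield (score, candidate) for every combination of observed values."""
    counts = [Counter(col) for col in zip(*rows)]
    for cand in product(*(sorted(c) for c in counts)):
        # Occurrence-based score: monotone in each attribute's count.
        yield sum(counts[i][v] for i, v in enumerate(cand)), cand

def top_k(rows, k):
    return heapq.nlargest(k, scored_candidates(rows))

def exists_k_set(rows, k, C):
    """Is there a set of k candidates whose total score exceeds C?"""
    best = top_k(rows, k)
    return len(best) == k and sum(score for score, _ in best) > C

print(top_k(ROWS, 1))
print(exists_k_set(ROWS, 2, 6))
```

Enumerating the full product is exponential in the number of null attributes, which is consistent with the hardness result stated above and motivates the heap-based algorithms that follow.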

A framework for deducing target tuples

Input: a specification $S = (D_0, \Sigma, I_m, t_e^{D_0})$ and a preference model (k, p(.))
1. Is S Church-Rosser? (IsCR) If no, feedback is needed; if yes, continue.
2. Is a complete te derived? If yes, return te.
3. If no, compute the top-k candidate targets Te under the preference model (k, p(.)) (RankJoinCT, TopKCT, TopKCTh).
4. Feedback on Te yields t'e.

Algorithms

Checking Church-Rosser property
• IsCR
• (2+))

Top-k candidate targets
• RankJoinCT: rank join based top-k algorithm
• TopKCT: priority queue based
• TopKCTh: heuristic

TopKCT: Brodal Queue based Top-k algorithm

Input:
• A Church-Rosser specification S
• Preference model (k, p(.))
• A heap for each attribute A with a null value in te

Output:
• The set Te of top-k scored candidate targets w.r.t. (k, p(.))

D:

| | FN | LN | height | unit | date of birth | team | arena |
|---|---|---|---|---|---|---|---|
| t1 | M. | J. | 7 | ft | 1963 | Chicago Bulls | United Center |
| t2 | Michael | Jordan | 198 | cm | 17/02/1963 | Chicago | Chicago Stadium |
| t3 | Michael | Jordan | 200 | cm | 17/02/1963 | Chicago | Chicago Stadium |
| t4 | Michael | Jordan | 198 | m | 17/02/1963 | Chicago Bulls | United Center |
| te | Michael | Jordan | ? | ? | 17/02/1963 | Chicago Bulls | United Center |

Heaps:
• Hheight: 7, 198, 200
• Hunit: ft, cm, m

Top-1: te[height, unit] = (198, cm)
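
The idea of one ranked structure per null attribute plus a priority queue over combinations can be sketched as a best-first search. This is an illustration under assumptions, not TopKCT itself: it uses Python's binary heap (`heapq`) rather than a Brodal queue, scores values by occurrence count, treats every combination as a valid candidate, and the function name `top_k_candidates` is hypothetical.

```python
import heapq
from collections import Counter

# Observed values of the attributes on which te is null (t1..t4 on the slide).
COLUMNS = {
    "height": ["7", "198", "200", "198"],
    "unit": ["ft", "cm", "cm", "m"],
}

def top_k_candidates(columns, k):
    """Return up to k (score, assignment) pairs, best first."""
    attrs = sorted(columns)
    # One ranked list per attribute: (occurrences, value), best first.
    ranked = [
        sorted(((n, v) for v, n in Counter(columns[a]).items()),
               key=lambda p: (-p[0], p[1]))
        for a in attrs
    ]

    def score(idx):
        return sum(ranked[i][j][0] for i, j in enumerate(idx))

    start = (0,) * len(attrs)           # best value of every attribute
    frontier = [(-score(start), start)]  # max-heap via negated scores
    seen = {start}
    out = []
    while frontier and len(out) < k:     # early termination after k results
        neg, idx = heapq.heappop(frontier)
        out.append((-neg, {a: ranked[i][j][1]
                           for i, (a, j) in enumerate(zip(attrs, idx))}))
        # Successors: step one attribute down to its next-best value.
        for i in range(len(idx)):
            nxt = idx[:i] + (idx[i] + 1,) + idx[i + 1:]
            if nxt[i] < len(ranked[i]) and nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-score(nxt), nxt))
    return out

print(top_k_candidates(COLUMNS, 1))
```

Because the score never increases when an index moves down its ranked list, the first k combinations popped from the queue are the k best, so the search can stop without visiting the remaining combinations.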

TopKCT: Brodal Queue based Top-k algorithm

TopKCT has the early termination property and is instance optimal.
• Early termination: stops as soon as the top-k candidate targets are found.
• Instance optimal: w.r.t. the number of visits of each heap, with optimality ratio 1.

An algorithm A is said to be instance optimal if there exist constants c1 and c2 such that
$\text{cost}(A, I) \le c_1 \cdot \text{cost}(A', I) + c_2$
for all instances I and all algorithms A' in the same setting as A. The constant c1 is the optimality ratio.

Experimental Study: Settings

Datasets
o Med: sale records of medicines from various stores
  o 10K tuples for 2.7K entries; 2.4K tuples as master data; 105 ARs
o CFP: call for papers/participation found by Google
  o 503 tuples for 100 entries; 55 tuples as master data; 43 ARs
o Rest: Restaurant data*
  o 246 tuples 5149 entries; 131 ARs
o Syn: Synthetic data generator
  o 20 attributes; ARs: 75% of form (1), 25% of form (2)

Implementation
o 64 bit Linux Amazon EC2 High-CPU Extra Large Instance
o 7GB of memory and 20 EC2 Compute Units

Experimental Study: IsCR

Effectiveness of IsCR
• Complete target tuples: Complete target tuples could be deduced for over 2/3 of the entries without user interaction
• Non-null values: over 70% when both ARs of form(1) and (2) are used

Experimental Study: candidate targets

Computing top-k candidates
• k doesn’t have to be large: k=15 suffices for over 85% of the entries.
• Master data does help, but even when it is not available, TopKCT still works well.

Experimental Study: user interaction

User interactions
• Few rounds of interaction are needed to deduce the targets for all the entries:
• at most 3 for Med and 4 for CFP

Experimental Study: efficiency
• Efficiency
• For Syn with ||Ie|| = 1500, ||Im|| = 300 and =50, TopKCTh, TopKCT and RankJoinCT took 159ms, 271ms and 1983ms, respectively.

Conclusion

Summary
• A model for determining relative accuracy
• Fundamental problems
• A framework for deducing relative accuracy
• Algorithms underlying the framework

Outlook
1. Discovery of ARs
2. Improve the accuracy of data in a database