Determining the Relative Accuracy of Attributes


Post on 31-Dec-2015


DESCRIPTION

Yang Cao (1, 2), Wenfei Fan (1, 2), Wenyuan Yu (1). (1) University of Edinburgh, (2) Beihang University. Determining the Relative Accuracy of Attributes. PowerPoint PPT presentation.

TRANSCRIPT


Determining the Relative Accuracy of Attributes

Yang Cao (1, 2), Wenfei Fan (1, 2), Wenyuan Yu (1)
(1) University of Edinburgh, (2) Beihang University


Data Accuracy

FD: [FN, LN, team, height → date of birth]
CFD: [team = Chicago Bulls → arena = United Center]

FN      | LN     | height | date of birth | team          | arena
M       | J      | 7ft    | 1963          | Chicago Bulls | United Center
Michael | Jordan | 198cm  | 17/02/1963    | Chicago       | Chicago Stadium
Michael | Jordan | 198cm  | 17/02/1963    | Chicago Bulls | United Center

An instance may be consistent, but its values may still be inaccurate.
Applications: data fusion for big data, decision making, information systems, ...
Data accuracy: a central problem that has not been formally studied.

Accuracy Rules

Find the most accurate values for Jordan within D.

D:
   | FN      | LN     | height | unit | date of birth | team          | arena
t1 | M       | J      | 7      | ft   | 1963          | Chicago Bulls | United Center
t2 | Michael | Jordan | 198    | cm   | 17/02/1963    | Chicago       | Chicago Stadium

Dm:
   | FN      | LN     | team          | season
tm | Michael | Jordan | Chicago Bulls | 94-95

Values shown for Jordan: Michael, Jordan, 17/02/1963, Chicago Bulls, United Center.
Using accuracy rules (ARs) to capture data semantics.

Inferring Relative Accuracy with ARs

A chase-like procedure with ARs deduces relative accuracy from D (tuples t1, t2) and Dm (tuple tm).
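The point of the Data Accuracy slide, that an instance can satisfy its dependencies and still hold inaccurate values, can be illustrated with a small consistency check. This is only a sketch: the helper names `satisfies_fd` and `satisfies_cfd` are hypothetical, the split of the first tuple's name into "M" and "J" is an assumption, and the FD is read as [FN, LN, team, height] determining date of birth.

```python
from typing import Dict, List

Row = Dict[str, str]

# The first two tuples from the slide; both describe the same player.
ROWS: List[Row] = [
    {"FN": "M", "LN": "J", "height": "7ft", "date of birth": "1963",
     "team": "Chicago Bulls", "arena": "United Center"},
    {"FN": "Michael", "LN": "Jordan", "height": "198cm",
     "date of birth": "17/02/1963", "team": "Chicago",
     "arena": "Chicago Stadium"},
]


def satisfies_fd(rows: List[Row], lhs: List[str], rhs: List[str]) -> bool:
    """True if rows that agree on every lhs attribute also agree on rhs."""
    seen: Dict[tuple, tuple] = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # setdefault stores the first rhs value seen for this lhs key.
        if seen.setdefault(key, val) != val:
            return False
    return True


def satisfies_cfd(rows: List[Row], condition: Row, consequence: Row) -> bool:
    """True if every row matching the condition pattern has the consequence values."""
    for row in rows:
        if all(row[a] == v for a, v in condition.items()):
            if any(row[a] != v for a, v in consequence.items()):
                return False
    return True


fd_ok = satisfies_fd(ROWS, ["FN", "LN", "team", "height"], ["date of birth"])
cfd_ok = satisfies_cfd(ROWS, {"team": "Chicago Bulls"}, {"arena": "United Center"})
print(fd_ok, cfd_ok)
```

Both checks pass, yet the two tuples disagree on the height, the date of birth, and the arena, so consistency alone does not say which values are accurate.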

A chasing sequence

Target tuple deduced by chasing D and Dm:

   | FN      | LN     | date of birth | team          | arena
te | Michael | Jordan | 17/02/1963    | Chicago Bulls | United Center

Is every chasing sequence finite? Yes.
Termination Problem: every chasing sequence terminates.
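A toy version of such a chase can make the termination argument concrete. The sketch below is not the procedure from the slides: the three rules are hypothetical stand-ins for ARs (the slide text does not preserve the actual rules), and the target template is assumed to start with the name already fixed. Each rule proposes values for the target tuple, and the loop applies rules until nothing changes.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[str]]
Rule = Callable[[Row, List[Row], List[Row]], Dict[str, str]]

# D and Dm as shown on the slides (t1's name is written "M." "J." here).
D: List[Row] = [
    {"FN": "M.", "LN": "J.", "height": "7", "unit": "ft",
     "date of birth": "1963", "team": "Chicago Bulls", "arena": "United Center"},
    {"FN": "Michael", "LN": "Jordan", "height": "198", "unit": "cm",
     "date of birth": "17/02/1963", "team": "Chicago", "arena": "Chicago Stadium"},
]
DM: List[Row] = [
    {"FN": "Michael", "LN": "Jordan", "team": "Chicago Bulls", "season": "94-95"},
]


def rule_team_from_dm(te: Row, d: List[Row], dm: List[Row]) -> Dict[str, str]:
    """Hypothetical rule: take the team from a Dm tuple with the same name."""
    for tm in dm:
        if tm["FN"] == te["FN"] and tm["LN"] == te["LN"]:
            return {"team": tm["team"]}
    return {}


def rule_arena(te: Row, d: List[Row], dm: List[Row]) -> Dict[str, str]:
    """Hypothetical rule mirroring the CFD: Chicago Bulls implies United Center."""
    return {"arena": "United Center"} if te["team"] == "Chicago Bulls" else {}


def rule_full_date(te: Row, d: List[Row], dm: List[Row]) -> Dict[str, str]:
    """Hypothetical rule: a full dd/mm/yyyy date is more accurate than a bare year."""
    full = [t["date of birth"] for t in d if t["date of birth"].count("/") == 2]
    return {"date of birth": full[0]} if full else {}


def chase(template: Row, d: List[Row], dm: List[Row], rules: List[Rule]) -> Row:
    """Apply rules until a fixpoint; values are only filled in, never overwritten."""
    te = dict(template)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            for attr, value in rule(te, d, dm).items():
                if te[attr] is None:
                    te[attr] = value
                    changed = True
    return te


TEMPLATE: Row = {"FN": "Michael", "LN": "Jordan", "height": None, "unit": None,
                 "date of birth": None, "team": None, "arena": None}
RULES: List[Rule] = [rule_arena, rule_team_from_dm, rule_full_date]
result = chase(TEMPLATE, D, DM, RULES)
print(result)
```

In this sketch every pass that continues the loop fills at least one empty attribute, and there are finitely many attributes, so the loop must stop. The height and unit stay empty because none of the toy rules decides between the two tuples.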

Do different chasing sequences coincide? Not always.
Church-Rosser Problem: the Church-Rosser property is not guaranteed, but it can be checked in cubic time.

Fundamental Problems: Deducing candidate targets

   | FN      | LN     | height | unit | date of birth | team          | arena
te | Michael | Jordan | ?      | ?    | 17/02/1963    | Chicago Bulls | United Center

The target tuple may be incomplete, hence the need to find candidate targets:

     | FN      | LN     | height | unit | date of birth | team          | arena
te'1 | Michael | Jordan | 198    | cm   | 17/02/1963    | Chicago       | Chicago Stadium
te'2 | Michael | Jordan | 7      | ft   | 17/02/1963    | Chicago Bulls | United Center

Do candidate targets always exist? Not always.
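One way to picture candidate targets is as completions of the incomplete target tuple. The sketch below simply fills each empty attribute with every value that occurs for it in D. It is an illustration under assumptions: the function name `candidate_fillings` is hypothetical, and the step that would check each completion against the ARs is deliberately left out, so some of the generated tuples might not be valid candidate targets.

```python
from itertools import product
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]

# Only the attributes that are still unknown in te matter for this sketch.
D_ROWS: List[Row] = [
    {"height": "7", "unit": "ft"},
    {"height": "198", "unit": "cm"},
]

TE: Row = {"FN": "Michael", "LN": "Jordan", "height": None, "unit": None,
           "date of birth": "17/02/1963", "team": "Chicago Bulls",
           "arena": "United Center"}


def candidate_fillings(te: Row, rows: List[Row]) -> List[Row]:
    """Fill every empty attribute of te with each value seen for it in rows."""
    null_attrs = [a for a, v in te.items() if v is None]
    # One sorted list of distinct values per empty attribute.
    domains = [sorted({row[a] for row in rows}) for a in null_attrs]
    return [{**te, **dict(zip(null_attrs, combo))} for combo in product(*domains)]


candidates = candidate_fillings(TE, D_ROWS)
for c in candidates:
    print(c["height"], c["unit"])
```

With two empty attributes and two values each, the sketch already produces four completions; the count is the product of the per-attribute domain sizes, so it grows quickly as more attributes are left open.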

It is NP-complete to determine whether candidate targets exist.
There can be exponentially or even infinitely many candidate targets.

Fundamental Problems: Top-k candidate targets

Preference model (k, p(.)): p(.) is any monotone scoring function (e.g., occurrences).
Top-k candidate targets problem: does there exist a k-set Te such that p(Te) > C?
The top-k candidate targets problem is NP-complete.
Example: k = 2, p(Te) = 14.

A framework for deducing target tuples

[Flowchart. Decisions: Is S Church-Rosser? Is a complete te derived? Steps: return te; compute top-k candidate targets Te under the preference model (k, p(.)); user feedback (Te, t'e). Algorithms: RankJoinCT, TopKCT, TopKCTh.]

Algorithms

TopKCT: Brodal queue based top-k algorithm

Input: a Church-Rosser specification S; a preference model (k, p(.)); a heap for each attribute A with a null value in te.
Output: the set Te of top-k scored candidate targets w.r.t. (k, p(.)).

D:
   | FN      | LN     | height | unit | date of birth | team          | arena
t1 | M.      | J.     | 7      | ft   | 1963          | Chicago Bulls | United Center
t2 | Michael | Jordan | 198    | cm   | 17/02/1963    | Chicago       | Chicago Stadium
t3 | Michael | Jordan | 200    | cm   | 17/02/1963    | Chicago       | Chicago Stadium
t4 | Michael | Jordan | 198    | m    | 17/02/1963    | Chicago Bulls | United Center
te | Michael | Jordan | ?      | ?    | 17/02/1963    | Chicago Bulls | United Center

Heaps: H_height = {7, 198, 200}; H_unit = {ft, cm, m}
Top-1: te[height, unit] = (198, cm)
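The per-attribute heaps in the example above suggest a best-first search over combinations of values. The following is a minimal sketch of that idea and not the TopKCT algorithm itself: it assumes the scoring function p(.) is the sum of occurrence counts in D, it uses Python's `heapq` binary heap in place of a Brodal queue, it skips the check of each candidate against the ARs, and the function name `top_k_candidates` is hypothetical.

```python
import heapq
from collections import Counter
from typing import Dict, List, Tuple

# Height and unit values of t1..t4 from the example table.
ROWS: List[Dict[str, str]] = [
    {"height": "7", "unit": "ft"},
    {"height": "198", "unit": "cm"},
    {"height": "200", "unit": "cm"},
    {"height": "198", "unit": "m"},
]


def top_k_candidates(rows: List[Dict[str, str]], null_attrs: List[str],
                     k: int) -> List[Tuple[Dict[str, str], int]]:
    """Return up to k value combinations in non-increasing score order."""
    # Rank each attribute's values by occurrence count, best first.
    ranked = []
    for attr in null_attrs:
        counts = Counter(row[attr] for row in rows)
        ranked.append(sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])))

    def score(idx: Tuple[int, ...]) -> int:
        return sum(ranked[i][j][1] for i, j in enumerate(idx))

    # Best-first search: start from the top value of every attribute and
    # expand a popped combination by stepping one attribute to its next value.
    start = tuple(0 for _ in null_attrs)
    frontier = [(-score(start), start)]
    seen = {start}
    results: List[Tuple[Dict[str, str], int]] = []
    while frontier and len(results) < k:  # stop as soon as k are found
        neg_score, idx = heapq.heappop(frontier)
        values = {a: ranked[i][j][0]
                  for i, (a, j) in enumerate(zip(null_attrs, idx))}
        results.append((values, -neg_score))
        for i in range(len(idx)):
            if idx[i] + 1 < len(ranked[i]):
                nxt = idx[:i] + (idx[i] + 1,) + idx[i + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(frontier, (-score(nxt), nxt))
    return results


top2 = top_k_candidates(ROWS, ["height", "unit"], k=2)
print(top2[0])
```

Under the occurrence-count assumption, 198 and cm each appear twice in D, so the first result is the pair (198, cm), which agrees with the Top-1 shown in the example. Because each value list is sorted by score, stepping to a later value never raises the total, which is why the search can stop after k pops.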

TopKCT has the early termination property and is instance optimal.
Early termination: it stops as soon as the top-k candidate targets are found.
Instance optimal: w.r.t. the number of visits of each heap, with optimality ratio 1.

Experimental Study: Settings

Datasets
Med: sale records of medicines from various stores; 10K tuples for 2.7K entries; 2.4K tuples as master data; 105 ARs
CFP: calls for papers/participation found by Google; 503 tuples for 100 entries; 55 tuples as master data; 43 ARs
Rest: restaurant data; 246 tuples, 5149 entries; 131 ARs
Syn: synthetic data generator; 20 attributes; ARs: 75% of form (1), 25% of form (2)

Implementation
64-bit Linux Amazon EC2 High-CPU Extra Large Instance; 7GB of memory and 20 EC2 Compute Units

Experimental Study: IsCR

Effectiveness of IsCR

Complete target tuples: these could be deduced for over 2/3 of the entries without user interaction.
Non-null values: over 70% when ARs of both form (1) and form (2) are used.

Experimental Study: candidate targets

Computing top-k candidates

k doesn't have to be large: k = 15 suffices for over 85% of the entries.
Master data does help, but even when it is not available, TopKCT still works well.

Experimental Study: user interaction

User interactions

Few rounds of interaction are needed to deduce the targets for all the entries: at most 3 for Med and 4 for CFP.

Experimental Study: efficiency

Conclusion

Summary
A model for determining relative accuracy
Fundamental problems
A framework for deducing relative accuracy
Algorithms underlying the framework

Outlook
Discovery of ARs
Improve the accuracy of data in a database
