Transcript
Page 1: Determining the Relative Accuracy of Attributes

Determining the Relative Accuracy of Attributes

Yang Cao (1, 2), Wenfei Fan (1, 2), Wenyuan Yu (1)

(1) University of Edinburgh, (2) Beihang University

Page 2: Determining the Relative Accuracy of Attributes

Page 3: Determining the Relative Accuracy of Attributes

FD: [FN, LN, team, height → date of birth]
CFD: [team = “Chicago Bulls” → arena = “United Center”]

FN       LN      height  date of birth  team           arena
M        J       7ft     1963           Chicago Bulls  United Center
Michael  Jordan  198cm   17/02/1963     Chicago        Chicago Stadium

Michael  Jordan  198cm   17/02/1963     Chicago Bulls  United Center

Instance may be consistent, but its values may still be inaccurate

Applications: Data fusion for big data, decision making, information systems, …

Data Accuracy: a central problem that has not been formally studied

Data Accuracy

Find the most accurate values for Jordan within D

Page 4: Determining the Relative Accuracy of Attributes

D:

      FN       LN      height  unit  date of birth  team           arena
t1    M        J       7       ft    1963           Chicago Bulls  United Center
t2    Michael  Jordan  198     cm    17/02/1963     Chicago        Chicago Stadium

Dm:

      FN       LN      team           season
tm    Michael  Jordan  Chicago Bulls  94-95
      …        …       …              …

te: Michael Jordan 17/02/1963 Chicago Bulls United Center

Using Accuracy Rules to capture data semantics

Form (1):
$t_1[FN, LN, birth] < t_2[FN, LN, birth] \rightarrow t_1 \preceq_{F,L,b} t_2$
$t_1 \prec_{unit} t_2 \rightarrow t_1 \preceq_{height} t_2$
$t_1 \prec_{team} t_2 \rightarrow t_1 \preceq_{arena} t_2$

Form (2):
$t_m[FN] = t_e[FN] \wedge t_m[LN] = t_e[LN] \rightarrow t_e[team] = t_m[team]$

Accuracy Rules

Page 5: Determining the Relative Accuracy of Attributes

Inferring Relative Accuracy with ARs

Dm:

      FN       LN      team           season
tm    Michael  Jordan  Chicago Bulls  94-95

D:

      FN       LN      height  unit  date of birth  team           arena
t1    M.       J.      7       ft    1963           Chicago Bulls  United Center
t2    Michael  Jordan  198     cm    17/02/1963     Chicago        Chicago Stadium

Rules: φ1, φ2, φ3, φ4

A chase-like procedure with ARs to deduce relative accuracy

D0: $t_e[A_i] = \_$ for every attribute $A_i$
D1 (by φ1): $t_1 \preceq_{FN/LN/birth} t_2$; $t_e[FN]$ = "Michael", $t_e[LN]$ = "Jordan", $t_e[birth]$ = "17 / .."
D2 (by φ4): D1 plus $t_e[team]$ = "Chi.. B..", $t_2 \preceq_{team} t_1$
D3 (by φ3): D2 plus $t_2 \preceq_{arena} t_1$, $t_e[arena]$ = "Unit.. C.."

Result: Michael Jordan 17/02/1963 Chicago Bulls United Center

A chasing sequence
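The chasing sequence can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the procedure from the talk: the comparison "<" in φ1 is modelled as "shorter string", the helper names (`State`, `propagate`, `order_rule`, `chase`) are hypothetical, and the link between orders and target values is reduced to "a tuple that dominates all others on A supplies te[A], and a tuple carrying te[A] dominates all others on A".

```python
from itertools import permutations

ATTRS = ("FN", "LN", "height", "unit", "birth", "team", "arena")
D = {  # entity instance from the slide
    "t1": dict(FN="M.", LN="J.", height="7", unit="ft", birth="1963",
               team="Chicago Bulls", arena="United Center"),
    "t2": dict(FN="Michael", LN="Jordan", height="198", unit="cm",
               birth="17/02/1963", team="Chicago", arena="Chicago Stadium"),
}
TM = dict(FN="Michael", LN="Jordan", team="Chicago Bulls", season="94-95")


class State:
    def __init__(self):
        self.te = {a: None for a in ATTRS}    # target template, all null (D0)
        self.leq = {a: set() for a in ATTRS}  # (x, y) in leq[A] means x ≼_A y

    def set_target(self, attr, value):
        if self.te[attr] not in (None, value):
            raise ValueError(f"conflicting values for te[{attr}]")
        changed = self.te[attr] is None
        self.te[attr] = value
        return changed

    def add_order(self, attr, x, y):
        if (x, y) in self.leq[attr]:
            return False
        self.leq[attr].add((x, y))
        return True

    def strictly_less(self, attr, x, y):
        return (x, y) in self.leq[attr] and D[x][attr] != D[y][attr]


def propagate(s):
    """Link orders and target: the most accurate tuple on A gives te[A]."""
    changed = False
    for a in ATTRS:
        for y in D:
            others = [x for x in D if x != y]
            if all((x, y) in s.leq[a] for x in others):
                changed |= s.set_target(a, D[y][a])
            if s.te[a] is not None and D[y][a] == s.te[a]:
                for x in others:
                    changed |= s.add_order(a, x, y)
    return changed


def phi1(s):  # assumption: "<" is modelled as "shorter string"
    keys, out = ("FN", "LN", "birth"), False
    for x, y in permutations(D, 2):
        if all(len(D[x][k]) < len(D[y][k]) for k in keys):
            for k in keys:
                out |= s.add_order(k, x, y)
    return out


def order_rule(premise, conclusion):  # builds phi2 and phi3
    def rule(s):
        out = False
        for x, y in permutations(D, 2):
            if s.strictly_less(premise, x, y):
                out |= s.add_order(conclusion, x, y)
        return out
    return rule


phi2 = order_rule("unit", "height")
phi3 = order_rule("team", "arena")


def phi4(s):  # master-data rule of form (2)
    if s.te["FN"] == TM["FN"] and s.te["LN"] == TM["LN"]:
        return s.set_target("team", TM["team"])
    return False


def chase(rules):
    s, progress = State(), True
    while progress:  # stop when no rule adds anything new
        progress = False
        for rule in rules:
            progress |= rule(s)
            progress |= propagate(s)
    return s


result = chase([phi1, phi2, phi3, phi4])
print(result.te)
```

In this sketch φ2 never fires, because nothing orders the two tuples on unit, so height and unit stay null while the other five attributes are filled in.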

Page 6: Determining the Relative Accuracy of Attributes

Termination Problem: Every chasing sequence always terminates.

Given D (t1, t2), Dm (tm) and Σ (φ1, φ2, φ3, φ4) as on the previous slide: is every chasing sequence (φ1 φ4 φ3 .. φj ..) finite? YES

Page 7: Determining the Relative Accuracy of Attributes

D (t1, t2), Dm (tm) and the rules φ1, φ2, φ3, φ4 as before, with an additional rule φ5 in Σ.

Sequence φ1 φ4 φ3: $t_1 \preceq_{FN/LN/birth} t_2$, $t_2 \preceq_{team} t_1$, $t_2 \preceq_{arena} t_1$; $t_e[team]$ = "Chi.. B..", $t_e[arena]$ = "Unit.. C.."

Sequence with φ5: $t_1 \preceq_{team} t_2$, $t_1 \preceq_{arena} t_2$; $t_e[team]$ = "Chicago", $t_e[arena]$ = "Ch.. St.."

Do different chasing sequences coincide? NOT ALWAYS

Church-Rosser Problem: The Church-Rosser property is not guaranteed, but it can be checked in cubic time.
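The effect of rule order can be illustrated with a deliberately coarse Python sketch. It is not the cubic-time check mentioned on the slide: it is a brute-force comparison over rule orders, it tracks only the target values (the order relations are left out), and the premise of φ5 is not recoverable from the transcript, so `phi5` below is a hypothetical rule that simply asserts the effect shown on the slide. A `True` answer from this sketch is only evidence, not a proof, since it does not explore every possible chasing sequence.

```python
from itertools import permutations

T1 = {"team": "Chicago Bulls", "arena": "United Center"}
T2 = {"FN": "Michael", "LN": "Jordan", "birth": "17/02/1963",
      "team": "Chicago", "arena": "Chicago Stadium"}
TM = {"FN": "Michael", "LN": "Jordan", "team": "Chicago Bulls"}


# Each rule maps the current target template to the values it wants to set.
def phi1(te):  # t2 is more accurate on FN, LN, birth
    return {a: T2[a] for a in ("FN", "LN", "birth")}


def phi3(te):  # the arena follows the tuple that supplied the team
    for t in (T1, T2):
        if te.get("team") == t["team"]:
            return {"arena": t["arena"]}
    return {}


def phi4(te):  # master-data rule
    if te.get("FN") == TM["FN"] and te.get("LN") == TM["LN"]:
        return {"team": TM["team"]}
    return {}


def phi5(te):  # hypothetical: premise unknown, only the effect is modelled
    return {"team": T2["team"]}


def chase(rules):
    te, changed = {}, True
    while changed:
        changed = False
        for rule in rules:
            new = rule(te)
            # a step is valid only if it contradicts nothing deduced so far
            if any(a in te and te[a] != v for a, v in new.items()):
                continue
            if any(a not in te for a in new):
                te.update(new)
                changed = True
    return te


def looks_church_rosser(rules):
    """Brute force: do all rule orders end in the same target values?"""
    results = {tuple(sorted(chase(order).items()))
               for order in permutations(rules)}
    return len(results) == 1


print(looks_church_rosser([phi1, phi3, phi4]))
print(looks_church_rosser([phi1, phi3, phi4, phi5]))
```

With φ5 in the rule set, whichever of φ4 and φ5 fires first fixes the team, and φ3 then derives the matching arena, so the final target depends on the order.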

Page 8: Determining the Relative Accuracy of Attributes

Fundamental Problems: Deducing candidate targets

Chasing D (t1, t2) and Dm (tm) with Σ = {φ1, φ2, φ3, φ4} along φ1 φ4 φ3 yields the target tuple:

      FN       LN      height  unit  date of birth  team           arena
te    Michael  Jordan  ?       ?     17/02/1963     Chicago Bulls  United Center

Target tuple may be incomplete: the need to find candidate targets

te'1, te'2:
Michael Jordan 198 cm 17/02/1963 Chicago Chicago Stadium
Michael Jordan 7 ft 17/02/1963 Chicago Bulls United Center

Do candidate targets always exist? NOT ALWAYS

Page 9: Determining the Relative Accuracy of Attributes

Fundamental Problems: Deducing candidate targets

D (t1, t2), Dm (tm) and the rules φ1, φ2, φ3, φ4 as before, with an additional rule φ6 in Σ.

      FN       LN      height  unit  date of birth  team           arena
te    Michael  Jordan  ?       ?     17/02/1963     Chicago Bulls  United Center

te'1, te'2 (φ2, φ6):
Michael Jordan 198 cm 17/02/1963 Chicago Chicago Stadium
Michael Jordan 7 ft 17/02/1963 Chicago Bulls United Center

It is NP-complete to determine whether there exist candidate targets

There can be exponentially or even infinitely many candidate targets

Page 10: Determining the Relative Accuracy of Attributes

Fundamental Problems: Top-k candidate targets

D (t1, t2), Dm (tm) and Σ = {φ1, φ2, φ3, φ4} as before.

      FN       LN      height  unit  date of birth  team           arena
te    Michael  Jordan  ?       ?     17/02/1963     Chicago Bulls  United Center

Preference model: (k, p(.)): p(.) is any monotone scoring function (e.g., occurrences)

Top-k candidate targets problem: whether there exists a k-set Te such that p(Te) > C

The Top-k candidate targets problem is NP-complete

te'1, te'2:
Michael Jordan 198 cm 17/02/1963 Chicago Chicago Stadium
Michael Jordan 7 ft 17/02/1963 Chicago Bulls United Center

K=2, p(Te) = 14
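As a rough illustration of the preference model, the sketch below takes p(.) to be an occurrence count: each value of a candidate scores the number of times it appears in the same column of D, and a set of candidates scores the sum over its members. This particular scoring function, and the names `p_tuple`, `p` and `exists_k_set`, are assumptions made for the example; the slide only requires p(.) to be monotone. The decision problem is answered here by brute force over all k-subsets.

```python
from collections import Counter
from itertools import combinations

ATTRS = ("FN", "LN", "height", "unit", "birth", "team", "arena")
D = [  # entity instance from the slide
    ("M.", "J.", "7", "ft", "1963", "Chicago Bulls", "United Center"),
    ("Michael", "Jordan", "198", "cm", "17/02/1963", "Chicago", "Chicago Stadium"),
]
# Per-attribute occurrence counts of each value in D.
COUNTS = [Counter(row[i] for row in D) for i in range(len(ATTRS))]


def p_tuple(t):
    """Score of one candidate: sum of the occurrence counts of its values."""
    return sum(COUNTS[i][v] for i, v in enumerate(t))


def p(Te):
    """Score of a set of candidates (assumed additive)."""
    return sum(p_tuple(t) for t in Te)


def exists_k_set(candidates, k, C):
    """Brute-force decision: is there a k-set Te with p(Te) > C?"""
    return any(p(Te) > C for Te in combinations(candidates, k))


candidates = [  # the two rows listed on the slide
    ("Michael", "Jordan", "198", "cm", "17/02/1963", "Chicago", "Chicago Stadium"),
    ("Michael", "Jordan", "7", "ft", "17/02/1963", "Chicago Bulls", "United Center"),
]
print(p(candidates))
print(exists_k_set(candidates, 2, 13))
```

Under this assumed scoring every value of both rows occurs exactly once in its column of D, so each row scores 7 and the pair scores 14, which is consistent with the figure on the slide.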

Page 11: Determining the Relative Accuracy of Attributes

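A framework for deducing target tuples

Input: a specification $S = (D_0, \Sigma, I_m, t_e^{D_0})$ and a preference model (k, p(.))

1. Is S Church-Rosser? (IsCR) If no: feedback from the user.
2. If yes: is a complete te derived? If yes: return te.
3. If no: compute the top-k candidate targets Te (RankJoinCT, TopKCT, TopKCTh).
4. The user selects a t'e from Te or gives feedback, and the process repeats.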
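The control flow of the framework can be written down as a small driver loop. The sketch below is an assumption-laden outline rather than the actual system: the four callables (`is_cr`, `chase`, `top_k`, `ask_user`), the shape of `spec`, and the `max_rounds` cap are all hypothetical and stand in for IsCR, the chase, a top-k algorithm and the user interaction.

```python
def deduce_target(spec, is_cr, chase, top_k, ask_user, max_rounds=10):
    """Return a complete target tuple, or None if max_rounds is exhausted."""
    for _ in range(max_rounds):
        if not is_cr(spec):
            # Not Church-Rosser: the user has to revise the specification.
            spec = ask_user(spec, None)
            continue
        te = chase(spec)
        if all(v is not None for v in te.values()):
            return te  # complete target tuple derived
        # Incomplete: suggest candidates and fold the user's choice back in.
        candidates = top_k(spec, te)
        spec = ask_user(spec, candidates)
    return None


# Tiny stand-ins to exercise the loop (all hypothetical).
spec0 = {"te": {"team": "Chicago Bulls", "height": None}}


def always_cr(spec):
    return True


def copy_template(spec):
    return dict(spec["te"])


def one_candidate(spec, te):
    return [{"height": "198"}]


def pick_first(spec, candidates):
    te = dict(spec["te"])
    te.update(candidates[0])
    return {"te": te}


final = deduce_target(spec0, always_cr, copy_template, one_candidate, pick_first)
print(final)
```

Passing the steps in as functions keeps the loop independent of how the Church-Rosser check or the top-k computation is implemented.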

Page 12: Determining the Relative Accuracy of Attributes

Algorithms

Checking Church-Rosser property: IsCR

• (2+))

Top-k candidate targets

• RankJoinCT: rank-join-based top-k algorithm
• TopKCT: priority-queue-based
• TopKCTh: heuristic

Page 13: Determining the Relative Accuracy of Attributes

TopKCT: Brodal Queue based Top-k algorithm

Input:
• A Church-Rosser specification S
• Preference model (k, p(.))
• A heap for each attribute A with null values in te

Output:
• The set Te of top-k scored candidate targets w.r.t. (k, p(.))

D:

      FN       LN      height  unit  date of birth  team           arena
t1    M.       J.      7       ft    1963           Chicago Bulls  United Center
t2    Michael  Jordan  198     cm    17/02/1963     Chicago        Chicago Stadium
t3    Michael  Jordan  200     cm    17/02/1963     Chicago        Chicago Stadium
t4    Michael  Jordan  198     m     17/02/1963     Chicago Bulls  United Center

te    Michael  Jordan  ?       ?     17/02/1963     Chicago Bulls  United Center

Heaps: Hheight = {7, 198, 200}, Hunit = {ft, cm, m}

Top-1: te[height, unit] = (198, cm)
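The idea of ranking completions from per-attribute value lists can be sketched as a lazy best-first enumeration. This is not TopKCT itself: Python's `heapq` (a binary heap) stands in for the Brodal queue, the score is assumed to be an additive occurrence count, and the check that a completion really is a candidate target is omitted. The function name `top_k` and the `domains` layout are hypothetical.

```python
import heapq
from collections import Counter


def top_k(domains, k):
    """domains: {attr: {value: score}}, each non-empty.
    Returns up to k (score, assignment) pairs, best first."""
    attrs = sorted(domains)
    # Values of each attribute, best score first (ties broken by value).
    ranked = [sorted(domains[a].items(), key=lambda kv: (-kv[1], kv[0]))
              for a in attrs]

    def total(idx):
        return sum(ranked[i][j][1] for i, j in enumerate(idx))

    start = (0,) * len(attrs)          # best value of every attribute
    frontier = [(-total(start), start)]
    seen = {start}
    out = []
    while frontier and len(out) < k:   # early termination after k results
        neg, idx = heapq.heappop(frontier)
        out.append((-neg, {a: ranked[i][j][0]
                           for i, (a, j) in enumerate(zip(attrs, idx))}))
        # Successors: move one attribute to its next-best value.
        for i in range(len(idx)):
            nxt = idx[:i] + (idx[i] + 1,) + idx[i + 1:]
            if nxt[i] < len(ranked[i]) and nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-total(nxt), nxt))
    return out


# (height, unit) values of t1..t4 from the slide
rows = [("7", "ft"), ("198", "cm"), ("200", "cm"), ("198", "m")]
domains = {
    "height": Counter(h for h, _ in rows),
    "unit": Counter(u for _, u in rows),
}
print(top_k(domains, 1))
```

Because each list is sorted by score and the total is a sum, a successor never scores higher than the combination it was generated from, so the first k combinations popped are the k best and the remaining combinations are never materialised.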

Page 14: Determining the Relative Accuracy of Attributes

TopKCT: Brodal Queue based Top-k algorithm

Early termination: Stops as soon as top-k candidate targets are found.

Instance Optimal: w.r.t. the number of visits of each heap with optimality ratio 1.

TopKCT has early termination property and is Instance Optimal.

An algorithm A is said to be instance optimal if there exist constants c1 and c2 such that

$cost(A, I) \le c_1 \cdot cost(A', I) + c_2$

for all instances I and all algorithms A' in the same setting as A. The constant c1 is the optimality ratio.

Page 15: Determining the Relative Accuracy of Attributes

Experimental Study: Settings

Datasets

o Med: sale records of medicines from various stores
  o 10K tuples for 2.7K entries; 2.4K tuples as master data; 105 ARs
o CFP: call for papers/participation found by Google
  o 503 tuples for 100 entries; 55 tuples as master data; 43 ARs
o Rest: Restaurant data*
  o 246 tuples 5149 entries; 131 ARs
o Syn: Synthetic data generator
  o 20 attributes; ARs: 75% of form (1), 25% of form (2)

Implementation

o 64 bit Linux Amazon EC2 High-CPU Extra Large Instance
o 7GB of memory and 20 EC2 Compute Units

Page 16: Determining the Relative Accuracy of Attributes

Experimental Study: IsCR

Effectiveness of IsCR

• Complete target tuples: Complete target tuples could be deduced for over 2/3 of the entries without user interaction

• Non-null values: over 70% when both ARs of form(1) and (2) are used

Page 17: Determining the Relative Accuracy of Attributes

Experimental Study: candidate targets

Computing top-k candidates

• k doesn’t have to be large: k=15 suffices for over 85% of the entries
• Master data does help, but even when it is not available, TopKCT still works well

Page 18: Determining the Relative Accuracy of Attributes

Experimental Study: user interaction

User interactions

• Few rounds of interactions are needed to deduce the targets for all the entries:
  • at most 3 for Med and 4 for CFP

Page 19: Determining the Relative Accuracy of Attributes

Experimental Study: efficiency

• Efficiency

Page 20: Determining the Relative Accuracy of Attributes

Experimental Study: efficiency

• Efficiency

• For Syn with ||Ie|| = 1500, ||Im|| = 300 and =50, TopKCTh, TopKCT and RankJoinCT took 159ms, 271ms and 1983 ms, respectively.

Page 21: Determining the Relative Accuracy of Attributes

Conclusion

Summary
• A model for determining relative accuracy
• Fundamental problems
• A framework for deducing relative accuracy
• Algorithms underlying the framework

Outlook
1. Discovery of ARs
2. Improve the accuracy of data in a database

