cs345 compact skeletons. assume tuples components are scattered over website we have a tagger that...

45
CS345 CS345 Compact Skeletons

Post on 22-Dec-2015

219 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

CS345CS345

Compact Skeletons

Page 2: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

Compact SkeletonsCompact Skeletons

Assume tuples components are scattered over website

We have a tagger that can tag all tuple components on website– Assume no noise for now

Reconstruct relation

Page 3: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

Website

Data Graph

Skeleton

Relation

Compact SkeletonsCompact Skeletons

Page 4: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

Dept (D)

Title (T)

Salary (S)

Address (A)

Welcome to Big Corp!

Join our team.Jobs are available in these departments:

R&D

Corporate The following jobs are open:

Job #12345

Job #12346

Send resumes to:1200 Jose Blvd, CA 94123Job Title:

Salary: Must know Java…..

Programmer100K

Page 5: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

Dept (D)

Title (T)

Salary (S)

Address (A)

Welcome to Big Corp!

Join our team.Jobs are available in these departments:

R&D

Corporate The following jobs are open:

Job #12345

Job #12346

Send resumes to:1200 Jose Blvd, CA 94123Job Title:

Salary: Must know Java…..

Programmer100K

Page 6: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

R & D

Programmer 100K

1200 Jose Blvd

Corporate

Page 7: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

R & D

Programmer 100K CTO 150K

1200 Jose Blvd

Corporate

CEOAdminAsst

400 7th Ave

60K

Programmer 100K R &D 1200 Jose BlvdCTO 150K R & D 1200 Jose BlvdAdmin Asst 60K Corporate 400 7th AveCEO (null) Corporate 400 7th Ave

T S D A

Page 8: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

R & D

Programmer 100K CTO 150K

1200 Jose Blvd

Corporate

CEOAdminAsst

60K

Programmer 100K R &D 1200 Jose BlvdCTO 150K R & D 1200 Jose BlvdAdmin Asst 60K Corporate 1200 Jose BlvdCEO (null) Corporate 1200 Jose Blvd

T S D A

Page 9: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

Website

Data Graph

Skeleton

Relation

Page 10: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

SkeletonsSkeletons

Labeled treesTransformation from data graphs to

relations

D

T S

A

D

T S

A

Page 11: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

R & D

Programmer 100K CTO 150K

1200 Jose Blvd

D

T S

A

Overlays

Page 12: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

R & D

Programmer 100K CTO 150K

1200 Jose Blvd

D

T S

A

Overlays

T S D A

Programmer 100K R &D 1200 Jose Blvd

Page 13: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

R & D

Programmer 100K CTO 150K

1200 Jose Blvd

D

T S

A

Overlays

T S D A

Programmer 100K R &D 1200 Jose Blvd

CTO 150K R &D 1200 Jose Blvd

Page 14: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

D

T S

A

R & D

Programmer 100K CTO 150K

1200 Jose Blvd

Overlays

Page 15: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

D

T S

A

R & D

Programmer 100K CTO 150K

1200 Jose Blvd

T S D A

Programmer 150K R &D 1200 Jose Blvd

Overlays

Page 16: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

D

T S

A

R & D

Programmer 100K CTO 150K

1200 Jose Blvd

T S D A

Programmer 150K R &D 1200 Jose Blvd

CTO 100K R & D 1200 Jose Blvd

Overlays

Page 17: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

D

T S

A

R & D

Programmer 100K CTO 150K

1200 Jose Blvd

Inconsistent Overlays

Page 18: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

D

T S

A

R & D

Programmer 100K CTO 150K

1200 Jose Blvd

Inconsistent Overlays

Page 19: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

Compact SkeletonsCompact Skeletons

A skeleton is compact if all overlays are consistent

Perfect if each node and edge of data graph is covered by at least one overlay

Given a data graph G, does G have a Perfect Compact Skeleton (PCS)?– Not always– But if it exists it is unique

Page 20: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

R & D

Programmer 100K CTO 150K

1200 Jose Blvd

PCS Algorithm

Page 21: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

D

T S T S

A

PCS Algorithm

Work bottom-up:Compute node signaturesPlace nodes in equivalence classes based on signatureConstruct skeleton from equivalence classes

Page 22: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

D

T S T S

A

T S

A

D

PCS Algorithm

Page 23: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

Corporate

CEOAdminAsst

400 7th Ave

60K

D

T S

A

Incomplete information

Page 24: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

Corporate

CEOAdminAsst

400 7th Ave

60K

D

T S

A

T S D AAdmin Asst 60K Corporate 400 7th Ave

Incomplete information

Page 25: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

D

T S

A

Corporate

CEOAdminAsst

400 7th Ave

60K

T S D AAdmin Asst 60K Corporate 400 7th Ave

Incomplete information

CEO Corporate 400 7th Ave

Page 26: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

Partial Compact SkeletonsPartial Compact Skeletons

For data graphs with incomplete information, we allow partial overlays– Results in nulls in relation

If we can use consistent partial overlays to cover every node and edge of the graph, we have a partially perfect compact skeleton (PPCS)

Page 27: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

Tuple subsumptionTuple subsumption

Tuple t subsumes tuple u if t and u agree on every component of u that is not null

tu

Page 28: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

Noisy Data GraphsNoisy Data Graphs

Real-life websites are noisy– False positives e.g., MS = degree, state or

Microsoft?– Non-skeleton links e.g., featured products

Page 29: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

Data graph for a retail websiteData graph for a retail website

c1 c2 c3

p1 p3 p4 a3a1 a2

i1 i2 i3 i4a4 C: CategoryI: ItemP: PriceA: Availability

C

I

Skeleton K1

P A

For simplicity: assume all nodes have a label

Page 30: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

Coverage of a skeletonCoverage of a skeleton

c1 c2 c3

p1 p3 p4 a3

C

I

Skeleton K1

a1 a2

i1 i2 i3 i4a4

P A

Page 31: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

Coverage of a skeletonCoverage of a skeleton

c1 c2 c3

p1 p3 p4 a3

C

I

Skeleton K1Coverage = 28

a1 a2

i1 i2 i3 i4a4

P A

Page 32: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

Coverage of a skeletonCoverage of a skeleton

c1 c2 c3

p1 p3 p4 a3

C

I

C I

Skeleton K1Coverage = 28

Skeleton K2Coverage = 12

a1 a2

i1 i2 i3 i4a4

P A

A P

Page 33: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

Skeletons for Noisy Data GraphsSkeletons for Noisy Data Graphs

Problem:– Find skeleton K with optimal coverage,

called the best-fit skeleton (BFS)NP-complete

Page 34: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

Greedy Heuristic for BFSGreedy Heuristic for BFS

c1 c2 c3

p1 p3 p4 a3a1 a2

i1 i2 i3 i4a4

r

Page 35: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

Greedy Heuristic for BFSGreedy Heuristic for BFS

C C C

P P P A CA A

I I I IA

R

R

Label Parent Count

P I 3

A I 3

C 1

I C 4

R 1

C R 1

I

P A

Page 36: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

R

A B

CC C

D D D D

Label Parent Count

D C 4

C A 2

B 1

A R 1

B R 1

C

R

A

D

B

Greedy skeleton

Page 37: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

R

A B

CC C

D D D D

C

R

A

D

B

Greedy skeletonCoverage = 9

Page 38: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

R

A B

CC C

D D D D

C

R

A

D

B

Greedy skeletonCoverage = 9

C

R

A

D

B

Optimal skeletonCoverage = 15

Page 39: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

Weighted Greedy Heuristic Weighted Greedy Heuristic

Simple Greedy heuristic uses parent counts– “Memory-less”

Weighted Greedy heuristic takes into account past selections to improve simple greedy selection– Computes “benefit” of each decision at

every stage

Page 40: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

C

R

A

D

B

Greedy skeletonCoverage = 9

R

A B

CC C

D D D D

C

D

Weighted Greedy

Page 41: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

C

R

A

D

B

Greedy skeletonCoverage = 9

R

A B

CC C

D D D D

C

D

Weighted Greedy

A C ) = 4benefit(

Page 42: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

C

R

A

D

B

Greedy skeletonCoverage = 9

R

A B

CC C

D D D D

C

D

Weighted Greedy

A C ) = 4benefit((B C ) = 10benefit

Page 43: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

C

R

A

D

B

Greedy skeletonCoverage = 9

R

A B

CC C

D D D D

C

R

A

D

B

Weighted Greedy

Page 44: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

R

A B

CC C

D D D D

C

R

A

D

B

Greedy skeletonCoverage = 9

C

R

A

D

B

Weighted greedy skeletonCoverage = 15

Page 45: CS345 Compact Skeletons. Assume tuples components are scattered over website We have a tagger that can tag all tuple components on website –Assume no

Website

Data Graph

Compact Skeleton

Relation

SummarySummary