on data dependencies in dataspaceson data dependencies in dataspaces introduction 1/24 shaoxu song...

28
On Data Dependencies in Dataspaces Shaoxu Song Tsinghua University This is a joint work with Lei Chen (HKUST) and Philip S. Yu (UIC) [email protected] 2011

Upload: others

Post on 17-Aug-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: On Data Dependencies in DataspacesOn Data Dependencies in Dataspaces Introduction 1/24 Shaoxu Song sxsong@tsinghua.edu.cn Dataspaces provide a co-existing system of heterogeneous data

On Data Dependencies in Dataspaces

Shaoxu Song

Tsinghua University

This is a joint work with Lei Chen (HKUST) and Philip S. Yu (UIC)

[email protected]

2011

Page 2: On Data Dependencies in DataspacesOn Data Dependencies in Dataspaces Introduction 1/24 Shaoxu Song sxsong@tsinghua.edu.cn Dataspaces provide a co-existing system of heterogeneous data

On Data Dependencies in Dataspaces Introduction 1/24

Shaoxu Song [email protected]

Dataspaces

provide a co-existing system of heterogeneous data

consider three levels of elements,object : {(attribute : value)}

ExampleWe consider a dataspace with following objects,

t1 : {(name : iPod), (color : red), (manu : Apple Inc.), (tel : 567),

(addr : InfiniteLoop,CA), (website : itunes.com)};

t2 : {(name : iPod), (color : cardinal), (prod : Apple), (tel : 123),

(post : InfiniteLoop,Cupert), (website : apple.com)};

t3 : {(name : iPad), (color : white), (manu : Apple Inc.),

(post : InfiniteLoop), (website : apple.com), (phn : 567)}.

Page 3: On Data Dependencies in DataspacesOn Data Dependencies in Dataspaces Introduction 1/24 Shaoxu Song sxsong@tsinghua.edu.cn Dataspaces provide a co-existing system of heterogeneous data

On Data Dependencies in Dataspaces Introduction 2/24

Shaoxu Song [email protected]

Comparable Correspondence

Relationship between elements in heterogeneous data

metric operator‘manu ≈≤5 prod’any two respective values ofmanu andprod are saidcomparable, e.g.,Apple Inc andApple,if their edit distance is≤ 5.

matching operator‘color ⇋ color’e.g.,red andcardinal are saidmatchedas comparablecolor, viausers’ feedback

often incrementally recognized in a pay-as-you-go style

A query of(manu : Apple)

search value similar toApple in bothmanu andprod

e.g.,(manu : Apple Inc.) in t1 and(prod : Apple) in t2

Page 4: On Data Dependencies in DataspacesOn Data Dependencies in Dataspaces Introduction 1/24 Shaoxu Song sxsong@tsinghua.edu.cn Dataspaces provide a co-existing system of heterogeneous data

On Data Dependencies in Dataspaces Introduction 3/24

Shaoxu Song [email protected]

Data Dependencies

For wider applications

integrity constraints, schema design

optimizing query evaluation, capturing data inconsistency,removing data duplicates

Conventional data dependencies not directly applicable todataspaces

often defined on the equality function

functional dependencies (FDs),X → A

specify the constraint of equality between the values of twoobjects on the same attribute

e.g.,manu → addr

cannot address the comparable correspondence, in(manu,prod)or (addr,post)

Page 5: On Data Dependencies in DataspacesOn Data Dependencies in Dataspaces Introduction 1/24 Shaoxu Song sxsong@tsinghua.edu.cn Dataspaces provide a co-existing system of heterogeneous data

On Data Dependencies in Dataspaces Introduction 4/24

Shaoxu Song [email protected]

Comparable FunctionSpecify constraints on comparable attributes

θ(manu,prod) : [manu ≈≤5 manu,manu ≈≤5 prod,prod ≈≤5 prod]

Two objects are said comparable on(manu,prod) if at least one ofthese three comparison operators inθ(manu,prod) is applicable.

t1, t2 are comparable on(manu,prod), since edit distance of(t1[manu], t2[prod]) is≤ 5

t1, t3 are also comparable on(manu,prod), where(t1[manu], t3[manu]) satisfy ‘manu ≈≤5 manu’

t1 : {(name : iPod), (color : red), (manu : Apple Inc.), (tel : 567),

(addr : InfiniteLoop,CA), (website : itunes.com)};

t2 : {(name : iPod), (color : cardinal), (prod : Apple), (tel : 123),

(post : InfiniteLoop,Cupert), (website : apple.com)};

t3 : {(name : iPad), (color : white), (manu : Apple Inc.),

(post : InfiniteLoop), (website : apple.com), (phn : 567)}.

Page 6: On Data Dependencies in DataspacesOn Data Dependencies in Dataspaces Introduction 1/24 Shaoxu Song sxsong@tsinghua.edu.cn Dataspaces provide a co-existing system of heterogeneous data

On Data Dependencies in Dataspaces Introduction 5/24

Shaoxu Song [email protected]

Comparable Dependencies (CDs)

A general form of dependencies on comparable functions

ϕ1 : θ(manu,prod) → θ(addr,post)

if the manu or prod values of two products are comparable

then their correspondingaddr or post values should also becomparable

where

θ(addr,post) : [addr ≈≤9 addr,addr ≈≤9 post,post ≈≤9 post]

is another comparable function

Page 7: On Data Dependencies in DataspacesOn Data Dependencies in Dataspaces Introduction 1/24 Shaoxu Song sxsong@tsinghua.edu.cn Dataspaces provide a co-existing system of heterogeneous data

On Data Dependencies in Dataspaces Introduction 6/24

Shaoxu Song [email protected]

Application ExampleQuery optimization

consider an objectt1 as the query

to query objects having values similar to(manu : Apple Inc.)and(addr : InfiniteLoop,CA) of t1search in themanu,addr attributes specified in the query,

also search in the comparable attributesprod,post according tothe comparable functionsθ(manu,prod) andθ(addr,post)

according toϕ1, rewrite the query by using(manu,prod) only

t1 : {(name : iPod), (color : red), (manu : Apple Inc.), (tel : 567),

(addr : InfiniteLoop,CA), (website : itunes.com)};

t2 : {(name : iPod), (color : cardinal), (prod : Apple), (tel : 123),

(post : InfiniteLoop,Cupert), (website : apple.com)};

t3 : {(name : iPad), (color : white), (manu : Apple Inc.),

(post : InfiniteLoop), (website : apple.com), (phn : 567)}.

Page 8: On Data Dependencies in DataspacesOn Data Dependencies in Dataspaces Introduction 1/24 Shaoxu Song sxsong@tsinghua.edu.cn Dataspaces provide a co-existing system of heterogeneous data

On Data Dependencies in Dataspaces Introduction 7/24

Shaoxu Song [email protected]

Related Work

Metric functional dependencies (MFDs)

Xδ−→ A

equality operator in the left-hand-side

similarity operator in the right-hand-side

for violation detection

e.g.,manu 2−→ addr

Matching dependencies (MDs)

[X ≈ X] → [A ⇋ A]

similarity operator in the left-hand-side

matching operator in the right-hand-side

for record matching

e.g.,[addr ≈ addr] → [tel ⇋ tel]

Page 9: On Data Dependencies in DataspacesOn Data Dependencies in Dataspaces Introduction 1/24 Shaoxu Song sxsong@tsinghua.edu.cn Dataspaces provide a co-existing system of heterogeneous data

Outline

Introduction

Definition

Validation

Discovery

Experiment

Conclusion

Page 10: On Data Dependencies in DataspacesOn Data Dependencies in Dataspaces Introduction 1/24 Shaoxu Song sxsong@tsinghua.edu.cn Dataspaces provide a co-existing system of heterogeneous data

On Data Dependencies in Dataspaces Definition 8/24

Shaoxu Song [email protected]

Comparison Operator

We consider a general form of comparison operators, which includethe previous operators.

Let Ai ↔ij Aj denote acomparison operatorbetween two attributesAi,Aj in a dataspaceS

equality operatorAi = Aj in functional dependencies(FDs)

metric operatorAi ≈λ Aj in metric functional dependencies(MFDs)

matching operatorAi ⇋ Aj in matching dependencies(MDs)

The comparision operator indicates true, if two values satisfy thecorresponding constraint.

Page 11: On Data Dependencies in DataspacesOn Data Dependencies in Dataspaces Introduction 1/24 Shaoxu Song sxsong@tsinghua.edu.cn Dataspaces provide a co-existing system of heterogeneous data

On Data Dependencies in Dataspaces Definition 9/24

Shaoxu Song [email protected]

Syntex

A generalcomparable function

θ(Ai,Aj) : [Ai ↔ii Ai,Ai ↔ij Aj,Aj ↔jj Aj ]

specifies a comparable constraint of two values from attribute Ai or Aj,according to their corresponding comparison operators.

A comparable dependency(CD) ϕ with general comparable functionsover a dataspaceS is in the form of

ϕ :∧

θ(Ai,Aj) → θ(B1,B2)

If two objects have comparable values onAi or Aj, then they musthave comparable values onB1 or B2.

Page 12: On Data Dependencies in DataspacesOn Data Dependencies in Dataspaces Introduction 1/24 Shaoxu Song sxsong@tsinghua.edu.cn Dataspaces provide a co-existing system of heterogeneous data

On Data Dependencies in Dataspaces Definition 10/24

Shaoxu Song [email protected]

Example

Consider

ϕ4 : θ(manu,prod) → θ(tel,phn)

whereθ(tel,phn) is [tel = tel, tel = phn,phn = phn]

we have(t1, t3) ≍ LHS(ϕ4) also agree(t1, t3) ≍ RHS(ϕ4)

denoted by(t1, t3) � ϕ4.

t1 : {(name : iPod), (color : red), (manu : Apple Inc.), (tel : 567),

(addr : InfiniteLoop,CA), (website : itunes.com)};

t2 : {(name : iPod), (color : cardinal), (prod : Apple), (tel : 123),

(post : InfiniteLoop,Cupert), (website : apple.com)};

t3 : {(name : iPad), (color : white), (manu : Apple Inc.),

(post : InfiniteLoop), (website : apple.com), (phn : 567)}.

Page 13: On Data Dependencies in DataspacesOn Data Dependencies in Dataspaces Introduction 1/24 Shaoxu Song sxsong@tsinghua.edu.cn Dataspaces provide a co-existing system of heterogeneous data

On Data Dependencies in Dataspaces Definition 11/24

Shaoxu Song [email protected]

Approximate Dependencies

Due to the extremely high heterogeneity, data dependenciesmight notexactly hold in a given dataspace.

ϕ4 : θ(manu,prod) → θ(tel,phn),

e.g.,(t1, t2) ≍ LHS(ϕ4) but (t1, t2) 6≍ RHS(ϕ4)

i.e.,(t1, t2) 6� ϕ4

t1 : {(name : iPod), (color : red), (manu : Apple Inc.), (tel : 567),

(addr : InfiniteLoop,CA), (website : itunes.com)};

t2 : {(name : iPod), (color : cardinal), (prod : Apple), (tel : 123),

(post : InfiniteLoop,Cupert), (website : apple.com)};

t3 : {(name : iPad), (color : white), (manu : Apple Inc.),

(post : InfiniteLoop), (website : apple.com), (phn : 567)}.

Page 14: On Data Dependencies in DataspacesOn Data Dependencies in Dataspaces Introduction 1/24 Shaoxu Song sxsong@tsinghua.edu.cn Dataspaces provide a co-existing system of heterogeneous data

On Data Dependencies in Dataspaces Definition 12/24

Shaoxu Song [email protected]

Measure

To evaluate how a dependency “almost” holds in a data instance

Error measure

g3(ϕ,S) =|S| − max{|T| | T ⊆ S,T � ϕ}

|S|,

the minimum number of objects that have to be removed fromthe dataspaceS for a dependencyϕ to hold.

Confidence measure

conf(ϕ,S) =max{|T| | T ⊆ S,T � ϕ}

|S|.

the maximum number of objects reserved after removingminimum objects of violations with respect toϕ.

Page 15: On Data Dependencies in DataspacesOn Data Dependencies in Dataspaces Introduction 1/24 Shaoxu Song sxsong@tsinghua.edu.cn Dataspaces provide a co-existing system of heterogeneous data

On Data Dependencies in Dataspaces Definition 13/24

Shaoxu Song [email protected]

Example

ϕ4 : θ(manu,prod) → θ(tel,phn),

Error measure

{t2} is a minimum violation set w.r.t.ϕ4

such that all the remaining objects{t1, t3} satisfyϕ4

g3 = 1/3

Confidence measure

{t1, t3} a maximum keeping set w.r.t.ϕ4

conf = 2/3

t1 : {(name : iPod), (color : red), (manu : Apple Inc.), (tel : 567),

(addr : InfiniteLoop,CA), (website : itunes.com)};

t2 : {(name : iPod), (color : cardinal), (prod : Apple), (tel : 123),

(post : InfiniteLoop,Cupert), (website : apple.com)};

t3 : {(name : iPad), (color : white), (manu : Apple Inc.),

(post : InfiniteLoop), (website : apple.com), (phn : 567)}.

Page 16: On Data Dependencies in DataspacesOn Data Dependencies in Dataspaces Introduction 1/24 Shaoxu Song sxsong@tsinghua.edu.cn Dataspaces provide a co-existing system of heterogeneous data

Outline

Introduction

Definition

Validation

Discovery

Experiment

Conclusion

Page 17: On Data Dependencies in DataspacesOn Data Dependencies in Dataspaces Introduction 1/24 Shaoxu Song sxsong@tsinghua.edu.cn Dataspaces provide a co-existing system of heterogeneous data

On Data Dependencies in Dataspaces Validation 14/24

Shaoxu Song [email protected]

Validation Problem

Unfortunately, computation of error or confidence is generally hard

Given

a dataspaceS

a dependencyϕ

a measure requirementη

the validation problem is to decide whether or not the measure ofϕoverS satisfiesη.

E.g., to determine whetherg3(ϕ,S) ≤ 0.2 orconf(ϕ,S) ≥ 0.8.

TheoremThe error and confidence validation problems areNP-complete.

Page 18: On Data Dependencies in DataspacesOn Data Dependencies in Dataspaces Introduction 1/24 Shaoxu Song sxsong@tsinghua.edu.cn Dataspaces provide a co-existing system of heterogeneous data

On Data Dependencies in Dataspaces Validation 15/24

Shaoxu Song [email protected]

The Hardness

The transitivity cannot be assumed, i.e., from(t1, t2) ≍ θ(Ai,Ai) and(t2, t3) ≍ θ(Ai,Ai) it does not necessarily follow that(t1, t3) ≍ θ(Ai,Ai).

t1 :{(A1 : abc), . . . };

t2 :{(A1 : abcd), . . . };

t3 :{(A1 : abcde), . . . }.

E.g.,θ(A1,A1) : [A1 ≈≤1 A1] with edit distance as metricd

d(t1[A1], t2[A1]) = 1 ≤ 1

d(t2[A1], t3[A1]) = 1 ≤ 1,

but d(t1[A1], t3[A1]) = 2 > 1, that is,(t1, t3) 6≍ θ(A1,A1).

The efficient validation computation based on disjoint groupingcannot be applied in this case of comparable functions.

Page 19: On Data Dependencies in DataspacesOn Data Dependencies in Dataspaces Introduction 1/24 Shaoxu Song sxsong@tsinghua.edu.cn Dataspaces provide a co-existing system of heterogeneous data

On Data Dependencies in Dataspaces Validation 16/24

Shaoxu Song [email protected]

Approximation Computation

Compute an approximate error/confidence measure ofϕ overS

the approximate measure has a relative performance guaranteecompared with exact measure,

e.g.,g/g ≤ ρ

whereg is an approximation of exact error measureg andρ isapproximation ratio

Page 20: On Data Dependencies in DataspacesOn Data Dependencies in Dataspaces Introduction 1/24 Shaoxu Song sxsong@tsinghua.edu.cn Dataspaces provide a co-existing system of heterogeneous data

On Data Dependencies in Dataspaces Validation 17/24

Shaoxu Song [email protected]

Greedy Algorithm

greedily count both objects when a violation occurs

the complexity isO(n2)

The error approximation

outputs an estimateg with a boundg ≤ g ≤ 2g compared withthe exact error measureg

TheoremThe confidence has no constant-factor approximation unlessP=NP

confidence isNP-hard to approximate within a constant factor

g3 error and confidence are not equivalent in anapproximation-preserving way

Page 21: On Data Dependencies in DataspacesOn Data Dependencies in Dataspaces Introduction 1/24 Shaoxu Song sxsong@tsinghua.edu.cn Dataspaces provide a co-existing system of heterogeneous data

On Data Dependencies in Dataspaces Validation 18/24

Shaoxu Song [email protected]

Randomized Algorithm

Greedy algorithm still has to consider all the objects in a dataspace.

Randomized algorithm evaluates just a small subset of objects

randomly drawm samples

estimate the error/confidence measure by using the violations tothesemsamples

the estimate measure is still guaranteed by certain approximationbound with high probability

e.g., Pr[g ≤ ρg+ ǫ] ≥ δ

ǫ is an additive error

δ is a probability guarantee

m is determined byǫ andδ

Page 22: On Data Dependencies in DataspacesOn Data Dependencies in Dataspaces Introduction 1/24 Shaoxu Song sxsong@tsinghua.edu.cn Dataspaces provide a co-existing system of heterogeneous data

Outline

Introduction

Definition

Validation

Discovery

Experiment

Conclusion

Page 23: On Data Dependencies in DataspacesOn Data Dependencies in Dataspaces Introduction 1/24 Shaoxu Song sxsong@tsinghua.edu.cn Dataspaces provide a co-existing system of heterogeneous data

On Data Dependencies in Dataspaces Discovery 19/24

Shaoxu Song [email protected]

Discovery Problem

The strict dependency discovery problem

find a canonical cover of all dependencies that hold in data

a canonical cover can be exponential in # of attributes

high dimensionality in dataspaces (attributes and comparablefunctions)

highly non-trivial (if not impossible) to discover a canonicalcover of all dependencies

Thek-length dependencies

containk or less comparable functions

motivated by the concept of miningk-length itemsets inassociation rules

Page 24: On Data Dependencies in DataspacesOn Data Dependencies in Dataspaces Introduction 1/24 Shaoxu Song sxsong@tsinghua.edu.cn Dataspaces provide a co-existing system of heterogeneous data

On Data Dependencies in Dataspaces Discovery 20/24

Shaoxu Song [email protected]

Pay-as-you-go Discovery

In previous work of dataspaces, comparable attributes are identified inapay-as-you-gostyle.

The discovery of dependencies should be conducted in an incrementalway as well.

given a setΣ of currently discovered dependencies

and a newly identified comparable functionsθ(Ai,Aj),

we can generate new dependencies w.r.t.θ(Ai,Aj) according tothe augmentation property

Page 25: On Data Dependencies in DataspacesOn Data Dependencies in Dataspaces Introduction 1/24 Shaoxu Song sxsong@tsinghua.edu.cn Dataspaces provide a co-existing system of heterogeneous data

On Data Dependencies in Dataspaces Experiment 21/24

Shaoxu Song [email protected]

Validation EvaluationExperiments on real data sets

compute the exact/approximate measures of dependencies

observe the corresponding performance

exact computation does not scale well

approximation computation keeps significantly lower time costand scale well

0.001

0.01

0.1

1

10

100

1000

10000

300 350 400 450 500 550 600

Tim

e C

ost (

s)

# Objects

(b) time performance

GreedyRandomized

Exact

0

0.2

0.4

0.6

0.8

4k 5k 6k 7k 8k 8k 10k

Tim

e C

ost (

s)

# Objects

base

GreedyRandomized

Page 26: On Data Dependencies in DataspacesOn Data Dependencies in Dataspaces Introduction 1/24 Shaoxu Song sxsong@tsinghua.edu.cn Dataspaces provide a co-existing system of heterogeneous data

On Data Dependencies in Dataspaces Experiment 22/24

Shaoxu Song [email protected]

Discovery EvaluationIllustrate the incremental discovery of dependencies withthe increaseof functions.

y-axis is scaled logarithmicallytime cost increases heavily with the number of functionsthe intrinsic hardness in discovering dependencies with respectto attributes (and the corresponding comparable functions)sizek of functions also affects the discovery performance largely

0.01

0.1

1

10

100

1000

10 20 30 40 50 60 70

Tim

e C

ost (

s)

# Functions

(a) incremental performance

k=4k=3k=2

Page 27: On Data Dependencies in DataspacesOn Data Dependencies in Dataspaces Introduction 1/24 Shaoxu Song sxsong@tsinghua.edu.cn Dataspaces provide a co-existing system of heterogeneous data

On Data Dependencies in Dataspaces Experiment 23/24

Shaoxu Song [email protected]

Discovery Evaluation

The discovery algorithm scales well in large size of objects

greedy approximation is adopted for validation

verify the efficiency of approximation computation proposed forvalidating dependencies

0

500

1000

1500

2000

2500

4k 5k 6k 7k 8k 8k 10k

Tim

e C

ost (

s)

# Objects

base

k=4k=3k=2

Page 28: On Data Dependencies in DataspacesOn Data Dependencies in Dataspaces Introduction 1/24 Shaoxu Song sxsong@tsinghua.edu.cn Dataspaces provide a co-existing system of heterogeneous data

On Data Dependencies in Dataspaces Conclusion 24/24

Shaoxu Song [email protected]

Conclusion

This is the first work to adapt dependencies to dataspaces with theconsideration of comparable attribute values

it is already hard to tell whether a dependency almost holds inthe data

the confidence validation is also proved hard to approximatetowithin any constant factor

propose several greedy and randomized approaches forapproximately solving the validation problem

study the pay-as-you-go discovery of dependencies fromdataspaces