on data dependencies in dataspaceson data dependencies in dataspaces introduction 1/24 shaoxu song...

On Data Dependencies in Dataspaces

Shaoxu Song

Tsinghua University

This is a joint work with Lei Chen (HKUST) and Philip S. Yu (UIC)

[email protected]

2011

On Data Dependencies in Dataspaces Introduction 1/24

Shaoxu Song [email protected]

Dataspaces

provide a co-existing system of heterogeneous data

consider three levels of elements,object : {(attribute : value)}

ExampleWe consider a dataspace with following objects,

t1 : {(name : iPod), (color : red), (manu : Apple Inc.), (tel : 567),

(addr : InfiniteLoop,CA), (website : itunes.com)};

t2 : {(name : iPod), (color : cardinal), (prod : Apple), (tel : 123),

(post : InfiniteLoop,Cupert), (website : apple.com)};

t3 : {(name : iPad), (color : white), (manu : Apple Inc.),

(post : InfiniteLoop), (website : apple.com), (phn : 567)}.



Comparable Correspondence

Relationship between elements in heterogeneous data

metric operator‘manu ≈≤5 prod’any two respective values ofmanu andprod are saidcomparable, e.g.,Apple Inc andApple,if their edit distance is≤ 5.

matching operator‘color ⇋ color’e.g.,red andcardinal are saidmatchedas comparablecolor, viausers’ feedback

often incrementally recognized in a pay-as-you-go style

A query of(manu : Apple)

search value similar toApple in bothmanu andprod

e.g.,(manu : Apple Inc.) in t1 and(prod : Apple) in t2



Data Dependencies

For wider applications

integrity constraints, schema design

optimizing query evaluation, capturing data inconsistency,removing data duplicates

Conventional data dependencies not directly applicable todataspaces

often defined on the equality function

functional dependencies (FDs),X → A

specify the constraint of equality between the values of twoobjects on the same attribute

e.g.,manu → addr

cannot address the comparable correspondence, in(manu,prod)or (addr,post)



Comparable FunctionSpecify constraints on comparable attributes

θ(manu,prod) : [manu ≈≤5 manu,manu ≈≤5 prod,prod ≈≤5 prod]

Two objects are said comparable on(manu,prod) if at least one ofthese three comparison operators inθ(manu,prod) is applicable.

t1, t2 are comparable on(manu,prod), since edit distance of(t1[manu], t2[prod]) is≤ 5

t1, t3 are also comparable on(manu,prod), where(t1[manu], t3[manu]) satisfy ‘manu ≈≤5 manu’









Comparable Dependencies (CDs)

A general form of dependencies on comparable functions

ϕ1 : θ(manu,prod) → θ(addr,post)

if the manu or prod values of two products are comparable

then their correspondingaddr or post values should also becomparable

where

θ(addr,post) : [addr ≈≤9 addr,addr ≈≤9 post,post ≈≤9 post]

is another comparable function



Application ExampleQuery optimization

consider an objectt1 as the query

to query objects having values similar to(manu : Apple Inc.)and(addr : InfiniteLoop,CA) of t1search in themanu,addr attributes specified in the query,

also search in the comparable attributesprod,post according tothe comparable functionsθ(manu,prod) andθ(addr,post)

according toϕ1, rewrite the query by using(manu,prod) only









Related Work

Metric functional dependencies (MFDs)

Xδ−→ A

equality operator in the left-hand-side

similarity operator in the right-hand-side

for violation detection

e.g.,manu 2−→ addr

Matching dependencies (MDs)

[X ≈ X] → [A ⇋ A]

similarity operator in the left-hand-side

matching operator in the right-hand-side

for record matching

e.g.,[addr ≈ addr] → [tel ⇋ tel]

Outline

Introduction

Definition

Validation

Discovery

Experiment

Conclusion

On Data Dependencies in Dataspaces Definition 8/24


Comparison Operator

We consider a general form of comparison operators, which includethe previous operators.

Let Ai ↔ij Aj denote acomparison operatorbetween two attributesAi,Aj in a dataspaceS

equality operatorAi = Aj in functional dependencies(FDs)

metric operatorAi ≈λ Aj in metric functional dependencies(MFDs)

matching operatorAi ⇋ Aj in matching dependencies(MDs)

The comparision operator indicates true, if two values satisfy thecorresponding constraint.



Syntex

A generalcomparable function

θ(Ai,Aj) : [Ai ↔ii Ai,Ai ↔ij Aj,Aj ↔jj Aj ]

specifies a comparable constraint of two values from attribute Ai or Aj,according to their corresponding comparison operators.

A comparable dependency(CD) ϕ with general comparable functionsover a dataspaceS is in the form of

ϕ :∧

θ(Ai,Aj) → θ(B1,B2)

If two objects have comparable values onAi or Aj, then they musthave comparable values onB1 or B2.



Example

Consider

ϕ4 : θ(manu,prod) → θ(tel,phn)

whereθ(tel,phn) is [tel = tel, tel = phn,phn = phn]

we have(t1, t3) ≍ LHS(ϕ4) also agree(t1, t3) ≍ RHS(ϕ4)

denoted by(t1, t3) � ϕ4.









Approximate Dependencies

Due to the extremely high heterogeneity, data dependenciesmight notexactly hold in a given dataspace.

ϕ4 : θ(manu,prod) → θ(tel,phn),

e.g.,(t1, t2) ≍ LHS(ϕ4) but (t1, t2) 6≍ RHS(ϕ4)

i.e.,(t1, t2) 6� ϕ4









Measure

To evaluate how a dependency “almost” holds in a data instance

Error measure

g3(ϕ,S) =|S| − max{|T| | T ⊆ S,T � ϕ}

|S|,

the minimum number of objects that have to be removed fromthe dataspaceS for a dependencyϕ to hold.

Confidence measure

conf(ϕ,S) =max{|T| | T ⊆ S,T � ϕ}

|S|.

the maximum number of objects reserved after removingminimum objects of violations with respect toϕ.



Example

ϕ4 : θ(manu,prod) → θ(tel,phn),

Error measure

{t2} is a minimum violation set w.r.t.ϕ4

such that all the remaining objects{t1, t3} satisfyϕ4

g3 = 1/3

Confidence measure

{t1, t3} a maximum keeping set w.r.t.ϕ4

conf = 2/3







Outline

Introduction

Definition

Validation

Discovery

Experiment

Conclusion

On Data Dependencies in Dataspaces Validation 14/24


Validation Problem

Unfortunately, computation of error or confidence is generally hard

Given

a dataspaceS

a dependencyϕ

a measure requirementη

the validation problem is to decide whether or not the measure ofϕoverS satisfiesη.

E.g., to determine whetherg3(ϕ,S) ≤ 0.2 orconf(ϕ,S) ≥ 0.8.

TheoremThe error and confidence validation problems areNP-complete.



The Hardness

The transitivity cannot be assumed, i.e., from(t1, t2) ≍ θ(Ai,Ai) and(t2, t3) ≍ θ(Ai,Ai) it does not necessarily follow that(t1, t3) ≍ θ(Ai,Ai).

t1 :{(A1 : abc), . . . };

t2 :{(A1 : abcd), . . . };

t3 :{(A1 : abcde), . . . }.

E.g.,θ(A1,A1) : [A1 ≈≤1 A1] with edit distance as metricd

d(t1[A1], t2[A1]) = 1 ≤ 1

d(t2[A1], t3[A1]) = 1 ≤ 1,

but d(t1[A1], t3[A1]) = 2 > 1, that is,(t1, t3) 6≍ θ(A1,A1).

The efficient validation computation based on disjoint groupingcannot be applied in this case of comparable functions.



Approximation Computation

Compute an approximate error/confidence measure ofϕ overS

the approximate measure has a relative performance guaranteecompared with exact measure,

e.g.,g/g ≤ ρ

whereg is an approximation of exact error measureg andρ isapproximation ratio



Greedy Algorithm

greedily count both objects when a violation occurs

the complexity isO(n2)

The error approximation

outputs an estimateg with a boundg ≤ g ≤ 2g compared withthe exact error measureg

TheoremThe confidence has no constant-factor approximation unlessP=NP

confidence isNP-hard to approximate within a constant factor

g3 error and confidence are not equivalent in anapproximation-preserving way



Randomized Algorithm

Greedy algorithm still has to consider all the objects in a dataspace.

Randomized algorithm evaluates just a small subset of objects

randomly drawm samples

estimate the error/confidence measure by using the violations tothesemsamples

the estimate measure is still guaranteed by certain approximationbound with high probability

e.g., Pr[g ≤ ρg+ ǫ] ≥ δ

ǫ is an additive error

δ is a probability guarantee

m is determined byǫ andδ

Outline

Introduction

Definition

Validation

Discovery

Experiment

Conclusion

On Data Dependencies in Dataspaces Discovery 19/24


Discovery Problem

The strict dependency discovery problem

find a canonical cover of all dependencies that hold in data

a canonical cover can be exponential in # of attributes

high dimensionality in dataspaces (attributes and comparablefunctions)

highly non-trivial (if not impossible) to discover a canonicalcover of all dependencies

Thek-length dependencies

containk or less comparable functions

motivated by the concept of miningk-length itemsets inassociation rules

On Data Dependencies in Dataspaces Discovery 20/24


Pay-as-you-go Discovery

In previous work of dataspaces, comparable attributes are identified inapay-as-you-gostyle.

The discovery of dependencies should be conducted in an incrementalway as well.

given a setΣ of currently discovered dependencies

and a newly identified comparable functionsθ(Ai,Aj),

we can generate new dependencies w.r.t.θ(Ai,Aj) according tothe augmentation property

On Data Dependencies in Dataspaces Experiment 21/24


Validation EvaluationExperiments on real data sets

compute the exact/approximate measures of dependencies

observe the corresponding performance

exact computation does not scale well

approximation computation keeps significantly lower time costand scale well

0.001

0.01

0.1

1

10

100

1000

10000

300 350 400 450 500 550 600

Tim

e C

ost (

s)

# Objects

(b) time performance

GreedyRandomized

Exact

0

0.2

0.4

0.6

0.8

4k 5k 6k 7k 8k 8k 10k

Tim

e C

ost (

s)

# Objects

base

GreedyRandomized



Discovery EvaluationIllustrate the incremental discovery of dependencies withthe increaseof functions.

y-axis is scaled logarithmicallytime cost increases heavily with the number of functionsthe intrinsic hardness in discovering dependencies with respectto attributes (and the corresponding comparable functions)sizek of functions also affects the discovery performance largely

0.01

0.1

1

10

100

1000

10 20 30 40 50 60 70

Tim

e C

ost (

s)

# Functions

(a) incremental performance

k=4k=3k=2



Discovery Evaluation

The discovery algorithm scales well in large size of objects

greedy approximation is adopted for validation

verify the efficiency of approximation computation proposed forvalidating dependencies

0

500

1000

1500

2000

2500

4k 5k 6k 7k 8k 8k 10k

Tim

e C

ost (

s)

# Objects

base

k=4k=3k=2

On Data Dependencies in Dataspaces Conclusion 24/24


Conclusion

This is the first work to adapt dependencies to dataspaces with theconsideration of comparable attribute values

it is already hard to tell whether a dependency almost holds inthe data

the confidence validation is also proved hard to approximatetowithin any constant factor

propose several greedy and randomized approaches forapproximately solving the validation problem

study the pay-as-you-go discovery of dependencies fromdataspaces

on data dependencies in dataspaceson data dependencies in dataspaces introduction 1/24 shaoxu song...

Documents