On Data Dependencies in Dataspaces
Shaoxu Song
Tsinghua University
This is joint work with Lei Chen (HKUST) and Philip S. Yu (UIC)
2011
On Data Dependencies in Dataspaces Introduction 1/24
Shaoxu Song [email protected]
Dataspaces
provide a co-existing system of heterogeneous data
consider three levels of elements, object : {(attribute : value)}
Example
We consider a dataspace with the following objects,
t1 : {(name : iPod), (color : red), (manu : Apple Inc.), (tel : 567),
(addr : InfiniteLoop,CA), (website : itunes.com)};
t2 : {(name : iPod), (color : cardinal), (prod : Apple), (tel : 123),
(post : InfiniteLoop,Cupert), (website : apple.com)};
t3 : {(name : iPad), (color : white), (manu : Apple Inc.),
(post : InfiniteLoop), (website : apple.com), (phn : 567)}.
Comparable Correspondence
Relationship between elements in heterogeneous data
metric operator ‘manu ≈≤5 prod’: any two respective values of manu and prod are said to be comparable, e.g., Apple Inc. and Apple, if their edit distance is ≤ 5.
matching operator ‘color ⇋ color’: e.g., red and cardinal are said to be matched as comparable colors, via users’ feedback
often incrementally recognized in a pay-as-you-go style
A query of (manu : Apple)
searches for values similar to Apple in both manu and prod
e.g., (manu : Apple Inc.) in t1 and (prod : Apple) in t2
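The metric operator above can be sketched in a few lines. This is only an illustration: the function names `edit_distance` and `metric_comparable` are hypothetical, and plain Levenshtein distance is assumed as the edit distance.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def metric_comparable(u: str, v: str, threshold: int) -> bool:
    """Metric operator: true when the edit distance is within the threshold."""
    return edit_distance(u, v) <= threshold


# 'manu ≈≤5 prod' applied to the values from t1 and t2.
print(metric_comparable("Apple Inc.", "Apple", 5))
```

The two values differ by the five trailing characters " Inc.", so the pair is just within the threshold of 5.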
Data Dependencies
For wider applications
integrity constraints, schema design
optimizing query evaluation, capturing data inconsistency, removing data duplicates
Conventional data dependencies are not directly applicable to dataspaces
often defined on the equality function
functional dependencies (FDs), X → A
specify the constraint of equality between the values of two objects on the same attribute
e.g., manu → addr
cannot address the comparable correspondence, in (manu,prod) or (addr,post)
Comparable Function
Specify constraints on comparable attributes
θ(manu,prod) : [manu ≈≤5 manu, manu ≈≤5 prod, prod ≈≤5 prod]
Two objects are said to be comparable on (manu,prod) if at least one of these three comparison operators in θ(manu,prod) is satisfied by their values.
t1, t2 are comparable on (manu,prod), since the edit distance of (t1[manu], t2[prod]) is ≤ 5
t1, t3 are also comparable on (manu,prod), where (t1[manu], t3[manu]) satisfy ‘manu ≈≤5 manu’
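A minimal sketch of evaluating θ(manu,prod) on the example objects follows. The representation of a comparable function as a list of (attribute, attribute, predicate) triples, and the names `comparable` and `THETA_MANU_PROD`, are assumptions made for illustration; an operator is taken to apply only when both objects carry the attributes it names.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by memoized recursion (adequate for short values)."""
    if not a or not b:
        return len(a) + len(b)
    return min(
        edit_distance(a[1:], b) + 1,                   # delete a[0]
        edit_distance(a, b[1:]) + 1,                   # insert b[0]
        edit_distance(a[1:], b[1:]) + (a[0] != b[0]),  # substitute or match
    )


def within_5(u: str, v: str) -> bool:
    """The metric operator with threshold 5."""
    return edit_distance(u, v) <= 5


# theta(manu,prod) as (attribute, attribute, predicate) triples.
THETA_MANU_PROD = [
    ("manu", "manu", within_5),
    ("manu", "prod", within_5),
    ("prod", "prod", within_5),
]


def comparable(x: dict, y: dict, theta) -> bool:
    """True if at least one operator in theta holds between values of x and y."""
    for left, right, holds in theta:
        # Check both orientations: x[left] vs y[right] and x[right] vs y[left].
        for a, b in ((left, right), (right, left)):
            if a in x and b in y and holds(x[a], y[b]):
                return True
    return False


# Only the attributes relevant to theta(manu,prod) are kept from the example.
t1 = {"name": "iPod", "manu": "Apple Inc."}
t2 = {"name": "iPod", "prod": "Apple"}
t3 = {"name": "iPad", "manu": "Apple Inc."}

print(comparable(t1, t2, THETA_MANU_PROD))  # via 'manu ≈≤5 prod'
print(comparable(t1, t3, THETA_MANU_PROD))  # via 'manu ≈≤5 manu'
```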
Comparable Dependencies (CDs)
A general form of dependencies on comparable functions
ϕ1 : θ(manu,prod) → θ(addr,post)
if the manu or prod values of two products are comparable
then their corresponding addr or post values should also be comparable
where
θ(addr,post) : [addr ≈≤9 addr, addr ≈≤9 post, post ≈≤9 post]
is another comparable function
Application Example
Query optimization
consider an object t1 as the query
to query objects having values similar to (manu : Apple Inc.) and (addr : InfiniteLoop,CA) of t1
search in the manu, addr attributes specified in the query,
also search in the comparable attributes prod, post according to the comparable functions θ(manu,prod) and θ(addr,post)
according to ϕ1, rewrite the query by using (manu,prod) only, since objects comparable on (manu,prod) are already required to be comparable on (addr,post)
Related Work
Metric functional dependencies (MFDs)
X →δ A
equality operator in the left-hand-side
similarity operator in the right-hand-side
for violation detection
e.g., manu →2 addr
Matching dependencies (MDs)
[X ≈ X] → [A ⇋ A]
similarity operator in the left-hand-side
matching operator in the right-hand-side
for record matching
e.g., [addr ≈ addr] → [tel ⇋ tel]
Outline
Introduction
Definition
Validation
Discovery
Experiment
Conclusion
Comparison Operator
We consider a general form of comparison operators, which includes the previous operators.
Let Ai ↔ij Aj denote a comparison operator between two attributes Ai, Aj in a dataspace S
equality operator Ai = Aj in functional dependencies (FDs)
metric operator Ai ≈λ Aj in metric functional dependencies (MFDs)
matching operator Ai ⇋ Aj in matching dependencies (MDs)
The comparison operator evaluates to true if two values satisfy the corresponding constraint.
Syntax
A general comparable function
θ(Ai,Aj) : [Ai ↔ii Ai, Ai ↔ij Aj, Aj ↔jj Aj]
specifies a comparable constraint on two values from attributes Ai or Aj, according to their corresponding comparison operators.
A comparable dependency (CD) ϕ with general comparable functions over a dataspace S is in the form of
ϕ : ⋀ θ(Ai,Aj) → θ(B1,B2)
If two objects have comparable values on Ai or Aj, for every comparable function θ(Ai,Aj) on the left-hand side, then they must have comparable values on B1 or B2.
Example
Consider
ϕ4 : θ(manu,prod) → θ(tel,phn)
where θ(tel,phn) is [tel = tel, tel = phn, phn = phn]
we have (t1, t3) ≍ LHS(ϕ4), and the pair also agrees on the right-hand side, (t1, t3) ≍ RHS(ϕ4),
denoted by (t1, t3) ⊨ ϕ4.
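The check of ϕ4 on each pair of example objects can be sketched as follows. The data layout and the helper names (`comparable`, `satisfies`) are hypothetical, Levenshtein distance is assumed for the metric operator, and an operator is assumed to apply only when both objects have the attributes it names.

```python
from functools import lru_cache
from itertools import combinations


@lru_cache(maxsize=None)
def edit_distance(a, b):
    """Levenshtein distance by memoized recursion (fine for short strings)."""
    if not a or not b:
        return len(a) + len(b)
    return min(edit_distance(a[1:], b) + 1,
               edit_distance(a, b[1:]) + 1,
               edit_distance(a[1:], b[1:]) + (a[0] != b[0]))


def close(u, v):
    return edit_distance(u, v) <= 5


def equal(u, v):
    return u == v


def comparable(x, y, theta):
    """True if some (attr, attr, predicate) operator in theta holds for x and y."""
    return any(a in x and b in y and holds(x[a], y[b])
               for left, right, holds in theta
               for a, b in ((left, right), (right, left)))


def satisfies(x, y, lhs, rhs):
    """A pair violates the CD only if it is comparable on every LHS function
    but not comparable on the RHS function."""
    return not all(comparable(x, y, th) for th in lhs) or comparable(x, y, rhs)


theta_manu_prod = [("manu", "manu", close), ("manu", "prod", close), ("prod", "prod", close)]
theta_tel_phn = [("tel", "tel", equal), ("tel", "phn", equal), ("phn", "phn", equal)]

# Only the attributes used by phi4 are kept from the example objects.
objects = {
    "t1": {"manu": "Apple Inc.", "tel": "567"},
    "t2": {"prod": "Apple", "tel": "123"},
    "t3": {"manu": "Apple Inc.", "phn": "567"},
}

results = {(p, q): satisfies(objects[p], objects[q], [theta_manu_prod], theta_tel_phn)
           for p, q in combinations(objects, 2)}
print(results)
```

Under these assumptions only (t1, t3) satisfies ϕ4; both pairs that involve t2 are comparable on (manu,prod) but not on (tel,phn).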
Approximate Dependencies
Due to the extremely high heterogeneity, data dependencies might not exactly hold in a given dataspace.
ϕ4 : θ(manu,prod) → θ(tel,phn),
e.g., (t1, t2) ≍ LHS(ϕ4) but (t1, t2) ≭ RHS(ϕ4)
i.e., (t1, t2) ⊭ ϕ4
Measure
To evaluate to what extent a dependency “almost” holds in a data instance
Error measure
g3(ϕ,S) = (|S| − max{|T| | T ⊆ S, T ⊨ ϕ}) / |S|,
the minimum proportion of objects that have to be removed from the dataspace S for a dependency ϕ to hold.
Confidence measure
conf(ϕ,S) = max{|T| | T ⊆ S, T ⊨ ϕ} / |S|.
the maximum proportion of objects retained after removing a minimum set of violating objects with respect to ϕ.
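Written out in LaTeX, the two measures are built on the same maximum satisfying subset, so they are complementary; the last identity follows directly from the two definitions.

```latex
\[
g_3(\varphi, S) = \frac{|S| - \max\{\,|T| : T \subseteq S,\ T \models \varphi\,\}}{|S|},
\qquad
\mathrm{conf}(\varphi, S) = \frac{\max\{\,|T| : T \subseteq S,\ T \models \varphi\,\}}{|S|},
\]
\[
\mathrm{conf}(\varphi, S) = 1 - g_3(\varphi, S).
\]
```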
Example
ϕ4 : θ(manu,prod) → θ(tel,phn),
Error measure
{t2} is a minimum violation set w.r.t. ϕ4
such that all the remaining objects {t1, t3} satisfy ϕ4
g3 = 1/3
Confidence measure
{t1, t3} is a maximum keeping set w.r.t. ϕ4
conf = 2/3
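For a dataspace this small, both measures can be computed exactly by trying every subset. The sketch below is illustrative only: `exact_measures` is a hypothetical name, the objects t1, t2, t3 are indexed 0, 1, 2, and the violating pairs of ϕ4 are supplied by hand as (t1, t2) and (t2, t3), which is how the example objects evaluate when an operator applies only to attributes both objects have.

```python
from fractions import Fraction
from itertools import combinations


def exact_measures(n, violations):
    """Return (g3, conf) for n >= 1 objects given the violating index pairs.

    Subsets are tried from largest to smallest; the first one that contains
    no violating pair is a maximum keeping set. Exponential in n, so this is
    only usable on tiny inputs.
    """
    if n < 1:
        raise ValueError("need at least one object")
    for size in range(n, -1, -1):
        for kept in combinations(range(n), size):
            kept_set = set(kept)
            if not any(i in kept_set and j in kept_set for i, j in violations):
                return Fraction(n - size, n), Fraction(size, n)


# t1 -> 0, t2 -> 1, t3 -> 2; violating pairs of phi4 entered by hand.
violations = {(0, 1), (1, 2)}
g3, conf = exact_measures(3, violations)
print(g3, conf)  # 1/3 2/3
```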
Validation Problem
Unfortunately, computation of error or confidence is generally hard
Given
a dataspace S
a dependency ϕ
a measure requirement η
the validation problem is to decide whether or not the measure of ϕ over S satisfies η.
E.g., to determine whether g3(ϕ,S) ≤ 0.2 or conf(ϕ,S) ≥ 0.8.
Theorem
The error and confidence validation problems are NP-complete.
The Hardness
Transitivity cannot be assumed, i.e., from (t1, t2) ≍ θ(Ai,Ai) and (t2, t3) ≍ θ(Ai,Ai) it does not necessarily follow that (t1, t3) ≍ θ(Ai,Ai).
t1 :{(A1 : abc), . . . };
t2 :{(A1 : abcd), . . . };
t3 :{(A1 : abcde), . . . }.
E.g., θ(A1,A1) : [A1 ≈≤1 A1] with edit distance as the metric d
d(t1[A1], t2[A1]) = 1 ≤ 1
d(t2[A1], t3[A1]) = 1 ≤ 1,
but d(t1[A1], t3[A1]) = 2 > 1, that is, (t1, t3) ≭ θ(A1,A1).
Efficient validation based on partitioning objects into disjoint groups, as used for equality-based dependencies, therefore cannot be applied to comparable functions.
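The three values above can be checked directly. This is a small sketch assuming Levenshtein distance as the metric d; the helper names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]


def comparable(u: str, v: str) -> bool:
    """The operator 'A1 ≈≤1 A1'."""
    return edit_distance(u, v) <= 1


v1, v2, v3 = "abc", "abcd", "abcde"
print(comparable(v1, v2), comparable(v2, v3), comparable(v1, v3))  # True True False
```

Because the relation is not transitive, it does not partition the objects into equivalence classes: t2 would have to share a group with both t1 and t3, while t1 and t3 may not share one.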
Approximation Computation
Compute an approximate error/confidence measure of ϕ over S
the approximate measure has a relative performance guarantee compared with the exact measure,
e.g., ĝ/g ≤ ρ
where ĝ is an approximation of the exact error measure g and ρ is the approximation ratio
Greedy Algorithm
greedily count both objects whenever a violating pair is found
the complexity is O(n²)
The error approximation
outputs an estimate ĝ with a bound g ≤ ĝ ≤ 2g compared with the exact error measure g
Theorem
The confidence has no constant-factor approximation unless P = NP
confidence is NP-hard to approximate within a constant factor
g3 error and confidence are not equivalent in an approximation-preserving way
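One way to read the greedy step is as follows: scan all pairs once and, whenever a pair of still-kept objects violates the dependency, discard both. This is a sketch of that reading, not necessarily the authors' exact procedure; `greedy_error` and the `violates` predicate are hypothetical names.

```python
from itertools import combinations


def greedy_error(objects, violates):
    """Estimate g3 by discarding both objects of each violating pair found.

    Every pair is examined at most once (O(n^2) predicate calls). Pairs that
    touch an already discarded object are skipped, so the discarded pairs are
    disjoint and the objects that remain contain no violation.
    """
    if not objects:
        return 0.0
    removed = set()
    for i, j in combinations(range(len(objects)), 2):
        if i in removed or j in removed:
            continue
        if violates(objects[i], objects[j]):
            removed.update((i, j))
    return len(removed) / len(objects)


# Violating pairs of phi4 on the running example, entered by hand.
bad = {frozenset(("t1", "t2")), frozenset(("t2", "t3"))}
estimate = greedy_error(["t1", "t2", "t3"], lambda a, b: frozenset((a, b)) in bad)
print(estimate)
```

In this sketch the discarded pairs are disjoint and any exact solution must remove at least one object from each of them, so the estimate is at most twice the exact error; since the kept objects are violation-free, it is also at least the exact error. On the running example the estimate is 2/3 against an exact g3 of 1/3.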
Randomized Algorithm
Greedy algorithm still has to consider all the objects in a dataspace.
Randomized algorithm evaluates just a small subset of objects
randomly draw m samples
estimate the error/confidence measure by using the violations to these m samples
the estimated measure is still guaranteed by a certain approximation bound with high probability
e.g., Pr[ĝ ≤ ρg + ε] ≥ δ
ε is an additive error
δ is a probability guarantee
m is determined by ε and δ
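The slide does not spell out the sampling procedure or how m follows from ε and δ, so the sketch below is only one plausible reading: draw m objects uniformly without replacement and run a greedy pass on the sample alone. The name `sampled_error`, the `seed` parameter, and the choice of treating m as a plain input are all assumptions, and the code makes no claim about the probabilistic bound.

```python
import random
from itertools import combinations


def sampled_error(objects, violates, m, seed=None):
    """Estimate the error from a random sample of at most m objects.

    The sample is processed greedily: both objects of each violating pair
    among still-kept sample objects are discarded, and the discarded fraction
    of the sample is returned.
    """
    rng = random.Random(seed)
    pool = list(objects)
    sample = rng.sample(pool, min(m, len(pool)))
    if not sample:
        return 0.0
    removed = set()
    for i, j in combinations(range(len(sample)), 2):
        if i in removed or j in removed:
            continue
        if violates(sample[i], sample[j]):
            removed.update((i, j))
    return len(removed) / len(sample)


# Toy data: two objects violate when they share a key but differ in value.
data = [("k%d" % (i % 50), i % 3) for i in range(1000)]
clash = lambda a, b: a[0] == b[0] and a[1] != b[1]
print(sampled_error(data, clash, m=100, seed=0))
```

Only m(m − 1)/2 pairs are examined instead of all pairs of the dataspace, which is where the time saving comes from.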
Discovery Problem
The strict dependency discovery problem
find a canonical cover of all dependencies that hold in data
a canonical cover can be exponential in the number of attributes
high dimensionality in dataspaces (attributes and comparable functions)
highly non-trivial (if not impossible) to discover a canonical cover of all dependencies
The k-length dependencies
contain k or fewer comparable functions
motivated by the concept of mining k-length itemsets in association rules
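To give a feel for how the length bound limits the search space, the sketch below enumerates candidate dependencies over a set of comparable functions. It assumes that k counts the functions on both sides together (at most k − 1 on the left plus one on the right); the slide does not state this, and the function name is hypothetical.

```python
from itertools import combinations


def candidate_dependencies(functions, k):
    """Yield (lhs, rhs) candidates using at most k comparable functions in total.

    Assumption: one function on the right-hand side and between 1 and k - 1
    distinct other functions on the left-hand side.
    """
    for rhs in functions:
        others = [f for f in functions if f != rhs]
        for size in range(1, k):
            for lhs in combinations(others, size):
                yield lhs, rhs


functions = ["θ(manu,prod)", "θ(addr,post)", "θ(tel,phn)", "θ(color,color)"]
for k in (2, 3, 4):
    print(k, sum(1 for _ in candidate_dependencies(functions, k)))
```

With four functions this yields 12, 24 and 28 candidates for k = 2, 3 and 4; each candidate would still have to be validated against the data.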
Pay-as-you-go Discovery
In previous work on dataspaces, comparable attributes are identified in a pay-as-you-go style.
The discovery of dependencies should be conducted in an incremental way as well.
given a set Σ of currently discovered dependencies
and a newly identified comparable function θ(Ai,Aj),
we can generate new dependencies w.r.t. θ(Ai,Aj) according to the augmentation property
Validation Evaluation
Experiments on real data sets
compute the exact/approximate measures of dependencies
observe the corresponding performance
exact computation does not scale well
approximate computation keeps the time cost significantly lower and scales well
[Figure (b) time performance: Time Cost (s) vs. # Objects (300 to 600); series: Greedy, Randomized, Exact]
[Figure: Time Cost (s) vs. # Objects (4k to 10k); series: Greedy, Randomized]
Discovery Evaluation
Illustrate the incremental discovery of dependencies as the number of functions increases.
y-axis is scaled logarithmically
time cost increases heavily with the number of functions
the intrinsic hardness in discovering dependencies with respect to attributes (and the corresponding comparable functions)
the size k of functions also strongly affects the discovery performance
[Figure (a) incremental performance: Time Cost (s) vs. # Functions (10 to 70); series: k=2, k=3, k=4]
Discovery Evaluation
The discovery algorithm scales well with a large number of objects
greedy approximation is adopted for validation
verify the efficiency of the approximate computation proposed for validating dependencies
[Figure: Time Cost (s) vs. # Objects (4k to 10k); series: k=2, k=3, k=4]
Conclusion
This is the first work to adapt dependencies to dataspaces with the consideration of comparable attribute values
it is already hard to tell whether a dependency almost holds in the data
confidence validation is also proved hard to approximate to within any constant factor
propose greedy and randomized approaches for approximately solving the validation problem
study the pay-as-you-go discovery of dependencies from dataspaces