Special Cases:

A probabilistic database Dp (compactly) encodes a probability distribution over a finite set of deterministic database instances Di.

Query Semantics ("Marginal Probabilities"): Run query Q against each instance Di; for each answer tuple t, sum up the probabilities of all instances Di in which t exists.
Possible worlds:
D1: WorksAt = {(Jeff, Stanford), (Jeff, Princeton)}   P(D1) = 0.42
D2: WorksAt = {(Jeff, Stanford)}                      P(D2) = 0.18
D3: WorksAt = {(Jeff, Princeton)}                     P(D3) = 0.28
D4: WorksAt = {}                                      P(D4) = 0.12

Encoded by the probabilistic table:
WorksAt(Sub, Obj)    p
Jeff   Stanford      0.6
Jeff   Princeton     0.7
(I) Dp tuple-independent   (II) Dp block-independent
Note: (I) and (II) are not equivalent!
Probabilistic Database (block-independent: the two alternatives for Jeff are mutually exclusive)
WorksAt(Sub, Obj)    p
Jeff   Stanford      0.6
Jeff   Princeton     0.4
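The possible-worlds semantics above can be sketched in a few lines. This is a minimal illustration (table contents taken from the slide, function names made up): for a tuple-independent table, every subset of tuples is a world whose probability is the product of the per-tuple choices.

```python
from itertools import product

# Tuple-independent relation from the slide: each tuple is an
# independent Bernoulli event with its own marginal probability.
works_at = {("Jeff", "Stanford"): 0.6, ("Jeff", "Princeton"): 0.7}

def possible_worlds(table):
    """Yield every deterministic instance D_i with its probability."""
    tuples = list(table)
    for bits in product([True, False], repeat=len(tuples)):
        world = frozenset(t for t, keep in zip(tuples, bits) if keep)
        p = 1.0
        for t, keep in zip(tuples, bits):
            p *= table[t] if keep else 1.0 - table[t]
        yield world, p

worlds = dict(possible_worlds(works_at))
# D1 {both}: 0.42, D2 {Stanford}: 0.18, D3 {Princeton}: 0.28, D4 {}: 0.12

def marginal(t):
    """Marginal of an answer tuple: total mass of the worlds containing it."""
    return sum(p for w, p in worlds.items() if t in w)
```

Summing the worlds containing a tuple recovers exactly the marginal stored in the compact table, e.g. `marginal(("Jeff", "Stanford"))` gives 0.6.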
Soft Rules vs. Hard Rules
(Soft) Deduction Rules vs. (Hard) Consistency Constraints

People may live in more than one place:
livesIn(x,y) ← marriedTo(x,z) ∧ livesIn(z,y)   [0.8]
livesIn(x,y) ← hasChild(x,z) ∧ livesIn(z,y)    [0.5]

People are not born in different places/on different dates:
bornIn(x,y) ∧ bornIn(x,z) ⇒ y = z
bornOn(x,y) ∧ bornOn(x,z) ⇒ y = z

People are not married to more than one person (at the same time, in most countries?):
marriedTo(x,y,t1) ∧ marriedTo(x,z,t2) ∧ y ≠ z ⇒ disjoint(t1,t2)
Basic Types of Inference

MAP Inference
Find the most likely assignment to the query variables y under a given evidence x.
Compute: argmax_y P(y | x)   (NP-complete for MaxSAT)

Marginal/Success Probabilities
Probability that query y is true in a random world under a given evidence x.
Compute: Σ_y P(y | x)   (#P-complete already for conjunctive queries)

[Yahya, Theobald: RuleML'11; Dylla, Miliaraki, Theobald: ICDE'13]
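The contrast between the two tasks can be made concrete by brute force over a toy distribution (the variable names, marginals, and query below are made up for illustration): MAP returns one most probable satisfying world, while the marginal sums the mass of all satisfying worlds.

```python
from itertools import product

# Toy tuple-independent distribution over two Boolean variables.
p = {"A": 0.7, "B": 0.6}
query = lambda w: w["A"] or w["B"]   # hypothetical Boolean query

def worlds():
    for bits in product([True, False], repeat=len(p)):
        w = dict(zip(p, bits))
        pw = 1.0
        for v, val in w.items():
            pw *= p[v] if val else 1.0 - p[v]
        yield w, pw

# MAP inference: argmax over worlds satisfying the query (NP-hard in general).
map_world, map_p = max((x for x in worlds() if query(x[0])), key=lambda x: x[1])

# Marginal probability: sum over satisfying worlds (#P-hard in general).
marginal = sum(pw for w, pw in worlds() if query(w))
```

Here the MAP world is {A: True, B: True} with probability 0.42, while the marginal of the query is 0.88; both brute-force loops are exponential in the number of variables, which is exactly where the complexity results above bite.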
Deductive Grounding with Lineage (SLD Resolution in Datalog/Prolog)

Query: graduatedFrom(Surajit, y)

[Lineage DAG (∨/∧ nodes) over the base facts
  A = graduatedFrom(Surajit, Princeton) [0.7]
  B = graduatedFrom(Surajit, Stanford) [0.6]
  C = hasAdvisor(Surajit, Jeff) [0.8]
  D = worksAt(Jeff, Stanford) [0.9]
yielding the answer lineages
  Q1 = graduatedFrom(Surajit, Princeton): A ∧ ¬(B ∨ (C ∧ D))
  Q2 = graduatedFrom(Surajit, Stanford): ¬A ∧ (B ∨ (C ∧ D))]
Rules:
hasAdvisor(x,y) ∧ worksAt(y,z) ⇒ graduatedFrom(x,z)   [0.4]
graduatedFrom(x,y) ∧ graduatedFrom(x,z) ⇒ y = z

Base Facts:
graduatedFrom(Surajit, Princeton) [0.7]
graduatedFrom(Surajit, Stanford) [0.6]
graduatedFrom(David, Princeton) [0.9]
hasAdvisor(Surajit, Jeff) [0.8]
hasAdvisor(David, Jeff) [0.7]
worksAt(Jeff, Stanford) [0.9]
type(Princeton, University) [1.0]
type(Stanford, University) [1.0]
type(Jeff, Computer_Scientist) [1.0]
type(Surajit, Computer_Scientist) [1.0]
type(David, Computer_Scientist) [1.0]
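A minimal sketch of the grounding step: apply the soft rule hasAdvisor(x,y) ∧ worksAt(y,z) ⇒ graduatedFrom(x,z) to the base facts and record each answer's lineage as a set of alternative derivations (a DNF over base-fact identifiers). The fact identifiers A-D and the helper names are illustrative; the rule weight [0.4] is ignored here, as in the slides' confidence computation.

```python
from itertools import product

# Probabilistic base facts, keyed by an identifier: (predicate, subj, obj, p).
facts = {
    "A": ("graduatedFrom", "Surajit", "Princeton", 0.7),
    "B": ("graduatedFrom", "Surajit", "Stanford", 0.6),
    "C": ("hasAdvisor", "Surajit", "Jeff", 0.8),
    "D": ("worksAt", "Jeff", "Stanford", 0.9),
}

lineage = {}  # answer tuple -> list of derivations (each a set of fact ids)
for fid, (pred, s, o, _) in facts.items():
    if pred == "graduatedFrom":          # base facts derive themselves
        lineage.setdefault((pred, s, o), []).append({fid})
for (i, fi), (j, fj) in product(facts.items(), repeat=2):
    # rule body: hasAdvisor(x,y) joins worksAt(y,z) on y
    if fi[0] == "hasAdvisor" and fj[0] == "worksAt" and fi[2] == fj[1]:
        lineage.setdefault(("graduatedFrom", fi[1], fj[2]), []).append({i, j})

def dnf_prob(derivations):
    """P of a DNF whose disjuncts use pairwise disjoint, independent facts."""
    miss = 1.0
    for d in derivations:
        hit = 1.0
        for fid in d:
            hit *= facts[fid][3]
        miss *= 1.0 - hit
    return 1.0 - miss
```

For the Stanford answer this yields the two derivations {B} and {C, D}, i.e. the lineage B ∨ (C ∧ D), with probability 1 − (1 − 0.6)(1 − 0.72) = 0.888 as on the next slide.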
Lineage & Possible Worlds
1) Deductive Grounding: build the dependency graph of the query; trace the lineage of each individual query answer.
2) Lineage DAG (not in CNF), consisting of the grounded soft & hard rules and the probabilistic base facts.
3) Probabilistic Inference: compute the marginals:
P(Q): sum up the probabilities of all possible worlds that entail the query answer's lineage.
P(Q|H): additionally drop the "impossible worlds", i.e., those violating a hard rule H.
Query: graduatedFrom(Surajit, y)

[Lineage DAG as before, with A = graduatedFrom(Surajit, Princeton) [0.7], B = graduatedFrom(Surajit, Stanford) [0.6], C = hasAdvisor(Surajit, Jeff) [0.8], D = worksAt(Jeff, Stanford) [0.9], annotated with confidences:
  P(C ∧ D) = 0.8 × 0.9 = 0.72
  P(B ∨ (C ∧ D)) = 1 − (1 − 0.72) × (1 − 0.6) = 0.888
  P(Q1) = P(A ∧ ¬(B ∨ (C ∧ D))) = 0.7 × (1 − 0.888) = 0.0784
  P(Q2) = P(¬A ∧ (B ∨ (C ∧ D))) = (1 − 0.7) × 0.888 = 0.2664]

[Das Sarma, Theobald, Widom: ICDE'08; Dylla, Miliaraki, Theobald: ICDE'13]
Possible Worlds Semantics

Base facts: A: 0.7, B: 0.6, C: 0.8, D: 0.9
Q2 = ¬A ∧ (B ∨ (C ∧ D))
Hard rule H: ¬(A ∧ (B ∨ (C ∧ D)))

A B C D Q2   P(W)
1 1 1 1 0    0.7×0.6×0.8×0.9 = 0.3024
1 1 1 0 0    0.7×0.6×0.8×0.1 = 0.0336
1 1 0 1 0    0.7×0.6×0.2×0.9 = 0.0756
1 1 0 0 0    0.7×0.6×0.2×0.1 = 0.0084
1 0 1 1 0    0.7×0.4×0.8×0.9 = 0.2016
1 0 1 0 0    0.7×0.4×0.8×0.1 = 0.0224
1 0 0 1 0    0.7×0.4×0.2×0.9 = 0.0504
1 0 0 0 0    0.7×0.4×0.2×0.1 = 0.0056
0 1 1 1 1    0.3×0.6×0.8×0.9 = 0.1296
0 1 1 0 1    0.3×0.6×0.8×0.1 = 0.0144
0 1 0 1 1    0.3×0.6×0.2×0.9 = 0.0324
0 1 0 0 1    0.3×0.6×0.2×0.1 = 0.0036
0 0 1 1 1    0.3×0.4×0.8×0.9 = 0.0864
0 0 1 0 0    0.3×0.4×0.8×0.1 = 0.0096
0 0 0 1 0    0.3×0.4×0.2×0.9 = 0.0216
0 0 0 0 0    0.3×0.4×0.2×0.1 = 0.0024
                          Σ   = 1.0

P(Q1) = 0.0784    P(Q1|H) = 0.0784 / 0.412 = 0.1903
P(Q2) = 0.2664    P(Q2|H) = 0.2664 / 0.412 = 0.6466
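The unconditioned marginals in the table can be re-derived by explicit world enumeration, reading the lineages as Q1 = A ∧ ¬(B ∨ (C ∧ D)) and Q2 = ¬A ∧ (B ∨ (C ∧ D)). A small sketch (helper names made up):

```python
from itertools import product

# Independent base-fact marginals from the slide.
p = {"A": 0.7, "B": 0.6, "C": 0.8, "D": 0.9}
q1 = lambda w: w["A"] and not (w["B"] or (w["C"] and w["D"]))
q2 = lambda w: (not w["A"]) and (w["B"] or (w["C"] and w["D"]))

def prob(query):
    """Sum the probabilities of all 16 worlds in which the query holds."""
    total = 0.0
    for bits in product([True, False], repeat=4):
        w = dict(zip("ABCD", bits))
        pw = 1.0
        for v in "ABCD":
            pw *= p[v] if w[v] else 1.0 - p[v]
        if query(w):
            total += pw
    return total
# prob(q1) = 0.0784 and prob(q2) = 0.2664, matching the table
```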
Basic Complexity Issue

Theorem [Valiant 1979]: For a Boolean expression E, computing P(E) is #P-complete.

NP = class of problems of the form "is there a witness?" (SAT)
#P = class of problems of the form "how many witnesses?" (#SAT)

The decision problem for 2CNF is in PTIME. The counting problem for 2CNF is already #P-complete.

[Suciu & Dalvi: SIGMOD'05 tutorial on "Foundations of Probabilistic Answers to Queries"]
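The decision/counting gap is easy to exhibit with a brute-force model counter; the clause encoding and the example formula below are made up for illustration, and the loop is exponential in the number of variables, which is exactly what #P-hardness says cannot be avoided in general.

```python
from itertools import product

def count_models(n_vars, clauses):
    """Brute-force #SAT: clauses is a list of clauses,
    each a list of (var_index, polarity) literals."""
    count = 0
    for bits in product([False, True], repeat=n_vars):
        if all(any(bits[v] == pol for v, pol in cl) for cl in clauses):
            count += 1
    return count

# Example 2CNF: (x0 ∨ x1) ∧ (¬x0 ∨ x2)
cnf = [[(0, True), (1, True)], [(0, False), (2, True)]]
```

Deciding satisfiability of this 2CNF is trivial (polynomial-time in general), but the counting version, `count_models(3, cnf)`, already requires summing over the assignment space.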
Tuple-Independent Probabilistic Database

A probabilistic database Dp (compactly) encodes a probability distribution over a finite set of deterministic database instances Di.

WorksAt(Pers, Uni)    p
Jeff   Stanford       0.6
Jeff   Princeton      0.7

IsProfessor(Pers)     p
Jeff                  0.9

Located(Uni, State)   p
Stanford   CA         0.8
Princeton  CA         0.5

Query: Which professors work at a university that is located in CA?

Theorem: The query answering problem for the above join is #P-hard.
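On this tiny instance the answer probability for Jeff can still be computed directly, because the lineage Prof_J ∧ ((WA_Stanford ∧ Loc_Stanford) ∨ (WA_Princeton ∧ Loc_Princeton)) uses five independent tuples and its two disjuncts share no tuples. (The theorem's #P-hardness concerns the query class in general, where tuples are shared across answers and no such factorization exists.) A sketch, with variable names made up:

```python
# Tuple probabilities from the three tables above.
is_prof_jeff = 0.9
branches = [(0.6, 0.8),   # WorksAt(Jeff, Stanford) ∧ Located(Stanford, CA)
            (0.7, 0.5)]   # WorksAt(Jeff, Princeton) ∧ Located(Princeton, CA)

# P(disjunction) via independence: 1 - product of "branch fails".
miss = 1.0
for wa, loc in branches:
    miss *= 1.0 - wa * loc
p_jeff = is_prof_jeff * (1.0 - miss)
# p_jeff = 0.9 * (1 - 0.52 * 0.65) = 0.5958
```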
Probabilistic & Temporal Database

Sequenced Semantics & Snapshot Reducibility:
Built-in semantics: reduce temporal-relational operators to their non-temporal counterparts at each snapshot (i.e., time point) of the database.
Coalesce/split tuples with consecutive time intervals based on their lineages.
[Dignös, Gamper, Böhlen: SIGMOD'12]

Non-Sequenced Semantics:
Queries can freely manipulate timestamps just like regular attributes.
A single temporal operator ≤T supports all of Allen's 13 temporal relations.
Deduplicate tuples with overlapping time intervals based on their lineages.
[Dylla, Miliaraki, Theobald: PVLDB'13]

A temporal-probabilistic database DTp (compactly) encodes a probability distribution over a finite set of deterministic database instances Di at each time point of a finite time domain T.

BornIn(Sub, Obj)      T             p
DeNiro   Greenwich    [1943, 1944)  0.9
DeNiro   Tribeca      [1998, 1999)  0.6

Wedding(Sub, Obj)     T             p
DeNiro   Abbott       [1936, 1940)  0.3
DeNiro   Abbott       [1976, 1977)  0.7

Divorce(Sub, Obj)     T             p
DeNiro   Abbott       [1988, 1989)  0.8
Temporal Alignment & Deduplication Example

Non-Sequenced Semantics:
marriedTo(x,y)[tb1,Tmax) ← wedding(x,y)[tb1,te1) ∧ ¬divorce(x,y)[tb2,te2)
marriedTo(x,y)[tb1,te2) ← wedding(x,y)[tb1,te1) ∧ divorce(x,y)[tb2,te2) ∧ te1 ≤T tb2

[Timeline figure over T = [Tmin, Tmax) with boundaries 1936, 1976, 1988:
Base facts: f1 = wedding(DeNiro, Abbott), f2 = wedding(DeNiro, Abbott), f3 = divorce(DeNiro, Abbott)
Derived facts (one per rule grounding) carry the lineages f1 ∧ f3, f1 ∧ ¬f3, f2 ∧ f3, f2 ∧ ¬f3.
Deduplicated facts merge overlapping intervals by disjoining their lineages:
(f1 ∧ f3) ∨ (f1 ∧ ¬f3)
(f1 ∧ f3) ∨ (f1 ∧ ¬f3) ∨ (f2 ∧ f3) ∨ (f2 ∧ ¬f3)
(f1 ∧ f3) ∨ (f2 ∧ ¬f3)]
Inference in Probabilistic-Temporal Databases
[Wang, Yahya, Theobald: MUD'10; Dylla, Miliaraki, Theobald: PVLDB'13]

Example using the Allen predicate overlaps:
playsFor(Beckham, Real, T1) ∧ playsFor(Ronaldo, Real, T2) ∧ overlaps(T1, T2, T3) ⇒ teamMates(Beckham, Ronaldo, T3)

[Timeline figure:
Base facts:
  playsFor(Beckham, Real, T1): intervals ['03, '05) and ['05, '07) with confidences 0.6 and 0.4
  playsFor(Ronaldo, Real, T2): intervals over the boundaries '00, '02, '04, '05, '07 with confidences 0.1, 0.2, 0.2, 0.4
Derived fact:
  teamMates(Beckham, Ronaldo, T3): intervals over the boundaries '03, '04, '05, '07 with confidences 0.08, 0.12, 0.16, i.e. the per-interval products of the overlapping base intervals]
Inference in Probabilistic-Temporal Databases (cont.)

[Same timeline figure, extended with a third base fact playsFor(Zidane, Real, T3) and the derived facts teamMates(Beckham, Ronaldo, T4), teamMates(Beckham, Zidane, T5), teamMates(Ronaldo, Zidane, T6). The figure marks which derived facts are non-independent (those sharing base facts) and which are independent.]
Inference in Probabilistic-Temporal Databases (cont.)

[Same figure once more: because the derived teamMates facts share base facts and hence are non-independent, computing their confidences correctly requires lineage ("Need Lineage!").]

Closed and complete representation model (incl. lineage).
Temporal alignment is polynomial in the number of input intervals.
Confidence computation per interval remains #P-hard.
In general, this requires Monte Carlo approximations (Luby-Karp for DNF, MCMC-style sampling), decompositions, or top-k pruning.
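The Luby-Karp estimator mentioned above can be sketched as follows for a DNF lineage over independent base facts. The clause representation and the example formula are illustrative: sample a clause proportionally to its probability, sample a world conditioned on that clause being true, and count the sample only if the chosen clause is the first one the world satisfies (which corrects for worlds satisfying several clauses).

```python
import random

def karp_luby(clauses, p, n_samples, seed=0):
    """Estimate P(C1 ∨ ... ∨ Cm) for a DNF over independent Boolean facts.
    clauses: list of dicts {fact: required_value}; p: fact marginals."""
    rng = random.Random(seed)
    pc = []                      # P(C_i) for each conjunctive clause
    for c in clauses:
        q = 1.0
        for v, val in c.items():
            q *= p[v] if val else 1.0 - p[v]
        pc.append(q)
    U = sum(pc)                  # over-counted total mass
    hits = 0
    for _ in range(n_samples):
        # draw clause i with probability P(C_i)/U, then a world w | C_i true
        i = rng.choices(range(len(clauses)), weights=pc)[0]
        w = {}
        for v in p:
            w[v] = clauses[i][v] if v in clauses[i] else rng.random() < p[v]
        # count w only at the first clause it satisfies (no double counting)
        first = next(j for j, c in enumerate(clauses)
                     if all(w[v] == val for v, val in c.items()))
        hits += (first == i)
    return U * hits / n_samples

# Estimate P(B ∨ (C ∧ D)) with the marginals from the lineage example;
# the exact value is 1 - (1 - 0.6)(1 - 0.72) = 0.888.
est = karp_luby([{"B": True}, {"C": True, "D": True}],
                {"B": 0.6, "C": 0.8, "D": 0.9}, 50_000)
```

Unlike naive rejection sampling, the relative error of this estimator does not blow up for low-probability lineages, which is why it is the standard Monte Carlo choice for DNF confidence computation.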