tractability in probabilistic databases · this talk: query evaluation • i will describe recent...

Tractability in Probabilistic Databases

Dan Suciu University of Washington

ICDT 2011 1 Dan Suciu: Tractability in Probabilistic Databases

Slides available from h0p://www.cs.washington.edu/homes/suciu/

Motivation Lots of uncertain data: •  External data in business intelligence: blogs,

emails, twitter, social network data

•  Data generated by large information extraction systems

•  Data integration in the presence of uncertainty

•  Specialized data (RFID, scientific data)

ICDT 2011 Dan Suciu: Tractability in Probabilistic Databases 2

Probabilistic Databases

•  In traditional databases: data is certain •  In new applications: data is uncertain

•  Probabilistic Databases: model uncertainty using probabilities

•  Theoretical foundations: logic+probability


Landscape of PDB Research Incomplete list: •  Representation

–  [Widom’05, Antova’07, Sen’07, Benjelloun’08] •  Query evaluation

–  [Dalvi’04, Antova’08,Olteanu’09, Koch’08,Kimelfeld’09,Deutsch’10] •  Data integration, data exchange

–  [Dong’07,Agrawal’10,Fagin’10] •  Applications

–  [Nierman’02,Keulen’05, Gupta’06,Hassanzadeh’09,Stoyanovich’10] •  Ranking

–  [Re’07,Soliman’07,Zhang’08,Li’09] •  Indexing

–  [Letchner’09,Kanagal’09, Kimura’10] •  Pushing ML in the db engine

–  [Wang’08,Wick’10]


Healthy research …

Landscape of PDB Systems

•  Some research prototypes: –  MaybeMS (Oxford&Cornell), Trio (Stanford), MystiQ

(UW), ProbDB (Maryland), Orion (Purdue) –  They do not scale to large datasets

•  Commercial systems: NONE ! –  Why ? Because to date we do not know how to build

scalable probabilistic database systems

Main challenge: Query Evaluation

This Talk: Query Evaluation

•  I will describe recent progress: tractable queries

•  I will discuss where we are stuck: intractable queries

This talk is based on:

•  Dalvi, S. Efficient query evaluation on probabilistic databases. VLDB’04

•  Dalvi,S. The Dichotomy of Conjunctive Queries on Probabilistic Structures, PODS’07

•  Dalvi, Schnaitter, S.: Computing query probability with incidence algebras. PODS’10

•  Gatterbauer, Jha, S., Dissociation and Propagation for Efficient Query Evaluation over Probabilistic Databases. MUD'2010

Adver&sement: Koch, Olteanu, Re, S., Probabilis)c Databases, Morgan & Claypool, planned for June, 2011

Outline

•  Problem Statement

•  Intractable queries

•  Tractable queries

•  Summary and Open Problems

7 Dan Suciu: Tractability in Probabilistic Databases ICDT 2011

Probabilistic Database Every tuple t in D = random variable (present/absent) Possible worlds semantics: D = (W, p), where p:Wà[0,1] s.t. ΣW∈W p(W) = 1

A B a1 b1 .7 a2 b2 .4 a2 b3 .2 a2 b4 .2 a3 b5 .5

S(A,B)

A a1 .4 a2 .2 a3 .5

R(A) 8 tuples à 28 possible worlds W = {W1, W2, …, W256}

In this talk = all tuples are independent

Need correlaPons ? Normalize database !

Queries = UCQ

9

Q := R(x1,…,xk) | ∃x.Q | Q1 ∧ Q2 | Q1 ∨ Q2

R(x),S(x,y) ∃x.∃y. R(x)∧S(x,y) ≣

Traditional database: D ⊨ Q (true/false) Probabilistic database: P(Q)=ΣW∈W: W ⊨ Q p(W) (in [0,1])

In this talk, I consider only Boolean queries:

Unions of ConjuncPve Queries:

Lineage

Query Q + Database D = Lineage FQ •  Every tuple t in D = Boolean variable Xt

•  The lineage FQ = a propositional formula – Definition: next slide –  Intuitively: it says when Q is true on D

•  Terminology alert: what we call here lineage corresponds to the PosBool semiring [Tannen, ICDT’10]


Definition of Lineage FQ

A B a1 b1 Y1

a1 b2 Y2

a2 b3 Y3

a2 b4 Y4

a2 b5 Y5

S(A,B) A a1 X1

a2 X2

a3 X3

R(A)

Q = R(x),S(x,y)

FQ = X1 Y1 ∨ X1 Y2 ∨ X2 Y3 ∨ X2 Y4 ∨ X2 Y5

Def. Ft = Xt (t = ground tuple t) F∃x.Q = FQ[a1/x] ∨ . . . ∨ FQ[an/x] (active dom.={a1,…an}) FQ1∧Q2 = FQ1 ∧ FQ2 FQ1∨Q2 = FQ1 ∨ FQ2

Example

For any fixed Q, FQ has a DNF of size polynomial

in size(D)

Problem Statement


Problem. Fix Q. Given D, compute P(Q); Equivalently: compute P(FQ).

This is a form of probabilistic inference (next)

Dichotomy Theorem. For any, fixed UCQ Q: •  either P(Q) is in PTIME -- tractable •  or P(Q) is hard for FP#P -- intractable

This talk I discuss tractable and intractable queries

Data complexity

Discussion: Probabilistic Inference


Graphical models: X Y

Z U

P(X,Y,Z,U) = P(X) P(Y) P(Z|X,Y) P(U|Y)

Bayesian Network X Y

Z U

Markov Network

Φ(X,Y,Z,U) = 1/Z * Φ1(X,Y,Z) Φ2(Y,U)

Propositional formula = very special case

F = X Y ∨ Y Z ∨ X Z Y ZX

∧ ∧∧

∨

Y ZX

∧ ∧∧

∨

Inference: exponential in

tree-width

Large tree-width

[Pearl’88,Koller&Friedman’09,Darwiche’09]

Discussion: Probabilistic Inference


Graphical Models Prob DBs

Model Complex P(X) = 1/Z Πi Φi(Xi)

Simple Independent tuples

Query Simple Q = P(X,Y | Z,U,V)

Complex Q =∃x.∃y.R(x)∧S(x,y)

Network Static Bayesian/Markov Network

Dynamic FQ = Q + D

Complexity f(tree-width, network) f(Q, D)

Inference in GM’s is exponential in tw; ill suited for PDB Need new, query-based approach: data complexity

Outline






Intractable Queries

16

Theorem. Data complexity of Hk is hard for FP#P

H1= R(x0),S(x0,y0) ∨ S(x1,y1),T(y1)

H2= R(x0),S1(x0,y0) ∨ S1(x1,y1),S2(x1,y1) ∨ S2(x2,y2),T(y2)

H3=R(x0),S1(x0,y0) ∨ S1(x1,y1),S2(x1,y1) ∨ S2(x2,y2),S3(x2,y2)∨S3(x3,y3),T(y3)

H0= R(x),S(x,y),T(y)

. . .

Will give the proof for H0 and H1 next

[Dalvi&S.’04]

[Dalvi&S.’07, Dalvi,Schnai0er,S.’10]

Model Counting

Theorem The problem: “given a PP2DNF F, count the number of satisfying assignments #F” is #P-hard.

Definition A propositional formula F is a Positive, Partite, 2DNF if F = ∨i,j Xi Yj

Example: F = X1 Y1 ∨ X1 Y2 ∨ X2 Y3 ∨ X2 Y4 ∨ X2 Y5

[Provan&Ball’83]

Holds also for PP2CNF

Proof: H0 is hard for FP#P Reduction from #PP2DNF •  Given F = Xi1 Yj1 ∨ Xi2 Yj2 ∨ …

construct D =

•  Assignment θ ↔ possible world RW,TW

•  θ satisfies F ↔ (RW,S, TW) ⊨ H0 •  Therefore P(H0) = #F / 2n,

where n = number of variables

H0= R(x),S(x,y),T(y)

[Dalvi&S.’04]

A B

Xi1 Yj1 1

XI2 Yj2 1

Xi3 Yj3 1

…

A

X1 ½

X2 ½

…

B

Y1 ½

Y2 ½ …

S R T

We can compute #F using an oracle for P(H0); hence H0 is hard for FP#P

Proof: H1 is hard for FP#P By reduction from #PP2CNF: •  F = (Xi1∨Yj1)∧(Xi2∨Yj2)∧… ∧(Xim∨Yjm)

Construct D =

•  Assignment θ ↔ possible world RW,TW (no S!) •  P(¬H1 | θ) = z#cnt(θ), where #cnt(θ) = |{k | θ(Xik∨ Yjk) = true}| •  P(¬H1) = ΣθP(¬H1 |θ)P(θ) = 1/2n Σk=0,m ak zk = f(z)

where ak = |{θ | #cnt(θ) = k}|

•  Compute f(z) at m+1 distinct values z, using oracle for P(H1). Obtain all coefficients a0,a1,…,am

•  Return #F = am

A B

Xi1 Yj1 1-z

XI2 Yj2 1-z

Xi3 Yj3 1-z

…

A

X1 ½

X2 ½

…

B

Y1 ½

Y2 ½ …

S R T H1= R(x0),S(x0,y0) ∨ S(x1,y1),T(y1)

[Dalvi&S.’07]

We can compute #F using an oracle for P(H1); hence H1 is hard for FP#P

Discussion of Intractable Queries

•  H0 corresponds to a Restricted Boltzmann Machine

•  The hardness proof for Hk, k ≥ 2, and other queries, requires significant extensions

•  There are other intractable queries: will describe all of them shortly

Open problems: Is the model counting problem for an intractable query #P hard ? How do we evaluate/approximate them ?


[Salakhutdinov’10]

Outline



•  Tractable queries – Safe plans – An algorithm for UCQ queries



Overview

•  Traditional query processing is done with query plans using relational operators

•  Query processing on probabilistic databases is done with query plans using simple extensions of the relational operators


A B P a1 b1 q1 Y1

a1 b2 q2 Y2

a2 b3 q3 Y3

a2 b4 q4 Y4

a2 b5 q5 Y5

S(A,B)

A P a1 p1 X1

a2 p2 X2

a3 p3 X3

R(A)

Q = ∃x.∃y. R(x),S(x,y)

P(Q) = 1-(1-q1)*(1-q2) p1*[ ]

1-(1-q3)*(1-q4)*(1-q5) p2*[ ]

1- {1- } *

{1- }

FQ = X1Y1 ∨ X1Y2 ∨ X2Y3 ∨ X2Y4 ∨ X2Y5

“Read-once” *

= X1(Y1 ∨ Y2) ∨ X2(Y3 ∨ Y4 ∨ Y5)

Example

* See Sudeepa Roy’s talk on Tuesday

A B P

a1 b1 q1 Y1

a1 b2 q2 Y2

a2 b3 q3 Y3

a2 b4 q4 Y4

a2 b5 q5 Y5

S(A,B)

A P

a1 p1 X1

a2 p2 X2

a3 p3 X3

R(A)

⋈

A B P

a1 b1 p1*q1 X1 ∧ Y1

a1 b2 p1*q2 X1 ∧ Y2

a2 b3 p2*q3 X2 ∧ Y3

a2 b4 p2*q4 X2 ∧ Y4

a2 b5 p2*q5 X2 ∧ Y5

Independent join operator

T1(A,B)

A B P

a1 b1 q1 Y1

a1 b1 q2 Y2

a2 b2 q3 Y3

a2 b3 q4 Y4

a2 b2 q5 Y5

S(A,B)

A P

a1 1 - (1-p1)*(1-p2) Y1 ∨ Y2

a2 1 - (1-p3)*(1-p4)*(1-p5) Y3 ∨ Y4 ∨Y5

Independent projection operator

ΠA

T2(A)

Operators

… other operators for self-joins and for unions

i i

Q = ∃x.∃y. R(x),S(x,y)

q1

q2

q3

q4

q5

p1

p2

p3

⋈

p1 q1

p1 q2

p2 q3

p2 q4

p2 q5

ΠΦ

S(A,B) R(A)

1-(1-p1q1)(1-p1q2)(1-p2q3)(1-p2q4)(1-p2q5)

1-(1-q1)(1-q2)

1-(1-q4)(1-q5) (1-q6)

1-{1-p1[1-(1-q1)(1-q2)]}* {1-p2[1-(1-q4)(1-q5) (1-q6)]}

Wrong Right

T2(A) T1(A,B)

i

i

⋈

ΠΦ

S(A,B) R(A)

ΠA

i

i

i

[Dalvi&S.’04]

Discussion •  An extended operator manipulates

probabilities explicitly: join, projection, etc

•  A safe plan is a plan that computes the query probability correctly

•  Next: I will give an algorithm for computing all tractable UCQ queries. Each tractable query can be computed using a safe plan, using a small set of extended operators, but I will not discuss safe plans in the rest of the talk


Outline



•  Tractable queries – Safe plans – An algorithm for UCQ queries



CQ with Self-Joins

28

QJ = q1, q2 = R(x1),S(x1,y1), T(x2),S(x2,y2)

FJ = [X1(Y1 ∨ Y2) ∨ X2 (Y3 ∨ Y4 ∨ Y5)] ∧ [Z1(Y1 ∨ Y2) ∨ Z2 (Y3 ∨ Y4 ∨ Y5)] Not read-once

QU = q1 ∨ q2 = R(x1),S(x1,y1) ∨ T(x2),S(x2,y2)

FU = (X1∨Z1) (Y1∨Y2) ∨(X2∨Z2) (Y3∨Y4∨ Y5) Read-once

P(QJ)= P(q1, q2) = P(q1) + P(q2) - P(q1 ∨ q2)

PTIME !

[Dalvi,Schai0er,S.’10]

Discussion

•  In order to handle self-joins, we needed ∨

•  CQ = not a natural class to study

•  UCQ = the natural class to study

•  Will give the algorithm next as five rules


BUMMER !

Five Rules for Query Evaluation

Rule 1: Inclusion/Exclusion Formula P(Q1 ∧ Q2 ∧ Q3) = P(Q1) + P(Q2) + P(Q3) - P(Q1 ∨ Q2) ‒ P(Q1 ∨ Q3) ‒ P(Q2 ∨ Q3) + P(Q1 ∨ Q2 ∨ Q3)

Note1: this is the dual of the more popular: P(Q1 ∨Q2 ∨Q3) = . . .

Note2: this rule is not used for inference

on GM or on propositional formulae; new in PDBs


31

Rule 2: Independent Project If z is separator variable, P(∃z.Q) = 1 – (1 – P(Q[a1/z])) × (1 – P(Q[a2/z])) × … Where active domain = {a1, a2, a3, … an}

Definition. z is called separator variable if: •  z occurs in every atom A in Q (z is a root variable) •  If two atoms A1, A2 in Q are unifiable,

then z occurs on a common position in A1, A2.

Example: QU = R(x1),S(x1,y1) ∨ T(x2),S(x2,y2) = ∃z.(∃y1.R(z),S(z,y1) ∨∃y2.T(z),S(z,y2)) z is “separator variable”


32

If Q1, Q2 are independent (no common symbols)

Rule 3: Independent Join P(Q1∧Q2) = P(Q1) × P(Q2)

Rule 4: Independent Union P(Q1∨Q2) = 1-(1-P(Q1)) × (1-P(Q2))


33

Rule 5a: Ranking attribute-constant Given attribute R(…A…) and constant ‘a’, replace R in Q with R = R1 ∪ R2, where R1 = σA=‘a’(R), R2 = σA≠’a’(R)

Rule 5b: Ranking attribute-attribute Given two attributes R(…A…B…) replace R in Q with R = R1 ∪ R2 ∪ R3, where R1 = σA<B(R) , R2 = σA>B(R) , R3 = σA=B(R)

Example: q = R(x,y),R(y,x) rewrite to q = R1(x,y),R2(y,x) ∨ R3(z,z)

Note: ranking is applied before all other rules

Summary

•  Five rules: 1.  Inclusion/exclusion 2.  Existential quantifier elimination (separator !) 3.  Independent AND 4.  Independent OR 5.  Ranking

•  Each rule reduces P(Q) to a simpler P(Q’) –  If we succeed, then P(Q) in PTIME –  If we fail, then P(Q) is hard for FP#P


Specific to PDB, not used in general

probabilisPc inference

An Example

35

QV = R(x1),S(x1,y1) ∨ S(x2,y2),T(y2) ∨ R(x3),T(y3)

DNF = H1 (hard !)


An Example

36


DNF Disconnected query = H1 (hard !)


An Example

37


QV =[S(x2,y2),T(y2)∨ R(x3)] ∧ [R(x1),S(x1,y1)∨T(y3)]

DNF

CNF

Disconnected query = H1 (hard !)


An Example

38


QV =[S(x2,y2),T(y2)∨ R(x3)] ∧ [R(x1),S(x1,y1)∨T(y3)]

DNF

CNF

= R(x3) ∨ T(y3)

P(QV) = P(q1∧q2)= P(q1) + P(q2)-P(q1∨q2)

PTIME !

Disconnected query = H1 (hard !)

Inclusion/exclusion:

QV has a subquery H1 that is hard, yet it is in PTIME !


Another Example

QW = [R(x0),S1(x0,y0) ∨ S2(x2,y2),S3(x2,y2)] ∧ /* Q1 */ [R(x0),S1(x0,y0) ∨ S3(x3,y3),T(y3)] ∧ /* Q2 */ [S1(x1,y1),S2(x1,y1) ∨ S3(x3,y3),T(y3)] /* Q3 */

39


Another Example

P(QW) = P(Q1) + P(Q2) + P(Q3) + - P(Q1 ∨ Q2) - P(Q2 ∨ Q3) – P(Q1 ∨ Q3) + P(Q1 ∨ Q2 ∨ Q3)

Also = H3

40

= H3 (hard !)



Another Example

P(QW) = P(Q1) + P(Q2) + P(Q3) + - P(Q1 ∨ Q2) - P(Q2 ∨ Q3) – P(Q1 ∨ Q3) + P(Q1 ∨ Q2 ∨ Q3)

Also = H3

41

PTIME !


= H3 (hard !)


How do we need to detect cancellaPons in the inclusion/excusion rule ?

August Ferdinand Möbius 1790-1868

•  Möbius strip •  Möbius function µ in

number theory •  Generalized to lattices

[Stanley’97,Rota’09] •  And now to queries !

42

From Query Q to Lattice L(Q)

•  Write Q in CNF: Q = Q1 ∧ Q2 ∧ … ∧ Qm.

•  For s ⊆ [m], denote Qs = ∨i∈sQi

Def. The CNF lattice of Q is L(Q)=(L,≤) where: •  L = { Qs | s ⊆ [m] } (up to logical equivalence) •  Qs1 ≤ Qs2 if the logical implication Qs2 à Qs1 holds


Example

Q1 Q2 Q3

Q2∨Q3 Q1∨Q2

Q1∨Q2∨Q3 = Q1∨ Q3 44

Blue nodes are in PTIME, Red nodes are #P hard.

1̂ =max(L) 1̂



L(QW) =

The Möbius’ Function 1̂

-1 -1 -1

1 1

0

45

1

1̂ -1 -1 -1

2

1

Def. The Möbius function: µ( , ) = 1 µ(u, ) = - Σu < v ≤ µ(v, )

1̂ 1̂ 1̂ 1̂ 1̂

Möbius’ Inversion Formula: P(Q) = - Σ Qi < µ(Qi, ) P(Qi) 1̂ 1̂

[Stanley 97]

Dan Suciu: Tractability in Probabilistic Databases ICDT 2011

Rule 1 (revised) Inclusion/Exclusion à Mobius’ Inversion Formula

The Dichotomy

Dichotomy Theorem Fix a UCQ query Q. 1.  Algorithm terminates, then P(Q) is in PTIME 2.  Algorithm fails, then P(Q) is hard for FP#P

Note 1: dichotomy into PTIME/FP#P based on “syntax” where “syntax” includes the Möbius function !

Note 2: the query complexity is open. Note 3: we cannot avoid the Möbius function (next).


Representation Theorem

Proof •  Let u0, u1, …, uk be the “join irreducibles” of L •  Associate them to the k+1 components of

Hk = hk0 ∨ hk1 ∨ … ∨ hkk •  For every v ∈ L, define Qv = ∨{hki | ui ≰ v} •  Define Q = ∧ {Qv | v = co-atom in L}


THEOREM For every lattice L, there exists Q s.t. L(Q) ≅ L and: •  The query at (= min(L)) is hard for FP#P

•  All other queries are in PTIME 0̂

1̂

Q9

0 Q is in PTIME iff µ( , )=0 ! 1̂ 0̂


Example:

Summary of Tractable Queries Five simple rules •  Each rule has operator (not shown) à safe plan •  Open problem: engineering challenges of safe

plans

Other approaches to compute P(F) include: read-once formulae, OBDDs, FBDDs, d-DNNF’s •  For probabilistic databases, the Möbius Rule rules

them all (see Abhay Jha’s talk Tuesday)


Möbius Über Alles

49

UCQ PTIME

d-DNNF

FBDD

OBDD=TW=PW

RO QJ

QV

QW

Q9

H0 H1 H2

?

QU

Necessary and sufficient

characterization

Sufficient Characterization

(Necessary conjectured)

H3

Preview of Abhay Jha’s talk on Tuesday

Outline






Summary •  Why we care:

– Strong demand for managing uncertain data •  What is difficult:

– Computational complexity of Logic + Probabilities

•  What we have seen today: – Tractable queries: the five rules are simple –  Intractable queries: hardness proofs are difficult – Dichotomy into PTIME/FP#P based on syntax


Open Problems •  Correlations using views:

– Markov Logic à Markov DBs ? •  Model counting for H0 (and others) #P hard ? •  Efficient approximation of H0 (and others) ? •  Query complexity for checking hardness ? •  Data (and query) complexity for:

FO (¬ ∀), interpreted predicates (≠, < ) ? •  Characterize UCQ(FBDD), UCQ(d-DNNF) •  Prove UCQ(d-DNNF) ≠ UCQ(PTIME)


[Richardson&Domingos]

The 4x Grand Challenge Challenge: make probabilistic dbs run at most 4x slower than deterministic dbs, by using approximations Suggested approach: •  A rich class of tractable UCQ queries are known •  Given any Q, find all tractable Q’, s.t. P(Q) ≈ P(Q’):

–  Model theoretic (Robert Fink’s talk on Tuesday) –  Dissociation [Gatterbauer’10] http://LaPushDB.com

This is an engineering challenge: search space for Q’; opPmize and execute Q’

Thank You !

QuesPons ?

Slides available from h0p://www.cs.washington.edu/homes/suciu/

tractability in probabilistic databases · this talk: query evaluation • i will describe recent...

Documents