1

Faster Query Answering in Probabilistic Databases using Read-Once Functions

Sudeepa Roy

Joint work with

Vittorio Perduca and Val Tannen

University of Pennsylvania

2

Probabilistic Databases

Possible worlds model
    Each possible world w is a standard database instance and has a probability P[w]
    Compact representation D based on independence assumptions

Query Semantics in Probabilistic Databases
    (wlog.) Boolean query q
    Traditional database: q(D) ∈ {true, false}
    Probabilistic database: P[q(D)] = ∑_{w : q(w) = true} P[w]

Goal: Efficiently evaluate P[q(D)]
    Data complexity; want time polynomial in n = |D|

3

Computation of P[q(D)]

Can we efficiently compute P[q(D)]?
    NO. In general it is #P-hard.

Dalvi-Suciu ’04, ff.: Positive queries can be partitioned into
    Safe queries: safe plans run in poly-time on all instances
    Unsafe queries: data complexity is #P-hard
        Includes very simple queries like R(x) S(x, y) T(y)

Given q as input, we can efficiently decide whether q is safe

BUT: For unsafe queries, probabilities on some instances can be efficiently computed

Our Approach: Take both q and D as input

Restrictions

Tuple-independent representation D: tuple t annotated by P[t]

D =

R   Tuple      Probability
    a1         0.3
    a2         0.4
    a3         0.6

S   Tuple      Probability
    a1 b1      0.1
    a2 b1      0.5
    a3 b2      0.2
    a3 b3      0.1

T   Tuple      Probability
    b1         0.7
    b2         0.8
    b3         0.4

w = a possible world: R = {a1}, S = {(a1, b1)}, T = {b1}

P[w] = 0.3 (1 – 0.4) (1 – 0.6) × 0.1 (1 – 0.5) (1 – 0.2) (1 – 0.1) × 0.7 (1 – 0.8) (1 – 0.4)

Conjunctive query without self-join (CQ-): q() := R(x), S(x, y), T(y) (This is the H0 query from Suciu’s keynote)
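The possible-worlds semantics can be checked directly on this small instance. The sketch below is only an illustration of the definition, not the method of the talk: it enumerates all 2^10 worlds, so it is exponential in |D|. The function names are made up for this example.

```python
from itertools import product

# Tuple-independent tables from the slide: tuple -> P[t].
R = {("a1",): 0.3, ("a2",): 0.4, ("a3",): 0.6}
S = {("a1", "b1"): 0.1, ("a2", "b1"): 0.5, ("a3", "b2"): 0.2, ("a3", "b3"): 0.1}
T = {("b1",): 0.7, ("b2",): 0.8, ("b3",): 0.4}

TUPLES = ([("R", t, p) for t, p in R.items()]
          + [("S", t, p) for t, p in S.items()]
          + [("T", t, p) for t, p in T.items()])


def world_probability(present):
    """P[w] for the world that keeps exactly the tuples flagged in `present`."""
    prob = 1.0
    for (_, _, p), keep in zip(TUPLES, present):
        prob *= p if keep else 1.0 - p
    return prob


def q_holds(present):
    """Evaluate q() := R(x), S(x, y), T(y) on one world."""
    world = {(name, t) for (name, t, _), keep in zip(TUPLES, present) if keep}
    return any(
        ("R", (x,)) in world and ("T", (y,)) in world
        for name, t in world if name == "S"
        for x, y in [t]
    )


def query_probability():
    """Sum P[w] over every world in which q is true (exponential in |D|)."""
    return sum(
        world_probability(present)
        for present in product([False, True], repeat=len(TUPLES))
        if q_holds(present)
    )


# The world from the slide: only R(a1), S(a1, b1), T(b1) are present.
chosen = {("R", ("a1",)), ("S", ("a1", "b1")), ("T", ("b1",))}
slide_world = [(name, t) in chosen for name, t, _ in TUPLES]

print(world_probability(slide_world))  # P[w] for the slide's world
print(query_probability())             # P[q(D)] by brute force
```

Brute force like this is only usable as a reference value on toy data; the rest of the talk is about avoiding it.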

5

Query Answering in Two Steps: Example

q() := R(x), S(x, y), T(y)

D, with event variables for tuples:

R   Tuple      Event variable   Probability
    a1         w1               0.3
    a2         w2               0.4
    a3         w3               0.6

S   Tuple      Event variable   Probability
    a1 b1      v1               0.1
    a2 b1      v2               0.5
    a3 b2      v3               0.2
    a3 b3      v4               0.1

T   Tuple      Event variable   Probability
    b1         u1               0.7
    b2         u2               0.8
    b3         u3               0.4

Step 1 (EASY): Event expression for q(D), or “lineage”
    E = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3
    The “form” of the expression depends on the query plan; here π()((R ⋈ S) ⋈ T)

Step 2 (HARD): Compute P[q(D)] = P[E] given Pr[w1] = 0.3, Pr[w2] = 0.4, …

This work: take advantage of Read-Once expressions
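Step 1 can be illustrated with a few lines of code. This is a minimal sketch, assuming the tables are stored as dictionaries from tuples to event-variable names (a representation chosen here for illustration); it joins R, S and T and emits one conjunct per joined row.

```python
# Tables with event variables, as on the slide: tuple -> event variable.
R = {("a1",): "w1", ("a2",): "w2", ("a3",): "w3"}
S = {("a1", "b1"): "v1", ("a2", "b1"): "v2",
     ("a3", "b2"): "v3", ("a3", "b3"): "v4"}
T = {("b1",): "u1", ("b2",): "u2", ("b3",): "u3"}


def lineage_dnf(R, S, T):
    """Lineage of q() := R(x), S(x, y), T(y) as a list of conjuncts.

    Each conjunct is a tuple of event variables, one per joined tuple;
    the list as a whole is read as a disjunction (a DNF).
    """
    conjuncts = []
    for (x, y), v in S.items():
        w = R.get((x,))   # matching R tuple on x, if any
        u = T.get((y,))   # matching T tuple on y, if any
        if w is not None and u is not None:
            conjuncts.append((w, v, u))
    return conjuncts


E = lineage_dnf(R, S, T)
print(" + ".join(" ".join(c) for c in E))
# prints: w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3
```

The output is the DNF shown on the slide. A different join order would give the same Boolean function in a different syntactic form.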

6

Read-Once Boolean Expressions

Expression in Read-once Form: Every variable occurs exactly once
    e.g. ((x+y)z + w)(u+v)
    Linear time probability computation (subexpressions share no variables, so they are independent):
        P(x y) = P(x) P(y)
        P(x + y) = 1 – (1 – P(x)) (1 – P(y))

Read-once Expression: Has an equivalent read-once form. e.g.
    xzu + xzv + yzu + yzv + wu + wv [in DNF, as large as O(n^|q|)]
    xzu + xzv + (yz + w)(u+v) [not in DNF, can be much smaller]

Non-read-once Expressions: No read-once form. e.g. xy + yz + zx, xy + yz + zw
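The two probability rules above give a one-pass evaluator over the expression tree. The sketch below is an assumption-laden illustration: the nested-tuple encoding and the helper names `V`, `AND`, `OR` are invented here, and the probabilities (0.5 for every variable) are hypothetical.

```python
# A read-once form as a nested tuple:
#   ("var", name), ("and", e1, e2, ...), ("or", e1, e2, ...)
def V(name):
    return ("var", name)


def AND(*parts):
    return ("and",) + parts


def OR(*parts):
    return ("or",) + parts


def probability(expr, p):
    """Visit every node once; valid only because sibling subtrees
    share no variables and the variables are independent."""
    kind = expr[0]
    if kind == "var":
        return p[expr[1]]
    parts = [probability(e, p) for e in expr[1:]]
    result = 1.0
    if kind == "and":
        for value in parts:       # P(x y) = P(x) P(y)
            result *= value
        return result
    for value in parts:           # P(x + y) = 1 - (1 - P(x)) (1 - P(y))
        result *= 1.0 - value
    return 1.0 - result


# The example from the slide: ((x + y) z + w) (u + v)
EXAMPLE = AND(OR(AND(OR(V("x"), V("y")), V("z")), V("w")), OR(V("u"), V("v")))
P = {name: 0.5 for name in "xyzwuv"}  # hypothetical probabilities
print(probability(EXAMPLE, P))  # 0.515625
```

Applying the same function to an expression that repeats a variable would silently give a wrong answer, which is why the read-once form matters.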

7

Read-Once Event Expressions

Safe plans for safe queries directly produce expressions in read-once form (Olteanu-Huang ’08)

Unsafe queries can also produce read-once expressions

Our example is read-once:
    E = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3
      = (w1 v1 + w2 v2) u1 + w3 (v3 u2 + v4 u3)
    Corresponds to unsafe query q() := R(x), S(x, y), T(y)
    No query plan can produce the read-once form directly
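As a worked check, the read-once form can be evaluated bottom-up with the tuple probabilities from the example database (w1, w2, w3 = 0.3, 0.4, 0.6; v1 to v4 = 0.1, 0.5, 0.2, 0.1; u1, u2, u3 = 0.7, 0.8, 0.4), using only the product and independent-or rules:

```latex
\begin{align*}
P[w_1 v_1 + w_2 v_2] &= 1 - (1 - 0.3 \cdot 0.1)(1 - 0.4 \cdot 0.5) = 1 - 0.97 \cdot 0.8 = 0.224 \\
P[(w_1 v_1 + w_2 v_2)\, u_1] &= 0.224 \cdot 0.7 = 0.1568 \\
P[v_3 u_2 + v_4 u_3] &= 1 - (1 - 0.2 \cdot 0.8)(1 - 0.1 \cdot 0.4) = 1 - 0.84 \cdot 0.96 = 0.1936 \\
P[w_3 (v_3 u_2 + v_4 u_3)] &= 0.6 \cdot 0.1936 = 0.11616 \\
P[E] &= 1 - (1 - 0.1568)(1 - 0.11616) \approx 0.2547
\end{align*}
```

Each step is legitimate because the two operands never share an event variable.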

8

Problem Definition

Given
    a boolean CQ- query q,
    a tuple-independent database D,

Can we efficiently decide whether the event expression corresponding to q(D) is read-once?

If yes, can we compute the read-once form efficiently? (Then P[q(D)] can be computed efficiently.)

9

Read-once-ness: only a sufficient condition to efficiently compute P[q(D)]
    e.g., E = x1 x2 + x2 x3 + x3 x4 + …
    Not read-once
    P[E] can be computed in poly-time using dynamic programming
    Moreover, see detailed analysis in Jha-Suciu ’11 using OBDD, FBDD, d-DNNF

E is read-once ⇒ read-once form of E can be computed efficiently ⇒ P[E] can be computed efficiently
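The dynamic program mentioned for the chain expression can be sketched as follows. This is one possible formulation, not necessarily the one the authors had in mind: it computes the probability that no two consecutive variables are both true and takes the complement. The probabilities in the usage lines are hypothetical.

```python
from itertools import product


def chain_probability(p):
    """P[x1 x2 + x2 x3 + ... + x_{n-1} x_n] for independent x_i, P[x_i] = p[i].

    State after each variable: probability that no adjacent pair has been
    satisfied so far, split by whether the latest variable is false or true.
    """
    if len(p) < 2:
        return 0.0
    last_false, last_true = 1.0 - p[0], p[0]
    for pi in p[1:]:
        last_false, last_true = (
            (last_false + last_true) * (1.0 - pi),  # new variable is false
            last_false * pi,                        # new variable true, previous false
        )
    return 1.0 - (last_false + last_true)


def brute_force(p):
    """Exponential reference: add the weight of every satisfying assignment."""
    total = 0.0
    for bits in product([0, 1], repeat=len(p)):
        if any(bits[i] and bits[i + 1] for i in range(len(p) - 1)):
            weight = 1.0
            for bit, pi in zip(bits, p):
                weight *= pi if bit else 1.0 - pi
            total += weight
    return total


probs = [0.3, 0.4, 0.6, 0.1, 0.5]  # hypothetical probabilities
print(chain_probability(probs), brute_force(probs))
```

The loop is linear in the number of variables, even though the expression has no read-once form once it has three or more terms.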

10

Outline

Background
    Existing characterization of read-once expressions
    Co-occurrence Graphs

Our Contributions
    Co-table graph
    Step 1. Computation of co-table graph
    Step 2. Computation of read-once form

Related work, Future work and Conclusion

12

Characterization of Read-once Expressions

A positive boolean expression is read-once if and only if its “co-occurrence graph” is P4-free (no simple induced path with four vertices) and “normal”.
    Gurvich ’77, ’91

Can be checked (and computed) in poly-time if the expression is given in DNF (Golumbic-Mintz-Rotics ’06)

13

Co-occurrence Graph - GCO

Graph on variables in the expression as vertices

1. Express boolean expression in irredundant DNF: xy + xyz + zx → xy + zx

2. Put an edge between variables if they co-occur in a disjunct

Can be easily computed if the expression is in DNF

[Figure: GCO of xy + zx, with edges x–y and x–z]
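The two construction steps translate directly into code when the expression is already a DNF. The sketch below also includes a brute-force test for an induced P4, which is only half of the characterization: normality is not checked here, so a P4-free answer does not by itself prove read-once-ness. All function names are illustrative.

```python
from itertools import combinations, permutations


def irredundant(dnf):
    """Drop conjuncts absorbed by a smaller one, e.g. xy + xyz + zx -> xy + zx."""
    terms = {frozenset(t) for t in dnf}
    return [t for t in terms if not any(other < t for other in terms)]


def co_occurrence_graph(dnf):
    """Edges {a, b} for every pair of variables sharing a conjunct."""
    edges = set()
    for term in irredundant(dnf):
        edges.update(frozenset(pair) for pair in combinations(sorted(term), 2))
    return edges


def has_induced_p4(edges):
    """Brute-force search for an induced path a-b-c-d (tiny inputs only)."""
    vertices = sorted({v for e in edges for v in e})

    def adjacent(a, b):
        return frozenset((a, b)) in edges

    for a, b, c, d in permutations(vertices, 4):
        if (adjacent(a, b) and adjacent(b, c) and adjacent(c, d)
                and not adjacent(a, c) and not adjacent(a, d)
                and not adjacent(b, d)):
            return True
    return False


print(co_occurrence_graph([("x", "y"), ("x", "y", "z"), ("z", "x")]))
# xy + yz + zw: its graph is itself a path on four vertices.
print(has_induced_p4(co_occurrence_graph([("x", "y"), ("y", "z"), ("z", "w")])))
# xy + yz + zx: a triangle, so P4-free; it fails the normality condition instead.
print(has_induced_p4(co_occurrence_graph([("x", "y"), ("y", "z"), ("z", "x")])))
```

The quartic search is for illustration only; the cited algorithms test the same property far more efficiently.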

15

Our Contributions

1. DNF of event expression is not needed for CQ-
    GCO can be directly computed from “provenance DAGs”

2. We do not need to compute GCO
    A subgraph of GCO suffices: the “Co-table graph” GCT

Our Framework
    (1) Compute GCO, then use existing read-once testing algorithms (uses Gurvich’s characterization)
    vs.
    (2) Compute GCT, then use our read-once testing algorithm (uses an alternative to Gurvich’s characterization)
    (2) is faster than (1)

16

Provenance DAG

Event expressions, called “lineage” (Suciu keynote), are a form of provenance (Green-Karvounarakis-Tannen ’07).

We use provenance DAGs (Green et al. ’07)

Query q() := R(x), S(x, y), T(y)

Query Plan π()((R ⋈ S) ⋈ T)

E = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3

[Figure: provenance DAG for E over the leaves w1, w2, w3; v1, v2, v3, v4; u1, u2, u3]

17

Co-Table Graph -- GCT

Subgraph of GCO: |GCT| ≤ |GCO|

Keep an edge between two co-occurring variables only if their tables share variables in q

e.g.: q() := R(x), S(y); R, S have n tuples each; GCO has n^2 edges, GCT has zero!

q() := R(x), S(x, y), T(y)

E = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3

[Figure: GCO and GCT for E on the vertices w1, w2, w3, v1, v2, v3, v4, u1, u2, u3; GCT omits the w–u edges of GCO because R and T share no variable in q]
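The difference between the two graphs can be made concrete on the running example. The sketch below starts from the lineage in DNF purely for simplicity (the talk computes the graphs from a provenance DAG without building the DNF); the table tags on event variables and the `ATOMS` mapping are an encoding assumed for this illustration.

```python
from itertools import combinations

# Query atoms of q() := R(x), S(x, y), T(y): table -> query variables it uses.
ATOMS = {"R": {"x"}, "S": {"x", "y"}, "T": {"y"}}

# Lineage conjuncts; every event variable is tagged with its table.
E = [
    (("R", "w1"), ("S", "v1"), ("T", "u1")),
    (("R", "w2"), ("S", "v2"), ("T", "u1")),
    (("R", "w3"), ("S", "v3"), ("T", "u2")),
    (("R", "w3"), ("S", "v4"), ("T", "u3")),
]


def graphs(lineage, atoms):
    """Return (G_CO, G_CT) as sets of two-element frozensets."""
    g_co, g_ct = set(), set()
    for term in lineage:
        for (table_a, a), (table_b, b) in combinations(term, 2):
            edge = frozenset((a, b))
            g_co.add(edge)                     # co-occur in a conjunct
            if atoms[table_a] & atoms[table_b]:
                g_ct.add(edge)                 # tables share a query variable
    return g_co, g_ct


g_co, g_ct = graphs(E, ATOMS)
print(len(g_co), len(g_ct))  # 12 8

# q() := R(x), S(y) with n = 3 tuples per table: n^2 edges versus none.
cross = [(("R", f"w{i}"), ("S", f"v{j}")) for i in range(3) for j in range(3)]
g_co2, g_ct2 = graphs(cross, {"R": {"x"}, "S": {"y"}})
print(len(g_co2), len(g_ct2))  # 9 0
```

In the running example the four dropped edges are exactly the w–u pairs, since R and T have no query variable in common.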

18

Our Algorithm

Input: Provenance DAG H, obtained from the query plan

Step 1: Compute GCT (the same procedure can compute GCO as well)

Step 2: Compute read-once form (if possible); otherwise output that the event expression is not read-once

19

Step 1: Computing GCT

Theorem: Two variables are adjacent in GCT if and only if their least common ancestor set contains a product-node in the provenance DAG

[Figure: provenance DAG for E = xy + xz over the leaves x, y, z]

Proof uses critically the no-self-join assumption
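The condition in the theorem can be tried out on the slide's small example E = xy + xz. The sketch below only evaluates the least-common-ancestor test; it does not reproduce the authors' procedure or its complexity, and the DAG encoding and node names (`p1`, `p2`, `s`) are assumptions made for this illustration.

```python
# Hypothetical encoding of a provenance DAG: node -> (kind, children).
DAG = {
    "x": ("leaf", []),
    "y": ("leaf", []),
    "z": ("leaf", []),
    "p1": ("product", ["x", "y"]),  # x y
    "p2": ("product", ["x", "z"]),  # x z
    "s": ("sum", ["p1", "p2"]),     # xy + xz
}


def ancestors(dag, node):
    """All proper ancestors of `node`, found by walking parent links."""
    parents = {n: [] for n in dag}
    for n, (_, children) in dag.items():
        for child in children:
            parents[child].append(n)
    seen, stack = set(), [node]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def least_common_ancestors(dag, a, b):
    """Common ancestors that have no child which is also a common ancestor."""
    common = ancestors(dag, a) & ancestors(dag, b)
    return {n for n in common if not any(c in common for c in dag[n][1])}


def lca_has_product(dag, a, b):
    """The adjacency test stated in the theorem."""
    return any(dag[n][0] == "product"
               for n in least_common_ancestors(dag, a, b))


for pair in [("x", "y"), ("x", "z"), ("y", "z")]:
    print(pair, lca_has_product(DAG, *pair))
```

For this example the test reports x–y and x–z as adjacent and y–z as not adjacent, matching the pairs that co-occur in a conjunct of xy + xz.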

20

Step 2: Computing Read-once form

Input: GCT

Alternate between Row Decomposition and Table Decomposition
    Recursive computation
    Exactly one can be done at a recursion level, otherwise not read-once
    Proof uses critically no-union assumption
    Sound and Complete

Row decomposition: the rows split into groups; q on each group gives E1, E2, E3, and E = E1 + E2 + E3

Table decomposition: the tables split between subqueries q1 and q2, which give E1 and E2, and E = E1 E2

21

Example: Row Decomposition

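q() := R(x), S(x, y), T(y)

E = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3

R   Tuple      Event variable
    a1         w1
    a2         w2
    a3         w3

S   Tuple      Event variable
    a1 b1      v1
    a2 b1      v2
    a3 b2      v3
    a3 b3      v4

T   Tuple      Event variable
    b1         u1
    b2         u2
    b3         u3

[Figure: graph on the vertices w1, w2, w3, v1, v2, v3, v4, u1, u2, u3]

Row decomposition splits the rows into groups whose expressions are combined with +. First group:

R1  Tuple      Event variable
    a1         w1
    a2         w2

S1  Tuple      Event variable
    a1 b1      v1
    a2 b1      v2

T1  Tuple      Event variable
    b1         u1

The second group holds the remaining rows: a3 (w3); a3 b2 (v3), a3 b3 (v4); b2 (u2), b3 (u3).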
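The slides do not spell out how the row groups are found. One natural reading, assumed here and consistent with the later remark that the algorithm uses BFS/DFS, is that the groups are the connected components of the co-table graph. The sketch below checks that this reading reproduces the split of the running example; the edge list is the co-table graph of E written out by hand.

```python
def connected_components(vertices, edges):
    """Group vertices by reachability, using an iterative DFS."""
    neighbours = {v: set() for v in vertices}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, components = set(), []
    for start in vertices:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v in component:
                continue
            component.add(v)
            stack.extend(neighbours[v] - component)
        seen |= component
        components.append(component)
    return components


VERTICES = ["w1", "w2", "w3", "v1", "v2", "v3", "v4", "u1", "u2", "u3"]

# Co-table graph of the running example: only R-S and S-T pairs.
G_CT = [
    ("w1", "v1"), ("w2", "v2"), ("w3", "v3"), ("w3", "v4"),
    ("v1", "u1"), ("v2", "u1"), ("v3", "u2"), ("v4", "u3"),
]

for group in connected_components(VERTICES, G_CT):
    print(sorted(group))
```

The first component contains w1, w2, v1, v2, u1, that is, the rows of R1, S1 and T1; the second contains the event variables of the remaining rows.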

22

Example: Table Decomposition

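q() := R(x), S(x, y), T(y)

R1  Tuple      Event variable
    a1         w1
    a2         w2

S1  Tuple      Event variable
    a1 b1      v1
    a2 b1      v2

T1  Tuple      Event variable
    b1         u1

Table decomposition splits q into subqueries over disjoint sets of tables:
    q1() := R(x), S(x, y1)    gives (w1 v1 + w2 v2)
    q2() := T(y2)             gives u1

Product of the two: (w1 v1 + w2 v2) u1

Final Expression: (w1 v1 + w2 v2) u1 + w3 (v3 u2 + v4 u3)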

23

Overall Time Complexity

Input: Provenance DAG H

Step 1: Compute GCT or GCO
    Time complexity ≈ O(n mH + WH mCO)
    mH = #edges in H, WH = width of H, mCO = #edges in GCO, mCT = #edges in GCT

Step 2: Compute read-once form (if possible)
    Using our algorithm: O((mCT + n) min(|q|, √n)); data complexity O(mCT + n)
    Using existing algorithms: O(mCO + n), mCT ≤ mCO

Summary
    Analysis uses “charging argument”
    Bound recursion depth, total time at each recursion level
    Step 1 is more expensive
    Step 2 is linear
        in |GCO| for existing algorithms
        in |GCT| for our algorithms
        |GCT| ≤ |GCO|

25

Related Work

Sen-Deshpande-Getoor ’10
    Independent work, considers the same problem
    Shows that “normality” check is not needed for CQ-
    Tests P4-freeness using “lineage-trees” without computing the co-occurrence graph

Our work:
    Computes the co-occurrence graph without DNF computation, so existing algorithms can be used
        This was an open question in Sen-Deshpande-Getoor ’10
    Obtains a faster and simpler algorithm
        Time complexity comparison in the paper
        Uses BFS/DFS, easier to implement
        Uses compact provenance DAGs instead of lineage trees

26

Other Related Work

Semantics of probabilistic query answering
    Fuhr-Rollecke ’97, Zimanyi ’97

Dichotomy of CQ-, CQ and UCQ queries
    Dalvi-Suciu ’04, ’07, Dalvi-Schnaitter-Suciu ’10

Knowledge compilation techniques
    Olteanu-Huang ’08, Jha-Olteanu-Suciu ’10, Jha-Suciu ’11, Fink-Olteanu ’11

27

Conclusion and Future Work

Can co-occurrence/co-table graph be computed as a pre-processing step?
    This is the more expensive step
    Akin to building indexes on databases, but depends on the query’s “join pattern”
    Cache the already computed GCT with the join pattern

How to handle
    Larger classes of queries (UCQ?) and database models (disjoint independent?)
    Other efficient knowledge-compilation forms

28

Thank You.

Questions?
