management of probabilistic data: foundations and challenges


Post on 18-Jan-2016


DESCRIPTION

Management of Probabilistic Data: Foundations and Challenges. Nilesh Dalvi and Dan Suciu, University of Washington. Databases Are Deterministic. Applications since the 1970s required precise semantics (accounting, inventory). Database tools are deterministic: a tuple is an answer or is not. - PowerPoint PPT Presentation

TRANSCRIPT

  • Management of Probabilistic Data: Foundations and Challenges. Nilesh Dalvi and Dan Suciu, University of Washington.

  • Databases Are Deterministic. Applications since the 1970s required precise semantics (accounting, inventory). Database tools are deterministic: a tuple is an answer or is not. The underlying theory assumes determinism: FO (First Order Logic).

  • Future of Data Management. We need to cope with uncertainties!

    Represent uncertainties as probabilities

    Extend data management tools to handle probabilistic data

    Major paradigm shift affecting both foundations and systems

  • Uncertainties Everywhere. In the schema mappings: data spaces, pay-as-you-go data integration. In the data mapping: life science data integration, object reconciliation, fuzzy joins. In the data itself: data by the masses, information extraction, RFID data, sensor data. [Halevy2007] [Philippi&Kohler2006] [Gupta&Sarawagi2006] [Welbourne2007] [Arasu06]

  • Example 1: Data Integration in Life Sciences. U2 integrates several biological databases [B. Louie et al. 2007]: EntrezProtein, Pfam, TIGRFAM, NCBI Blast, EntrezGene. Example: find functional annotations of ABCD1. User types: Gene ABCD1. U2 finds 80 related proteins and ranks them by uncertainty score. The 9 correct functions are among the top 11. Need to represent uncertainties explicitly.

  • Example 2: Information Extraction [Gupta&Sarawagi2006]. Input text: "...52 A Goregaon West Mumbai ...". Here probabilities are meaningful: 20% of such extractions are correct.

    | ID | House-No | Street        | City        | P   |
    | 1  | 52       | Goregaon West | Mumbai      | 0.1 |
    | 1  | 52-A     | Goregaon West | Mumbai      | 0.4 |
    | 1  | 52       | Goregaon      | West Mumbai | 0.2 |
    | 1  | 52-A     | Goregaon      | West Mumbai | 0.2 |
    | 2  | ...      | ...           | ...         | ... |

  • Example 3: RFID Ecosystem at UW [Welbourne2007]

  • RFID data = noisy: SIGHTING(tagID, antennaID, time)

    Derived data = probabilistic: "John entered Room 524 at 9:15" prob=0.6; "John carried laptop x77 at 11:03" prob=0.8; ... Queries: Which people were in Room 478 yesterday? Massive amounts of probabilistic data from RFIDs, sensors.

  • A Model for Uncertainties. Data is probabilistic

    Queries formulated in a standard language

    Answers are annotated with probabilities. This talk: Probabilistic Databases

  • Probabilistic databases: Long History. Cavallo&Pitarelli:1987; Barbara, Garcia-Molina, Porter:1992; Lakshmanan, Leone, Ross&Subrahmanian:1997; Fuhr&Roellke:1997; Dalvi&S:2004; Widom:2005. Focus today: the Query Evaluation Problem.

  • Has this been solved by AI?

    |               | AI (Input: KB)          | Databases (Fix q; Input: DB) |
    | Deterministic | Theorem prover          | Query processing             |
    | Probabilistic | Probabilistic inference | [this talk]                  |

  • Outline: Data model

    Query evaluation

    Challenges

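  • What is a Probabilistic Database (PDB)? Example table HasObject^p, where Object and Time are the keys, Person is the non-key attribute, and P is the probability. What does it mean? [Barbara et al. 1992]

    | Object   | Time | Person | P    |
    | Laptop77 | 9:07 | John   | 0.62 |
    | Laptop77 | 9:07 | Jim    | 0.34 |
    | Book302  | 9:18 | Mary   | 0.45 |
    | Book302  | 9:18 | John   | 0.33 |
    | Book302  | 9:18 | Fred   | 0.11 |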

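  • Background. Finite probability space $(\Omega, P)$, where $\Omega = \{\omega_1, \ldots, \omega_n\}$ is the set of outcomes, $P : \Omega \to [0,1]$, and $P(\omega_1) + \ldots + P(\omega_n) = 1$. Event: $E \subseteq \Omega$, with $P(E) = \sum_{\omega \in E} P(\omega)$. Independent: $P(E_1 \cap E_2) = P(E_1) P(E_2)$. Mutually exclusive or disjoint: $P(E_1 \cap E_2) = 0$.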

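  • Possible Worlds Semantics. The PDB denotes a probability space whose set of outcomes $\Omega$ is the set of possible worlds.

    PDB:

    | Object   | Time | Person | P     |
    | Laptop77 | 9:07 | John   | $p_1$ |
    | Laptop77 | 9:07 | Jim    | $p_2$ |
    | Book302  | 9:18 | Mary   | $p_3$ |
    | Book302  | 9:18 | John   | $p_4$ |
    | Book302  | 9:18 | Fred   | $p_5$ |

    Possible worlds (the slide labels three of them with their probabilities):

    | World | Tuples (Object, Time, Person)                  | P                           |
    | 1     | (Laptop77, 9:07, John), (Book302, 9:18, Mary)  | $p_1 p_3$                   |
    | 2     | (Laptop77, 9:07, John), (Book302, 9:18, John)  | $p_1 p_4$                   |
    | 3     | (Laptop77, 9:07, John), (Book302, 9:18, Fred)  |                             |
    | 4     | (Laptop77, 9:07, Jim), (Book302, 9:18, Mary)   |                             |
    | 5     | (Laptop77, 9:07, Jim), (Book302, 9:18, John)   |                             |
    | 6     | (Laptop77, 9:07, Jim), (Book302, 9:18, Fred)   |                             |
    | 7     | (Laptop77, 9:07, John)                         | $p_1 (1 - p_3 - p_4 - p_5)$ |
    | 8     | (Laptop77, 9:07, Jim)                          |                             |
    | 9     | (Book302, 9:18, Mary)                          |                             |
    | 10    | (Book302, 9:18, John)                          |                             |
    | 11    | (Book302, 9:18, Fred)                          |                             |
    | 12    | (empty table)                                  |                             |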
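
The possible-worlds construction above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the talk: the dictionary layout and the function name `possible_worlds` are assumptions, and the numeric probabilities are the ones shown in the earlier HasObject example. Alternatives that share a key (Object, Time) are treated as mutually exclusive, and different keys as independent.

```python
from itertools import product

# Assumed representation: key (Object, Time) -> {Person: probability}.
# Alternatives under one key are disjoint; different keys are independent.
has_object = {
    ("Laptop77", "9:07"): {"John": 0.62, "Jim": 0.34},
    ("Book302", "9:18"): {"Mary": 0.45, "John": 0.33, "Fred": 0.11},
}

def possible_worlds(table):
    """Yield (world, probability); a world is a frozenset of (object, time, person)."""
    keys = list(table)
    per_key_choices = []
    for key in keys:
        alternatives = table[key]
        choices = list(alternatives.items())
        # The key may also be absent, with the leftover probability mass.
        choices.append((None, 1.0 - sum(alternatives.values())))
        per_key_choices.append(choices)
    for combo in product(*per_key_choices):
        world = frozenset(
            key + (person,)
            for key, (person, _) in zip(keys, combo)
            if person is not None
        )
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield world, prob

worlds = list(possible_worlds(has_object))
for world, prob in worlds:
    print(sorted(world), round(prob, 4))
```

With two alternatives for the laptop and three for the book, plus the "absent" choice for each key, the sketch enumerates 3 × 4 = 12 worlds whose probabilities add up to 1.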

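  • Definitions. Definition: A tuple-disjoint/independent table is $R(A_1, A_2, \ldots, A_m, B_1, \ldots, B_n, P)$, with key attributes $A_1, \ldots, A_m$, non-key attributes $B_1, \ldots, B_n$, and probability $P$. Definition: A tuple-independent table is $R(A_1, A_2, \ldots, A_m, P)$. Definition: Semantics is given by possible worlds.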

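  • HasObject(Object, Time, Person, P): tuples with the same key (Object, Time) are disjoint; tuples with different keys are independent. Meets(Person1, Person2, Time, P): all tuples are independent.

    HasObject:

    | Object   | Time | Person | P     |
    | Laptop77 | 9:07 | John   | $p_1$ |
    | Laptop77 | 9:07 | Jim    | $p_2$ |
    | Book302  | 9:18 | Mary   | $p_3$ |
    | Book302  | 9:18 | John   | $p_4$ |
    | Book302  | 9:18 | Fred   | $p_5$ |

    Meets:

    | Person1 | Person2 | Time | P     |
    | John    | Jim     | 9:12 | $p_1$ |
    | Mary    | Sue     | 9:20 | $p_2$ |
    | John    | Mary    | 9:20 | $p_3$ |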

  • Query Semantics. A boolean query $q$ is an event: $\{\omega \mid \omega \models q\}$, with $P(q) = \sum_{\omega \models q} P(\omega)$. Example: $q$ = HasObject(MyBook, x, t), EnterRoom(x, CoffeeRoom, t), i.e. did someone take MyBook to the CoffeeRoom? $P(q) = 0.96$ (meaning: quite likely!).
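
As a sketch of this definition, the snippet below sums the probabilities of the worlds that satisfy a Boolean query over a tuple-independent Meets table. The numeric probabilities, the helper names, and the example query ("did John meet anyone?") are hypothetical and chosen only for illustration; the slides leave the probabilities symbolic.

```python
from itertools import product

# Tuple-independent table Meets(Person1, Person2, Time) with ASSUMED probabilities.
meets = [
    (("John", "Jim", "9:12"), 0.5),
    (("Mary", "Sue", "9:20"), 0.4),
    (("John", "Mary", "9:20"), 0.2),
]

def worlds(table):
    """Enumerate all subsets of the table together with their probabilities."""
    for bits in product([False, True], repeat=len(table)):
        world = {row for (row, _), present in zip(table, bits) if present}
        prob = 1.0
        for (_, p), present in zip(table, bits):
            prob *= p if present else 1.0 - p
        yield world, prob

def query_probability(table, q):
    """P(q) = sum of P(world) over the worlds in which q holds."""
    return sum(prob for world, prob in worlds(table) if q(world))

def john_met_someone(world):
    # Boolean query: is there a tuple Meets(John, y, t)?
    return any(person1 == "John" for person1, _, _ in world)

p_q = query_probability(meets, john_met_someone)
print(round(p_q, 4))
```

Enumerating worlds is exponential in the number of tuples, so this only serves to make the semantics concrete; it is not a practical evaluation strategy.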

  • Discussion of Data Model. Tuple-disjoint/independent tables: simple model, can store in any DBMS.

    More advanced models: symbolic boolean expressions (Fuhr and Roellke); Trio: add lineage [Widom05, Das Sarma06, Benjelloun 06]; Probabilistic Relational Models [Getoor2006]; graphical models [Sen&Deshpande07].

  • Outline: Data model

    Query evaluation: probability of Boolean expressions; from queries to Boolean expressions; data complexity of query evaluation

    Challenges

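  • Probability of Boolean Expressions. Given $\Phi = X_1 X_2 \vee X_1 X_3 \vee X_2 X_3$ with $P(X_1) = p_1$, $P(X_2) = p_2$, $P(X_3) = p_3$, compute $P(\Phi)$. From the truth table: $\Pr(\Phi) = (1-p_1) p_2 p_3 + p_1 (1-p_2) p_3 + p_1 p_2 (1-p_3) + p_1 p_2 p_3$.

    | X1 | X2 | X3 | P                   | $\Phi$ |
    | 0  | 0  | 0  |                     | 0      |
    | 0  | 0  | 1  |                     | 0      |
    | 0  | 1  | 0  |                     | 0      |
    | 0  | 1  | 1  | $(1-p_1) p_2 p_3$   | 1      |
    | 1  | 0  | 0  |                     | 0      |
    | 1  | 0  | 1  | $p_1 (1-p_2) p_3$   | 1      |
    | 1  | 1  | 0  | $p_1 p_2 (1-p_3)$   | 1      |
    | 1  | 1  | 1  | $p_1 p_2 p_3$       | 1      |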
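
The truth-table computation can be reproduced mechanically. The following is a minimal sketch (function names are illustrative, not from the talk) that sums the weights of the satisfying assignments of the formula above and compares the result with the closed form read off the table.

```python
from itertools import product

def phi(x1, x2, x3):
    # Phi = X1 X2 v X1 X3 v X2 X3
    return (x1 and x2) or (x1 and x3) or (x2 and x3)

def probability(formula, probs):
    """Sum the weights of all satisfying assignments of independent variables."""
    total = 0.0
    for assignment in product([False, True], repeat=len(probs)):
        if formula(*assignment):
            weight = 1.0
            for value, p in zip(assignment, probs):
                weight *= p if value else 1.0 - p
            total += weight
    return total

def closed_form(p1, p2, p3):
    # The four satisfying rows of the truth table.
    return ((1 - p1) * p2 * p3 + p1 * (1 - p2) * p3
            + p1 * p2 * (1 - p3) + p1 * p2 * p3)

print(probability(phi, [0.3, 0.6, 0.9]), closed_form(0.3, 0.6, 0.9))
```

The enumeration visits all 2^n assignments, which is why the later slides look for queries whose probability can be computed without it.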

  • Background. Fix $P(X_1) = P(X_2) = \ldots = P(X_n) = 1/2$

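  • Query q + Database PDB. Query: $q$ = R(x, y), S(x, z). PDB = R^p, S^p, with one Boolean variable per tuple. Resulting expression: $\Phi = X_1 Y_1 \vee X_1 Y_2 \vee X_2 Y_3 \vee X_2 Y_4 \vee X_2 Y_5$.

    R^p:

    | A  | B  | P     | Variable |
    | a1 | b1 | $p_1$ | X1       |
    | a2 | b2 | $p_2$ | X2       |

    S^p:

    | A  | C  | P     | Variable |
    | a1 | c1 | $q_1$ | Y1       |
    | a1 | c2 | $q_2$ | Y2       |
    | a2 | c3 | $q_3$ | Y3       |
    | a2 | c4 | $q_4$ | Y4       |
    | a2 | c5 | $q_5$ | Y5       |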
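
A small sketch of how the Boolean expression arises from the query R(x, y), S(x, z): every pair of tuples that joins on A contributes one conjunct. The tuple layout, the function names, and the numeric probabilities below are assumptions made for illustration; the slides keep the probabilities symbolic.

```python
from itertools import product

# Each tuple: (A, B or C, variable name, probability). Probabilities are ASSUMED.
R = [("a1", "b1", "X1", 0.5), ("a2", "b2", "X2", 0.4)]
S = [("a1", "c1", "Y1", 0.1), ("a1", "c2", "Y2", 0.2),
     ("a2", "c3", "Y3", 0.3), ("a2", "c4", "Y4", 0.4), ("a2", "c5", "Y5", 0.5)]

def lineage(r_rows, s_rows):
    """DNF for q = R(x, y), S(x, z): one clause per pair of tuples joining on A."""
    return [(x_var, y_var)
            for (a_r, _, x_var, _) in r_rows
            for (a_s, _, y_var, _) in s_rows
            if a_r == a_s]

def dnf_probability(clauses, probs):
    """Brute-force probability of a DNF over independent Boolean variables."""
    names = sorted(probs)
    total = 0.0
    for bits in product([False, True], repeat=len(names)):
        value = dict(zip(names, bits))
        if any(all(value[v] for v in clause) for clause in clauses):
            weight = 1.0
            for name in names:
                weight *= probs[name] if value[name] else 1.0 - probs[name]
            total += weight
    return total

clauses = lineage(R, S)
probs = {var: p for (_, _, var, p) in R + S}
print(clauses)
print(round(dnf_probability(clauses, probs), 5))
```

The clause list matches the five conjuncts of the expression on the slide; the brute-force evaluation is exponential in the number of variables and is shown only as a reference point.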

  • Application to Query Evaluation. Corollary: Fix an FO query q. Exact evaluation of Pr(q) on input PDB is in #P. Corollary: Fix a conjunctive query q. Approximation of Pr(q) on input PDB is in PTIME (FPTRAS). [Graedel,Gurevich,Hirsch:1998]

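  • Background: Probabilistic Networks. Inference: hard in general. KR techniques exploit local properties, e.g. bounded treewidth gives PTIME. [Zabiyaka&Darwiche06] Example: for R(x, y), S(x, z), $\Phi = X_1 Y_1 \vee X_1 Y_2 \vee X_2 Y_3 \vee X_2 Y_4 \vee X_2 Y_5$ over the variables $X_1, X_2, Y_1, \ldots, Y_5$ with probabilities $p_1, p_2, q_1, \ldots, q_5$. Note: for this query the treewidth is unbounded.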

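  • Safe plan for $q$ = R(x, y), S(x, z). The data complexity of this query is PTIME. [D&S2004]

    R^p:

    | A  | B  | P     |
    | a1 | b1 | $p_1$ |
    | a2 | b2 | $p_2$ |

    S^p:

    | A  | C  | P     |
    | a1 | c1 | $q_1$ |
    | a1 | c2 | $q_2$ |
    | a2 | c3 | $q_3$ |
    | a2 | c4 | $q_4$ |
    | a2 | c5 | $q_5$ |

    Step 1, project S^p on A:

    | A  | P                             |
    | a1 | $1-(1-q_1)(1-q_2)$            |
    | a2 | $1-(1-q_3)(1-q_4)(1-q_5)$     |

    Step 2, join with R^p:

    | A  | P                                |
    | a1 | $p_1(1-(1-q_1)(1-q_2))$          |
    | a2 | $p_2(1-(1-q_3)(1-q_4)(1-q_5))$   |

    Step 3, project out A:

    $P(q) = 1 - (1 - p_1(1-(1-q_1)(1-q_2))) \cdot (1 - p_2(1-(1-q_3)(1-q_4)(1-q_5)))$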
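
The three steps of this safe plan can be sketched directly in Python. This is an illustrative sketch rather than the MystiQ implementation: the helper names and the numeric probabilities are assumptions. One deliberate difference from the slide is that R is also projected on A before the join, which changes nothing on this example (one B value per A) but keeps the sketch correct when several R tuples share an A value.

```python
from collections import defaultdict

# Tuple-independent tables as (A, other attribute, probability); numbers are ASSUMED.
R = [("a1", "b1", 0.5), ("a2", "b2", 0.4)]
S = [("a1", "c1", 0.1), ("a1", "c2", 0.2),
     ("a2", "c3", 0.3), ("a2", "c4", 0.4), ("a2", "c5", 0.5)]

def independent_project(rows):
    """rows: iterable of (key, prob). Per key, return 1 - prod(1 - p)."""
    miss = defaultdict(lambda: 1.0)
    for key, p in rows:
        miss[key] *= 1.0 - p
    return {key: 1.0 - m for key, m in miss.items()}

def safe_plan(r_rows, s_rows):
    s_by_a = independent_project((a, p) for a, _, p in s_rows)  # step 1
    r_by_a = independent_project((a, p) for a, _, p in r_rows)
    # Step 2: join on A; R and S tuples are independent, so multiply.
    joined = [(None, r_by_a[a] * s_by_a[a]) for a in r_by_a if a in s_by_a]
    # Step 3: project out A; different A values involve disjoint sets of tuples.
    return independent_project(joined).get(None, 0.0)

p1, p2 = 0.5, 0.4
q1, q2, q3, q4, q5 = 0.1, 0.2, 0.3, 0.4, 0.5
closed_form = 1 - (1 - p1 * (1 - (1 - q1) * (1 - q2))) * \
                  (1 - p2 * (1 - (1 - q3) * (1 - q4) * (1 - q5)))
print(round(safe_plan(R, S), 5), round(closed_form, 5))
```

Every step is a group-by or a join, which is why a safe plan can run inside an ordinary relational engine in time polynomial in the size of the data.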

  • Dichotomy Theorem. Let q be a conjunctive query without self-joins. Theorem: One of the following holds: (1) either q is in PTIME, (2) or q is #P-hard. In case (1) q can be computed by a safe plan, and we call it a safe query. [D&S2004] [Andritsos et al. 2006]

  • #P-Hard Queries vs. PTIME Queries. PTIME queries: R(x, y), S(x, z); R(x, y), S(y), T(a, y); R(x), S(x, y), T(y), U(u, y), W(a, u); ... #P-hard queries: h1 = R(x), S(x, y), T(y); h2 = R(x,y), S(y); h3 = R(x,y), S(x,y); ... How do we decide if a query is in PTIME or #P-hard?

  • Hierarchical Queries. $sg(x)$ = set of subgoals containing the variable x in a key position. Definition: A query q is hierarchical if for all x, y: $sg(x) \subseteq sg(y)$ or $sg(x) \supseteq sg(y)$ or $sg(x) \cap sg(y) = \emptyset$.
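
The hierarchy test is easy to state in code. The sketch below assumes tuple-independent tables, so every attribute position counts as a key position; the query encoding (a list of relation names with their variable tuples, constants left out) and the function names are assumptions made for illustration.

```python
def subgoals_of(query):
    """Map each variable to the set of subgoal indexes in which it occurs.

    query: list of (relation_name, tuple_of_variable_names). Constants are
    simply omitted by the caller. All positions are treated as key positions,
    which is an assumption that fits tuple-independent tables only.
    """
    sg = {}
    for index, (_, variables) in enumerate(query):
        for var in variables:
            sg.setdefault(var, set()).add(index)
    return sg

def is_hierarchical(query):
    sg = subgoals_of(query)
    variables = list(sg)
    for i, x in enumerate(variables):
        for y in variables[i + 1:]:
            a, b = sg[x], sg[y]
            # Allowed: containment in either direction, or disjointness.
            if not (a <= b or a >= b or not (a & b)):
                return False
    return True

safe_query = [("R", ("x", "y")), ("S", ("x", "z"))]      # R(x, y), S(x, z)
h1 = [("R", ("x",)), ("S", ("x", "y")), ("T", ("y",))]   # R(x), S(x, y), T(y)
print(is_hierarchical(safe_query), is_hierarchical(h1))
```

For h1, sg(x) = {R, S} and sg(y) = {S, T} overlap without either containing the other, which is exactly the pattern the definition rules out.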

  • Case 1: Independent Tuples Only. PTIME queries. Fact: If q is hierarchical then q is in PTIME. [D&S2004] The hierarchy gives the safe plan! A root variable u gives an independent project that eliminates u; connected components give a join.

  • Case 1: Independent Tuples Only. #P-hard queries. Recall: h1 = R(x), S(x, y), T(y). h1 is #P-hard (reduction from Partitioned Positive 2DNF) [Provan&Ball83]. Fact: If q is non-hierarchical then it is #P-hard. [D&S2004] Proof: it contains h1: q = ... R(x, ...), S(x, y, ...), T(y, ...) ... Theorem: Testing if q is PTIME or #P-hard is in AC0.

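  • Case 2: Independent/disjoint Tuples. PTIME queries, e.g. R(x), S(x, y), T(y), U(u, y), W(a, u). Rules for building the safe plan: a root variable gives an independent project (I); connected components (CCs) give a join; constant key attributes give a disjoint project (D). [Figure: safe plan for this query over R^p(x), S^p(x,y), T^p(y), U^p(u,y), W^p(a,u), built from joins, disjoint projects, and independent projects.]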

  • Case 2: Independent/disjoint Tuples. #P-hard queries. Recall: h1 = R(x), S(x, y), T(y); h2 = R(x,y), S(y); h3 = R(x,y), S(x,y). If the safe-plan algorithm fails on q, then q can be rewritten to either h1 or h2 or h3 and hence is #P-hard (see paper for details). #P-hardness is by reduction from PERMANENT. Theorem: Testing if q is PTIME or #P-hard is PTIME complete.

  • Summary on Query Evaluation. We understand completely only queries without self-joins.

    Lessons learned from our system MystiQ. When the query is safe: evaluate it exactly, in the database engine. Performance: close to regular SQL.

    When the query is unsafe: approximate it, compute only top-k. Performance: one or two orders of magnitude worse. [Re2007]

  • Outline: Data model

    Query evaluation

    Challenges

  • Query Optimization. Even a #P-hard query often has subqueries that are in PTIME. Needed: combine safe plans + probabilistic inference; "interesting" independence/disjointness; model a probabilistic engine as a black box. CHALLENGE: Integrate a black-box probabilistic inference in a query processor. [Re2007, Re2007b]

  • Probabilistic Inference Algorithms. Open the box! Logical to physical. Examine specific algorithms from KR: variable elimination, junction trees, bounded treewidth. [Sen&Deshpande2007] [Bravo&Ramakrishnan2007] CHALLENGE: (1) Study the space of optimization alternatives. (2) Estimate the cost of specific probabilistic inference algorithms.

  • Open Theory Problems. Self-joins are much harder to study: solved only for independent tuples [D&S2007]. Extend to richer query language: unions, predicates (such as <), aggregates. Do hardness results still hold for Pr = 1/2? CHALLENGE: Complete the analysis of the query complexity over probabilistic databases.

  • Complex Probabilistic Model. Independent and disjoint tuples are insufficient for real applications. Capturing complex correlat
