management of probabilistic data: foundations and challenges
Post on 18-Jan-2016
Management of Probabilistic Data: Foundations and Challenges
Nilesh Dalvi and Dan Suciu
University of Washington
Databases Are Deterministic
  Applications since the 1970s required precise semantics
    Accounting, inventory
  Database tools are deterministic
    A tuple is an answer or is not
  Underlying theory assumes determinism
    FO (First Order Logic)
Future of Data Management
  We need to cope with uncertainties!
  Represent uncertainties as probabilities
  Extend data management tools to handle probabilistic data
  Major paradigm shift affecting both foundations and systems
Uncertainties Everywhere
  In the schema mappings:
    Data spaces
    Pay-as-you-go data integration
  In the data mapping:
    Life science data integration
    Object reconciliation, fuzzy joins
  In the data itself:
    Data by the masses
    Information extraction
    RFID data, sensor data
  [Halevy2007] [Philippi&Kohler2006] [Gupta&Sarawagi2006] [Welbourne2007] [Arasu06]
Example 1: Data Integration in Life Sciences
  U2 integrates several biological databases [B.Louie et al.2007]
  User types: Gene ABCD1
  U2 finds 80 related proteins
  Ranks them by uncertainty score
  The 9 correct functions are among the top 11
  Example: find functional annotations of ABCD1
    EntrezProtein, Pfam, TIGRFAM, NCBI Blast, EntrezGene
  Need to represent uncertainties explicitly
Example 2: Information Extraction [Gupta&Sarawagi2006]
  ...52 A Goregaon West Mumbai ...

  ID  House-No  Street         City         P
  1   52        Goregaon West  Mumbai       0.1
  1   52-A      Goregaon West  Mumbai       0.4
  1   52        Goregaon       West Mumbai  0.2
  1   52-A      Goregaon       West Mumbai  0.2
  2   ...       ...            ...          ...

  Here probabilities are meaningful: 20% of such extractions are correct
Example 3: RFID Ecosystem at UW [Welbourne2007]
  RFID data = noisy
    SIGHTING(tagID, antennaID, time)
  Derived data = probabilistic
    John entered Room 524 at 9:15, prob = 0.6
    John carried laptop x77 at 11:03, prob = 0.8
    ...
  Queries
    Which people were in Room 478 yesterday?
  Massive amounts of probabilistic data from RFIDs, sensors
A Model for Uncertainties
  Data is probabilistic
  Queries formulated in a standard language
  Answers are annotated with probabilities
  This talk: Probabilistic Databases
Probabilistic Databases: Long History
  Cavallo&Pitarelli:1987
  Barbara, Garcia-Molina, Porter:1992
  Lakshmanan, Leone, Ross&Subrahmanian:1997
  Fuhr&Roellke:1997
  Dalvi&S:2004
  Widom:2005
  Focus today: the Query Evaluation Problem
Has this been solved by AI?

                 AI                       Databases
  Input          KB                       Fix q; input: DB
  Deterministic  Theorem prover           Query processing
  Probabilistic  Probabilistic inference  [this talk]
Outline
  Data model
  Query evaluation
  Challenges
What is a Probabilistic Database (PDB)? [Barbara et al.1992]

  HasObject^p
  Object    Time  Person  P
  Laptop77  9:07  John    0.62
                  Jim     0.34
  Book302   9:18  Mary    0.45
                  John    0.33
                  Fred    0.11

  Keys: Object, Time. Non-keys: Person. Probability: P.
  What does it mean?
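Background
  Finite probability space = $(\Omega, P)$
    $\Omega = \{\omega_1, \ldots, \omega_n\}$ = set of outcomes
    $P : \Omega \to [0,1]$
    $P(\omega_1) + \ldots + P(\omega_n) = 1$
  Event: $E \subseteq \Omega$, $P(E) = \sum_{\omega \in E} P(\omega)$
  Independent: $P(E_1 \cap E_2) = P(E_1)\,P(E_2)$
  Mutually exclusive or disjoint: $P(E_1 \cap E_2) = 0$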
Possible Worlds Semantics
  PDB:
  Object    Time  Person  P
  Laptop77  9:07  John    p1
                  Jim     p2
  Book302   9:18  Mary    p3
                  John    p4
                  Fred    p5

  Possible worlds $\Omega = \{\omega_1, \ldots, \omega_{12}\}$ (each column gives the Person for that object; "-" means the tuple is absent):
  World  Laptop77 9:07  Book302 9:18  Probability
  1      John           Mary          p1 p3
  2      John           John          p1 p4
  3      John           Fred
  4      Jim            Mary
  5      Jim            John
  6      Jim            Fred
  7      John           -             p1 (1 - p3 - p4 - p5)
  8      Jim            -
  9      -              Mary
  10     -              John
  11     -              Fred
  12     -              -
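The construction of the possible worlds can be sketched in Python. This is a minimal illustration, not the authors' implementation: it plugs in the numeric probabilities from the earlier HasObject example (0.62, 0.34, 0.45, 0.33, 0.11) for p1 to p5, and the function name `possible_worlds` and the dictionary layout are assumptions made for this sketch.

```python
from itertools import product

# HasObject^p as a tuple-disjoint/independent table:
# key (Object, Time) -> list of (Person, probability) alternatives.
# Probabilities are taken from the earlier HasObject example.
HAS_OBJECT = {
    ("Laptop77", "9:07"): [("John", 0.62), ("Jim", 0.34)],
    ("Book302", "9:18"): [("Mary", 0.45), ("John", 0.33), ("Fred", 0.11)],
}

def possible_worlds(table):
    """Yield (world, probability) pairs.

    Alternatives sharing a key are disjoint (at most one is present);
    different keys are independent, so their probabilities multiply.
    """
    per_key_choices = []
    for key, alternatives in table.items():
        choices = [((key, person), p) for person, p in alternatives]
        # The key may also be absent, with the leftover probability.
        choices.append((None, 1.0 - sum(p for _, p in alternatives)))
        per_key_choices.append(choices)

    for combo in product(*per_key_choices):
        world = frozenset(t for t, _ in combo if t is not None)
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield world, prob

worlds = list(possible_worlds(HAS_OBJECT))
print(len(worlds))                # 3 * 4 = 12 worlds
print(sum(p for _, p in worlds))  # sums to 1 (up to floating-point rounding)
```

The number of worlds is the product of (alternatives + 1) over all keys, so explicit enumeration is only practical for very small tables.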
Definitions
  Definition: A tuple-disjoint/independent table is: R(A1, A2, ..., Am, B1, ..., Bn, P), where A1, ..., Am are the key attributes
  Definition: A tuple-independent table is: R(A1, A2, ..., Am, P)
  Definition: Semantics is given by possible worlds
  HasObject(Object, Time, Person, P)
  Object    Time  Person  P
  Laptop77  9:07  John    p1
                  Jim     p2
  Book302   9:18  Mary    p3
                  John    p4
                  Fred    p5
  Tuples with the same key (Object, Time) are disjoint; tuples with different keys are independent.

  Meets(Person1, Person2, Time, P)
  Person1  Person2  Time  P
  John     Jim      9:12  p1
  Mary     Sue      9:20  p2
  John     Mary     9:20  p3
  All tuples are independent.
Query Semantics
  A Boolean query q is an event: $\{\omega \mid \omega \models q\}$
  $P(q) = \sum_{\omega \models q} P(\omega)$
  q = HasObject(MyBook, x, t), EnterRoom(x, CoffeeRoom, t)
    Did someone take MyBook to the CoffeeRoom?
    P(q) = 0.96 (meaning: quite likely!)
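The definition of P(q) as a sum over satisfying worlds can be illustrated with a brute-force Python sketch over a tuple-independent table. The table follows the Meets example, but the numeric probabilities (standing in for p1, p2, p3), the query "Did John meet anyone?" and the helper names are hypothetical choices made for this illustration.

```python
from itertools import product

# Tuple-independent table Meets(Person1, Person2, Time, P).
# The probability values are hypothetical placeholders for p1, p2, p3.
MEETS = [
    (("John", "Jim", "9:12"), 0.5),
    (("Mary", "Sue", "9:20"), 0.4),
    (("John", "Mary", "9:20"), 0.2),
]

def query_probability(table, query):
    """P(q) = sum of P(world) over all worlds in which query(world) is true."""
    total = 0.0
    for present in product([False, True], repeat=len(table)):
        world = [t for (t, _), keep in zip(table, present) if keep]
        prob = 1.0
        for (_, p), keep in zip(table, present):
            prob *= p if keep else 1.0 - p
        if query(world):
            total += prob
    return total

# Boolean query: "Did John meet anyone?", i.e. q = Meets('John', y, t)
def john_met_someone(world):
    return any(person1 == "John" for person1, _, _ in world)

# Approximately 0.6, which equals 1 - (1 - 0.5) * (1 - 0.2)
print(query_probability(MEETS, john_met_someone))
```

The loop visits all 2^n worlds, which is exactly the cost that the query evaluation techniques later in the talk try to avoid.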
Discussion of Data Model
  Tuple-disjoint/independent tables:
    Simple model, can store in any DBMS
  More advanced models:
    Symbolic boolean expressions: Fuhr and Roellke
    Trio: add lineage [Widom05, Das Sarma06, Benjelloun 06]
    Probabilistic Relational Models [Getoor2006]
    Graphical models [Sen&Deshpande07]
Outline
  Data model
  Query evaluation
    Probability of Boolean expressions
    From queries to Boolean expressions
    Data complexity of query evaluation
  Challenges
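Probability of Boolean Expressions
  $\Phi = X_1X_2 \lor X_1X_3 \lor X_2X_3$
  $P(X_1) = p_1$, $P(X_2) = p_2$, $P(X_3) = p_3$
  Compute $P(\Phi)$:

  X1  X2  X3  Phi  P
  0   0   0   0
  0   0   1   0
  0   1   0   0
  0   1   1   1    (1-p1) p2 p3
  1   0   0   0
  1   0   1   1    p1 (1-p2) p3
  1   1   0   1    p1 p2 (1-p3)
  1   1   1   1    p1 p2 p3

  $\Pr(\Phi) = (1-p_1)p_2p_3 + p_1(1-p_2)p_3 + p_1p_2(1-p_3) + p_1p_2p_3$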
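The truth-table computation can be written directly as code. The following Python sketch sums the weights of all satisfying assignments; the function names and the numeric probabilities are assumptions chosen for illustration.

```python
from itertools import product

def phi(x1, x2, x3):
    """Phi = X1 X2 v X1 X3 v X2 X3 (true when at least two variables are true)."""
    return (x1 and x2) or (x1 and x3) or (x2 and x3)

def probability(formula, probs):
    """Sum the probabilities of all satisfying assignments.

    Variables are independent; probs[i] is the probability that
    variable i is true.
    """
    total = 0.0
    for assignment in product([False, True], repeat=len(probs)):
        if formula(*assignment):
            weight = 1.0
            for value, p in zip(assignment, probs):
                weight *= p if value else 1.0 - p
            total += weight
    return total

p1, p2, p3 = 0.5, 0.5, 0.5  # hypothetical values
# 4 of the 8 equally likely assignments satisfy Phi, so this prints 0.5
print(probability(phi, [p1, p2, p3]))
```

The enumeration takes time exponential in the number of variables, which is why the general problem is hard.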
Background
  Fix $P(X_1) = P(X_2) = \ldots = P(X_n) = 1/2$
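Query q + Database PDB => $\Phi$
  q = R(x, y), S(x, z)

  PDB:
  R^p
  A   B   P   Variable
  a1  b1  p1  X1
  a2  b2  p2  X2

  S^p
  A   C   P   Variable
  a1  c1  q1  Y1
  a1  c2  q2  Y2
  a2  c3  q3  Y3
  a2  c4  q4  Y4
  a2  c5  q5  Y5

  $\Phi = X_1Y_1 \lor X_1Y_2 \lor X_2Y_3 \lor X_2Y_4 \lor X_2Y_5$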
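The step from query and database to a Boolean expression can be sketched as a join that pairs up the variables of matching tuples. This Python fragment is an illustration only; the list-of-tuples layout and the function name `lineage` are assumptions.

```python
# Each tuple carries the name of its Boolean variable (X1, X2, Y1, ...).
R = [("a1", "b1", "X1"), ("a2", "b2", "X2")]
S = [("a1", "c1", "Y1"), ("a1", "c2", "Y2"),
     ("a2", "c3", "Y3"), ("a2", "c4", "Y4"), ("a2", "c5", "Y5")]

def lineage(r_table, s_table):
    """Return the DNF for q = R(x, y), S(x, z) as a list of clauses.

    Each clause is a pair of variable names whose tuples agree on x.
    """
    return [(xr, ys)
            for a_r, _, xr in r_table
            for a_s, _, ys in s_table
            if a_r == a_s]

clauses = lineage(R, S)
print(" v ".join(x + y for x, y in clauses))
# X1Y1 v X1Y2 v X2Y3 v X2Y4 v X2Y5
```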
Application to Query Evaluation
  Corollary: Fix an FO query q. Exact evaluation of Pr(q) on input PDB is in #P
  Corollary: Fix a conjunctive query q. Approximation of Pr(q) on input PDB is in PTIME (FPTRAS)
  [Graedel,Gurevich,Hirsch:1998]
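The slide does not say which approximation algorithm is meant. As one standard possibility for DNF formulas, the following Python sketch uses the Karp-Luby estimator; the function name, the sample count, the fixed seed and the probability values are all assumptions made for this illustration.

```python
import random

def karp_luby(clauses, probs, samples=20000, seed=0):
    """Estimate P(Phi) for a positive DNF over independent variables.

    clauses: list of tuples of variable names (each clause is a conjunction).
    probs:   dict mapping variable name -> probability of being true.
    """
    rng = random.Random(seed)
    variables = sorted(probs)

    # Probability of each clause = product of its variables' probabilities.
    weights = []
    for clause in clauses:
        w = 1.0
        for v in set(clause):
            w *= probs[v]
        weights.append(w)
    total_weight = sum(weights)

    hits = 0
    for _ in range(samples):
        # 1. Pick a clause with probability proportional to its weight.
        i = rng.choices(range(len(clauses)), weights=weights)[0]
        # 2. Sample an assignment conditioned on clause i being true.
        world = {v: (v in clauses[i]) or (rng.random() < probs[v])
                 for v in variables}
        # 3. Count the sample only if i is the first satisfied clause.
        first = next(j for j, c in enumerate(clauses)
                     if all(world[v] for v in c))
        hits += (first == i)
    return total_weight * hits / samples

clauses = [("X1", "Y1"), ("X1", "Y2"), ("X2", "Y3"), ("X2", "Y4"), ("X2", "Y5")]
probs = {v: 0.5 for c in clauses for v in c}  # hypothetical probabilities
print(karp_luby(clauses, probs))  # an estimate of the exact value 0.6484375
```

Unlike naive sampling of worlds, this estimator has a hit rate of at least 1/m for m clauses, which is what keeps the number of samples polynomial.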
Background: Probabilistic Networks
  Inference: hard in general
  KR techniques exploit local properties, e.g. bounded treewidth implies PTIME [Zabiyaka&Darwiche06]
  q = R(x, y), S(x, z)
  $\Phi = X_1Y_1 \lor X_1Y_2 \lor X_2Y_3 \lor X_2Y_4 \lor X_2Y_5$
  [Figure: network over the variables X1, X2, Y1, ..., Y5 with probabilities p1, p2, q1, ..., q5]
  Note: for this query the treewidth is unbounded
q = R(x, y), S(x, z)
  The data complexity of this query is PTIME [D&S2004]
  Safe plan:

  R^p
  A   B   P
  a1  b1  p1
  a2  b2  p2

  S^p
  A   C   P
  a1  c1  q1
  a1  c2  q2
  a2  c3  q3
  a2  c4  q4
  a2  c5  q5

  Project S^p on A:
  A   P
  a1  1-(1-q1)(1-q2)
  a2  1-(1-q3)(1-q4)(1-q5)

  Join with R^p on A:
  A   P
  a1  p1(1-(1-q1)(1-q2))
  a2  p2(1-(1-q3)(1-q4)(1-q5))

  P(q) = 1 - (1-p1(1-(1-q1)(1-q2))) * (1-p2(1-(1-q3)(1-q4)(1-q5)))
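The safe plan for q = R(x, y), S(x, z) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the MystiQ implementation: the helper names and the numeric probabilities are hypothetical, and the sketch also projects R on A before joining, which changes nothing in the example because each A value occurs once in R.

```python
def independent_or(probabilities):
    """Probability that at least one of several independent events occurs."""
    none = 1.0
    for p in probabilities:
        none *= 1.0 - p
    return 1.0 - none

def project_on_a(table):
    """Independent project of a tuple-independent table onto attribute A."""
    groups = {}
    for a, _, p in table:
        groups.setdefault(a, []).append(p)
    return {a: independent_or(ps) for a, ps in groups.items()}

def safe_plan(r_table, s_table):
    """Evaluate q = R(x, y), S(x, z); tables are lists of (A, other, P)."""
    r_proj = project_on_a(r_table)
    s_proj = project_on_a(s_table)
    # Join on A: the two sides are independent, so probabilities multiply.
    joined = [r_proj[a] * s_proj[a] for a in r_proj if a in s_proj]
    # Final independent project removes A and yields the Boolean answer.
    return independent_or(joined)

# Hypothetical probabilities: every p_i and q_j set to 0.5.
R = [("a1", "b1", 0.5), ("a2", "b2", 0.5)]
S = [("a1", "c1", 0.5), ("a1", "c2", 0.5),
     ("a2", "c3", 0.5), ("a2", "c4", 0.5), ("a2", "c5", 0.5)]
print(safe_plan(R, S))  # 0.6484375
```

Every step is a grouping, a join or a product, so the whole plan can run inside a relational engine in time polynomial in the size of the tables.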
Dichotomy Theorem [D&S2004] [Andritsos et al2006]
  Let q be a conjunctive query without self-joins
  Theorem: One of the following holds:
    (1) Either q is in PTIME
    (2) Or q is #P-hard
  In Case (1) q can be computed by a safe plan and we call it a safe query
PTIME Queries:
  R(x, y), S(x, z)
  R(x, y), S(y), T(a, y)
  R(x), S(x, y), T(y), U(u, y), W(a, u)
  ...
#P-Hard Queries:
  h1 = R(x), S(x, y), T(y)
  h2 = R(x, y), S(y)
  h3 = R(x, y), S(x, y)
  ...
How do we decide if a query is in PTIME or #P-hard?
Hierarchical Queries
  sg(x) = set of subgoals containing the variable x in a key position
  Definition: A query q is hierarchical if for all x, y:
    $sg(x) \subseteq sg(y)$ or $sg(x) \supseteq sg(y)$ or $sg(x) \cap sg(y) = \emptyset$
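The hierarchy test is a direct check on the sets sg(x). The Python sketch below assumes a query without self-joins, represented as a dictionary from subgoal name to the variables in its key positions; this representation and the function name are assumptions made for illustration.

```python
def is_hierarchical(query):
    """Check the hierarchy condition on a query without self-joins.

    query: dict mapping subgoal name -> tuple of variables that appear
    in key positions of that subgoal (constants are left out).
    """
    variables = {v for vs in query.values() for v in vs}
    # sg[v] = set of subgoals containing v in a key position.
    sg = {v: {g for g, vs in query.items() if v in vs} for v in variables}
    for x in variables:
        for y in variables:
            a, b = sg[x], sg[y]
            if not (a <= b or a >= b or a.isdisjoint(b)):
                return False
    return True

# R(x, y), S(x, z): sg(x) = {R, S}, sg(y) = {R}, sg(z) = {S}
print(is_hierarchical({"R": ("x", "y"), "S": ("x", "z")}))           # True
# h1 = R(x), S(x, y), T(y): sg(x) = {R, S} and sg(y) = {S, T} overlap
print(is_hierarchical({"R": ("x",), "S": ("x", "y"), "T": ("y",)}))  # False
```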
Case 1: Independent Tuples Only
  Fact: If q is hierarchical then q is in PTIME [D&S2004]
  PTIME Queries: the hierarchy gives the safe plan!
    Root variable u: project out u ($\pi_{-u}$)
    Connected components: Join
Case 1: Independent Tuples Only
  Fact: If q is non-hierarchical then it is #P-hard [D&S2004]
  #P-hard Queries:
    Recall: h1 = R(x), S(x, y), T(y)
    h1 is #P-hard (reduction from Partitioned Positive 2DNF) [Provan&Ball83]
    Proof: it contains h1: q = ... R(x, ...), S(x, y, ...), T(y, ...) ...
  Theorem: Testing if q is PTIME or #P-hard is in AC0
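Case 2: Independent/disjoint Tuples
  PTIME Queries:
    R(x), S(x, y), T(y), U(u, y), W(a, u)
  Safe-plan rules:
    Root variable: independent project ($\pi^I$)
    Connected components (CCs): Join
    Constant key attrs: disjoint project ($\pi^D$)
  [Figure: safe plan for this query over R^p(x), S^p(x,y), T^p(y), U^p(u,y), W^p(a,u), built from joins, independent projects and disjoint projects]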
Case 2: Independent/disjoint Tuples
  #P-hard Queries:
    Recall:
      h1 = R(x), S(x, y), T(y)
      h2 = R(x, y), S(y)
      h3 = R(x, y), S(x, y)
    #P-hard by reduction from PERMANENT
    If the safe-plan algorithm fails on q, then q can be rewritten to either h1 or h2 or h3 and hence is #P-hard (see paper for details)
  Theorem: Testing if q is PTIME or #P-hard is PTIME complete
Summary on Query Evaluation
  We understand completely only queries w/o self-joins
  Lessons learned from our system MystiQ: [Re2007]
    When the query is safe:
      Evaluate it exactly, in the database engine
      Performance: close to regular SQL
    When the query is unsafe:
      Approximate it, compute only top-k
      Performance: one or two orders of magnitude worse
Outline
  Data model
  Query evaluation
  Challenges
Query Optimization
  Even a #P-hard query often has subqueries that are in PTIME.
  Needed:
    Combine safe plans + probabilistic inference
    Interesting independence/disjointness
    Model a probabilistic engine as a black box
  CHALLENGE: Integrate a black-box probabilistic inference in a query processor. [Re2007, Re2007b]
Probabilistic Inference Algorithms
  Open the box!
  Logical to physical
  Examine specific algorithms from KR: [Sen&Deshpande2007] [Bravo&Ramakrishnan2007]
    Variable elimination
    Junction trees
    Bounded treewidth
  CHALLENGE: (1) Study the space of optimization alternatives. (2) Estimate the cost of specific probabilistic inference algorithms.
Open Theory Problems
  Self-joins are much harder to study
    Solved only for independent tuples [D&S2007]
  Extend to richer query language
    Unions, predicates (<, ...), aggregates
  Do hardness results still hold for Pr = 1/2?
  CHALLENGE: Complete the analysis of the query complexity over probabilistic databases
Complex Probabilistic Model
  Independent and disjoint tuples are insufficient for real applications
  Capturing complex correlat