A Generic Provenance Middleware for Database Queries, Updates, and Transactions

Bahareh Sadat Arab (1), Dieter Gawlick (2), Venkatesh Radhakrishnan (2), Hao Guo (1), Boris Glavic (1)

(1) IIT DBGroup
(2) Oracle


Outline
- Motivation and Overview
- GProM Vision
- Provenance for Transactions

Introduction
- Data provenance: information about the origin and creation process of data
- Provenance tracking for database operations
  - Considerable interest from the database community in the last decade
- The de-facto standard for database provenance [1,2,3,4,5]
  - Model provenance as annotations on data (e.g., tuples)
  - Compute the provenance by propagating annotations (query rewrite)

    SELECT DISTINCT Owner
    FROM CanAcc;

[1] B. Glavic, R. J. Miller, and G. Alonso. Using SQL for Efficient Generation and Querying of Provenance Information. In Search of Elegance in the Theory and Practice of Computation, Springer, 2013.
[2] G. Karvounarakis, T. J. Green, Z. G. Ives, and V. Tannen. Collaborative data sharing via update exchange and provenance. TODS, 2013.
[3] D. Bhagwat, L. Chiticariu, W.-C. Tan, and G. Vijayvargiya. An Annotation Management System for Relational Databases. VLDB Journal, 14(4):373-396, 2005.
[4] P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. U. Nabar, T. Sugihara, and J. Widom. Trio: A System for Data, Uncertainty, and Lineage. In VLDB, pages 1151-1154, 2006.
[5] G. Karvounarakis and T. Green. Semiring-annotated data: Queries and provenance. SIGMOD Record, 41(3):5-14, 2012.

Use Cases
- Debugging data and transformations (queries) [1]
- Probabilistic databases (queries) [5]
- Auditing and compliance (transactions and update statements) [6]
- Understanding data integration transformations (queries and transactions)
- Assessing data quality and trust (queries and transactions) [7]

Computing provenance for updates and transactions is essential for many use cases.

[1] B. Glavic, R. J. Miller, and G. Alonso. Using SQL for Efficient Generation and Querying of Provenance Information. In Search of Elegance in the Theory and Practice of Computation, Springer, 2013.
[5] P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. U. Nabar, T. Sugihara, and J. Widom. Trio: A System for Data, Uncertainty, and Lineage. In VLDB, 2006.
[6] D. Gawlick and V. Radhakrishnan. Fine grain provenance using temporal databases. In TaPP, 2011.
[7] G. Karvounarakis and T. Green. Semiring-annotated data: Queries and provenance. SIGMOD Record, 2012.

Shortcomings of State-of-the-Art
- No practical implementation for updates
- No system or model supports transactions
- Inflexible provenance storage
  - Always on [2,3]
  - On-demand only [1]
- Query rewrites use atypical access patterns and operator sequences -> leads to poor execution plans
- Most systems support only one type of provenance

[1] B. Glavic, R. J. Miller, and G. Alonso. Using SQL for Efficient Generation and Querying of Provenance Information. In Search of Elegance in the Theory and Practice of Computation, Springer, 2013.
[2] D. Bhagwat, L. Chiticariu, W.-C. Tan, and G. Vijayvargiya. An Annotation Management System for Relational Databases. VLDB Journal, 2005.
[3] G. Karvounarakis, T. J. Green, Z. G. Ives, and V. Tannen. Collaborative data sharing via update exchange and provenance. TODS, 2013.

Objectives
- Vision: Generic Provenance Database Middleware (GProM)
  - Provenance for queries, updates, and transactions
  - User decides when to compute and store provenance
  - Supports multiple provenance models
  - Database-independent
- Tracking provenance of concurrent transactions
  - Reenactment queries

Contributions
- First solution for provenance of transactions
- Retroactive on-demand provenance computation
  - Using read-only reenactment
- Only requires audit log + time travel
  - Supported by most DBMSs
  - No additional storage and runtime overhead
- Non-invasive provenance computation
  - Query rewrite + annotation propagation

System Architecture
- Database-independent middleware
  - Pluggable parser and SQL code generator
- Internal query representation
  - Relational Algebra Graph Model (AGM)
- Core driver: query rewrites
  - Provenance computation
- Flexible storage policies for provenance
- Provenance import/export
- AGM optimizer (rewritten queries)
- Extensibility: Rewrite Specification Language (RSL)
- Initial prototype built on top of Oracle

GProM Overview
[Figure: GProM overview diagram]

Provenance Computation
- Query rewrite
  - Take the original query q and rewrite it into q+
  - q+ computes the original results + provenance
- Propagate provenance through operations

Example Rewrite

Input:
    SELECT DISTINCT u.Owner
    FROM USacc u, CanAcc c
    WHERE u.ID = c.ID;

Rewrite parts:
    USacc
      -> SELECT ID, Owner, Balance, Type, ID AS P1, Owner AS P2, Balance AS P3, Type AS P4 FROM USacc
    CanAcc
      -> SELECT ID, Owner, Balance, Type, ID AS P5, Owner AS P6, Balance AS P7, Type AS P8 FROM CanAcc
    WHERE u.ID = c.ID
      -> WHERE u.ID = c.ID
    SELECT DISTINCT Owner
      -> SELECT Owner, P1, P2, P3, P4, P5, P6, P7, P8

Output:
    SELECT u.Owner, P1, P2, P3, P4, P5, P6, P7, P8
    FROM (SELECT ID, Owner, Balance, Type, ID AS P1, Owner AS P2, Balance AS P3, Type AS P4 FROM USacc) u,
         (SELECT ID, Owner, Balance, Type, ID AS P5, Owner AS P6, Balance AS P7, Type AS P8 FROM CanAcc) c
    WHERE u.ID = c.ID;

Provenance Computation
- Operates on the relational algebra representation of queries
- Fixed set of rewrite rules per provenance type
  - One per type of algebra operator
- Recursive top-down rewrite
  - For each relation access: duplicate attributes as provenance
  - For each operator: replace with an algebra graph that propagates provenance annotations
- Composable
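The rewrite on the Example Rewrite slide can be tried out on a toy database. The following is a minimal sketch, not GProM code: it uses Python's `sqlite3` module instead of Oracle, the sample rows are invented for illustration, and the helper `annotate` is a hypothetical name for the "duplicate attributes as provenance" rule applied at each relation access.

```python
import sqlite3

# Hypothetical sample data; table and column names follow the slides.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE USacc  (ID INTEGER, Owner TEXT, Balance INTEGER, Type TEXT);
    CREATE TABLE CanAcc (ID INTEGER, Owner TEXT, Balance INTEGER, Type TEXT);
    INSERT INTO USacc  VALUES (1, 'Alice', 100, 'Standard'), (2, 'Bob', 200, 'Standard');
    INSERT INTO CanAcc VALUES (1, 'Alice', 150, 'CAD'), (3, 'Carol', 300, 'CAD');
""")

ORIGINAL = "SELECT DISTINCT u.Owner FROM USacc u, CanAcc c WHERE u.ID = c.ID"

def annotate(table, first):
    """Relation-access rule (sketch): copy every attribute into a column P<n>."""
    cols = ["ID", "Owner", "Balance", "Type"]
    prov = ", ".join(f"{c} AS P{first + i}" for i, c in enumerate(cols))
    return f"(SELECT {', '.join(cols)}, {prov} FROM {table})"

# Same shape as the "Output" query on the slide.
REWRITTEN = (
    "SELECT u.Owner, P1, P2, P3, P4, P5, P6, P7, P8 "
    f"FROM {annotate('USacc', 1)} u, {annotate('CanAcc', 5)} c "
    "WHERE u.ID = c.ID"
)

original_rows = conn.execute(ORIGINAL).fetchall()
rewritten_rows = conn.execute(REWRITTEN).fetchall()
print(original_rows)   # the plain query result
print(rewritten_rows)  # each result row paired with the input tuples it came from
```

With this data, the only owner in the result is paired with the one USacc tuple (P1 to P4) and the one CanAcc tuple (P5 to P8) that joined to produce it.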

Supporting Past Queries, Updates, and Transactions
- Only needs audit log and time travel
  - Supported by most DBMSs
  - Sufficient for provenance of past queries [4]
- Our contribution
  - Sufficient for updates and transactions

[4] J. Zhang and H. Jagadish. Lost source provenance. In EDBT, 2010.

Provenance Generation and Storage Policies
- GProM default
  - Only compute provenance if explicitly requested
- User can register storage policies
  - When to store which type of provenance

    POLICY storeOnR {
      FIRE ON Query, Insert q
      WHEN Root(q) +=> Table(R)
      COMPUTE PI-CS
      STORE AS NEW TABLE
      NAMING SCHEME Hash
    }

Optimizing Rewritten Queries
- Query rewrites use atypical access patterns and operator sequences, which leads to poor execution plans
- Optimization for rewritten queries
  - Heuristic
  - Cost-based

    SELECT ID, Owner, Balance,
           CASE WHEN Balance > 1000000 THEN 'Premium' ELSE Type END AS Type,
           prov_CanAcc_ID, prov_CanAcc_Owner, prov_CanAcc_Balance, prov_CanAcc_Type,
           prov_USacc_ID, prov_USacc_Owner, prov_USacc_Balance, prov_USacc_Type
    FROM u1

    ...

    SELECT ID, Owner, Balance, 'Premium' AS Type,
           prov_CanAcc_ID, prov_CanAcc_Owner, prov_CanAcc_Balance, prov_CanAcc_Type,
           prov_USacc_ID, prov_USacc_Owner, prov_USacc_Balance, prov_USacc_Type
    FROM u1
    WHERE Balance > 1000000
    UNION ALL
    SELECT * FROM u1
    WHERE (Balance > 1000000) IS NOT TRUE

Rewrite Extensibility
- Extensible using Rewrite Specification Language (RSL)
- Concise specification of rewrite rules

    RULE mergeSelections {
      FOR q => c => g
      WHERE q->type = selection AND c->type = selection
      REWRITE INTO selection [pred = q->pred AND c->pred] => g
    }
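To make the intent of the `mergeSelections` rule concrete, here is a small sketch of the same pattern over a toy operator tree in Python. It is not RSL and not GProM's internal representation: the `Op` class, its fields, and `merge_selections` are hypothetical names chosen for illustration, and operators are assumed to have at most one input.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Op:
    type: str                      # e.g. "selection" or "table"
    pred: Optional[str] = None     # predicate text for selections
    child: Optional["Op"] = None   # single input operator (enough for this rule)
    name: Optional[str] = None     # table name for relation accesses

def merge_selections(q: Op) -> Op:
    """Rewrite the pattern q => c => g, where q and c are selections,
    into one selection with predicate (q.pred AND c.pred) over g."""
    if q.child is not None:
        # Rewrite the input first, so stacked selections collapse bottom-up.
        q = Op(q.type, q.pred, merge_selections(q.child), q.name)
    c = q.child
    if q.type == "selection" and c is not None and c.type == "selection":
        return Op("selection", f"({q.pred}) AND ({c.pred})", c.child)
    return q

tree = Op("selection", "a > 1",
          Op("selection", "b < 5",
             Op("selection", "c = 0", Op("table", name="R"))))
merged = merge_selections(tree)
print(merged.pred)        # one combined predicate
print(merged.child.name)  # the selection now sits directly on R
```

Three stacked selections collapse into a single selection directly above the table access, which is the effect the RSL rule describes.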


Provenance of Transactions

    -- u1
    INSERT INTO USacc
      (SELECT ID, Owner, Balance, 'Standard' AS Type
       FROM CanAcc
       WHERE Type = 'US_dollar');

    -- u2
    UPDATE USacc SET Type = 'Premium'
    WHERE Balance > 1000000;

    COMMIT;

Provenance of Transactions
Our Approach: Reenactment + Provenance Propagation
1. Gather Transaction Information
2. Construct Update Reenactment Query
3. Construct Transaction Reenactment Query
4. Rewrite For Provenance Computation
5. Execute Query

Currently supports
- Snapshot Isolation
- Statement-level Snapshot Isolation

1. Gather Transaction Information
- Retrieve the SQL statements of the transaction from the audit log

Update u1:
    INSERT INTO USacc
      (SELECT ID, Owner, Balance, 'Standard' AS Type
       FROM CanAcc
       WHERE Type = 'US_dollar');

Update u2:
    UPDATE USacc SET Type = 'Premium'
    WHERE Balance > 1000000;

2. Translate Updates: Reenactment
- An update reads a table version and outputs an updated table version
- Multiple versions of the database
  - Each modification of a tuple t causes a new version to be created
  - Old tuple versions are kept (SI)
- Add version annotation to provenance of each updated row
  - Use semiring model

2. Translate Updates
- Construct update reenactment query
  - Simulates the effect of the update
  - Reads the DB version seen by the update using time travel
  - Query result = updated table (annotation-equivalent)

Update u1:
    INSERT INTO USacc
      (SELECT ID, Owner, Balance, 'Standard' AS Type
       FROM CanAcc
       WHERE Type = 'US_dollar');

Reenactment query for u1:
    SELECT ID, Owner, Balance, 'Standard' AS Type
    FROM CanAcc AS OF SCN 3652
    WHERE Type = 'US_dollar'
    UNION ALL
    SELECT * FROM USacc AS OF SCN 3652;

Update u2:
    UPDATE USacc SET Type = 'Premium'
    WHERE Balance > 1000000;

Reenactment query for u2:
    SELECT ID, Owner, Balance, 'Premium' AS Type
    FROM USacc AS OF SCN 3652
    WHERE Balance > 1000000
    UNION ALL
    SELECT *
    FROM USacc AS OF SCN 3652
    WHERE (Balance > 1000000) IS NOT TRUE;

3. Construct Reenactment Query
- Simulates the whole transaction
  - Annotation-equivalent to the original transaction
- Merge reenactment queries based on the concurrency control protocol
  - Each concurrency control protocol requires a different merge process
  - SERIALIZABLE (Snapshot isolation) -> sees modifications committed before the transaction started + previous updates of the transaction
  - READ COMMITTED (Snapshot isolation) -> sees committed changes by concurrent transactions

    WITH u1 AS
      (SELECT ID, Owner, Balance, 'Standard' AS Type
       FROM CanAcc AS OF SCN 3652
       WHERE Type = 'US_dollar'
       UNION ALL
       SELECT * FROM USacc AS OF SCN 3652)
    SELECT ID, Owner, Balance, 'Premium' AS Type
    FROM u1
    WHERE Balance > 1000000
    UNION ALL
    SELECT * FROM u1
    WHERE (Balance > 1000000) IS NOT TRUE;
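One way to build intuition for steps 2 and 3 is to run the transaction and its reenactment query side by side and compare the results. The sketch below does this with Python's `sqlite3` module. It makes several assumptions that are not part of the slides: SQLite has no time travel, so frozen copies named `CanAcc_asof` and `USacc_asof` stand in for `AS OF SCN 3652`; the sample rows are invented; and the SQLite version is assumed to be recent enough (3.23 or later) to support `IS NOT TRUE`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CanAcc (ID INTEGER, Owner TEXT, Balance INTEGER, Type TEXT);
    CREATE TABLE USacc  (ID INTEGER, Owner TEXT, Balance INTEGER, Type TEXT);
    -- Hypothetical sample rows.
    INSERT INTO CanAcc VALUES (3, 'Alice Bright', 1500000, 'US_dollar'),
                              (5, 'Mark Smith', 50, 'US_dollar'),
                              (7, 'Hypothetical Owner', 900, 'CAN_dollar');
    INSERT INTO USacc  VALUES (1, 'Existing Owner', 2000000, 'Standard');
    -- Stand-ins for "AS OF SCN 3652": copies taken before the transaction runs.
    CREATE TABLE CanAcc_asof AS SELECT * FROM CanAcc;
    CREATE TABLE USacc_asof  AS SELECT * FROM USacc;
""")

# Run the original transaction: u1 (insert), then u2 (update).
conn.executescript("""
    INSERT INTO USacc
      SELECT ID, Owner, Balance, 'Standard' AS Type
      FROM CanAcc WHERE Type = 'US_dollar';
    UPDATE USacc SET Type = 'Premium' WHERE Balance > 1000000;
""")

# Transaction reenactment query: u2's reenactment reads u1's reenactment.
REENACT = """
    WITH u1 AS (
        SELECT ID, Owner, Balance, 'Standard' AS Type
        FROM CanAcc_asof WHERE Type = 'US_dollar'
        UNION ALL
        SELECT * FROM USacc_asof
    )
    SELECT ID, Owner, Balance, 'Premium' AS Type FROM u1 WHERE Balance > 1000000
    UNION ALL
    SELECT * FROM u1 WHERE (Balance > 1000000) IS NOT TRUE
"""

actual = sorted(conn.execute("SELECT * FROM USacc").fetchall())
reenacted = sorted(conn.execute(REENACT).fetchall())
print(actual == reenacted)  # the read-only query reproduces the table state
```

The reenactment query only reads the pre-transaction copies, yet it yields the same rows as the table the transaction actually modified. This sketch checks result equality only; the annotation-equivalence claimed on the slide is a stronger property that it does not test.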

4. Rewrite For Provenance Computation
- Rewrite the reenactment query to compute provenance using annotation propagation

    WITH u1 AS
      (SELECT ID, Owner, Balance, 'Standard' AS Type,
              ID AS prov_CanAcc_ID, ...
              NULL AS prov_USacc_ID, ...
              1 AS updated
       FROM CanAcc AS OF SCN 3652
       WHERE Type = 'US_dollar'
       UNION ALL
       SELECT ID, Owner, Balance, Type,
              NULL AS prov_CanAcc_ID, ...
              ID AS prov_USacc_ID, ...
              0 AS updated
       FROM USacc AS OF SCN 3652),
    ...
    u1 AS
      (SELECT ...

5. Execute Query
- Execute the query to retrieve provenance

    Updated USacc Tuples                     | Provenance from CanAcc        | Provenance from USacc
    ID | Owner        | Balance   | Type     | P1 | P2           | P3        | P4   | P5   | P6
    3  | Alice Bright | 1,500,000 | Premium  | 3  | Alice Bright | 1,500,000 | NULL | NULL | NULL
    5  | Mark Smith   | 50        | Standard | 5  | Mark Smith   | 50        | NULL | NULL | NULL
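The result table above can be reproduced on a toy database. The following sketch uses Python's `sqlite3` module and rests on several assumptions: frozen tables `CanAcc_asof` and `USacc_asof` stand in for `AS OF SCN 3652`; USacc is assumed to be empty before the transaction; only three provenance columns per source table are carried (matching P1 to P6 in the table); and the second step uses the `CASE` form of the update reenactment, passing the provenance columns through unchanged. The parts of the rewritten query that the slide elides are filled in here for illustration only and may differ from what GProM generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Frozen copies stand in for "AS OF SCN 3652" (an assumption of this sketch).
    CREATE TABLE CanAcc_asof (ID INTEGER, Owner TEXT, Balance INTEGER, Type TEXT);
    CREATE TABLE USacc_asof  (ID INTEGER, Owner TEXT, Balance INTEGER, Type TEXT);
    INSERT INTO CanAcc_asof VALUES (3, 'Alice Bright', 1500000, 'US_dollar'),
                                   (5, 'Mark Smith', 50, 'US_dollar');
""")

PROV_QUERY = """
    WITH u1 AS (
        -- Rows inserted by u1: provenance comes from CanAcc.
        SELECT ID, Owner, Balance, 'Standard' AS Type,
               ID AS prov_CanAcc_ID, Owner AS prov_CanAcc_Owner,
               Balance AS prov_CanAcc_Balance,
               NULL AS prov_USacc_ID, NULL AS prov_USacc_Owner,
               NULL AS prov_USacc_Balance
        FROM CanAcc_asof WHERE Type = 'US_dollar'
        UNION ALL
        -- Rows already in USacc: provenance comes from USacc itself.
        SELECT ID, Owner, Balance, Type,
               NULL, NULL, NULL,
               ID, Owner, Balance
        FROM USacc_asof
    )
    -- Reenactment of u2; provenance columns are propagated unchanged.
    SELECT ID, Owner, Balance,
           CASE WHEN Balance > 1000000 THEN 'Premium' ELSE Type END AS Type,
           prov_CanAcc_ID, prov_CanAcc_Owner, prov_CanAcc_Balance,
           prov_USacc_ID, prov_USacc_Owner, prov_USacc_Balance
    FROM u1
    ORDER BY ID
"""

prov_rows = conn.execute(PROV_QUERY).fetchall()
for row in prov_rows:
    print(row)
```

Each output row is a tuple of the updated USacc table followed by the CanAcc tuple it was derived from; the USacc provenance columns are NULL because no pre-existing USacc tuple contributed.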

Conclusions
- We present our vision for GProM
  - Database-independent middleware for computing provenance of queries, updates, and transactions
- First solution for provenance of transactions
- Query rewrite techniques on steroids:
  - Provenance computation
  - Transaction reenactment
  - Provenance translation
  - Provenance storage
  - Optimization
- Extensible through the RSL language

Future Work
- Implementing additional provenance types
- Comprehensive study of heuristic and cost-based optimizations
- Design and implementation of RSL
- Implementing additional provenance formats
- Study reenactment for other concurrency control mechanisms
  - Locking protocols (2PL)
- Investigate additional use cases for reenactment
  - Transaction backout
  - Retroactive what-if analysis

Questions?
Homepage:
- Bahareh: http://www.cs.iit.edu/~dbgroup/people/barab.php
- Boris: http://www.cs.iit.edu/~glavic/
- DBGroup: http://www.cs.iit.edu/~dbgroup/
GProM Project (partially funded by Oracle)
- http://www.cs.iit.edu/~dbgroup/research/oracletprov.php
Perm
- http://www.cs.iit.edu/~dbgroup/research/perm.php

References
[1] B. Glavic, R. J. Miller, and G. Alonso. Using SQL for Efficient Generation and Querying of Provenance Information. In Search of Elegance in the Theory and Practice of Computation, pages 291-320. Springer, 2013.
[2] D. Bhagwat, L. Chiticariu, W.-C. Tan, and G. Vijayvargiya. An Annotation Management System for Relational Databases. VLDB Journal, 14(4):373-396, 2005.
[3] G. Karvounarakis, T. J. Green, Z. G. Ives, and V. Tannen. Collaborative data sharing via update exchange and provenance. TODS, 38(3):19, 2013.
[4] J. Zhang and H. Jagadish. Lost source provenance. In EDBT, pages 311-322, 2010.
[5] P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. U. Nabar, T. Sugihara, and J. Widom. Trio: A System for Data, Uncertainty, and Lineage. In VLDB, pages 1151-1154, 2006.
[6] D. Gawlick and V. Radhakrishnan. Fine grain provenance using temporal databases. In TaPP, 2011.
[7] G. Karvounarakis and T. Green. Semiring-annotated data: Queries and provenance. SIGMOD Record, 41(3):5-14, 2012.

Q-Bomb
- One pattern that arises from reenactment is long chains of SELECT clauses using CASE
  - Each level references attributes from the next level multiple times
- Subquery pull-up creates expressions of size exponential in the number of SELECT clauses
  - In practice: optimization never finishes
- Minimal example using a one-row table

    SELECT CASE WHEN b < 100 THEN a ELSE a + 2 END AS a, b
    FROM (SELECT CASE WHEN b < 100 THEN a ELSE a + 2 END AS a, b
          FROM (SELECT CASE WHEN b < 100 THEN a ELSE a + 2 END AS a, b
                FROM R))

[Figures: Example Provenance Computation; Example Update Reenactment; Example Trans. Reenactment; Rewrite Reenactment Query; Execute Rewritten Query]
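The exponential growth described on the Q-Bomb slide can be illustrated by doing the pull-up by hand. The sketch below is plain Python string manipulation, not an optimizer: `pull_up` is a hypothetical helper that inlines the nested SELECT levels of the minimal example into a single expression for column a, the way a naive subquery pull-up would.

```python
def pull_up(levels: int) -> str:
    """Inline `levels` nested SELECTs into one expression for column a."""
    expr = "a"  # the innermost level reads attribute a of R directly
    for _ in range(levels):
        # Each level references the inner a twice: once per CASE branch.
        expr = f"CASE WHEN b < 100 THEN {expr} ELSE {expr} + 2 END"
    return expr

# Number of CASE expressions after inlining 1..5 levels.
sizes = [pull_up(n).count("CASE") for n in range(1, 6)]
print(sizes)
print(len(pull_up(20)))  # length in characters for a chain of 20 levels
```

Because every level duplicates the expression of the level below it, the number of CASE expressions after inlining n levels is 2^n - 1, so even a transaction with a few dozen updates produces an expression far too large to optimize.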

Types of Update Operations - Insert

    INSERT INTO R VALUES (v1, ..., vn);
      -> (SELECT * FROM R AS OF t)
         UNION ALL
         (SELECT v1 AS a1, ..., vn AS an);

    INSERT INTO R (q);
      -> (SELECT * FROM R AS OF t)
         UNION ALL
         (q(t));

Types of Update Operations - Delete

    DELETE FROM R WHERE C;
      -> SELECT * FROM R AS OF t WHERE (C) IS NOT TRUE;

Types of Update Operations - Update
- Update executed at time t
- Find tuples where the condition holds and update the attribute values
- Find tuples where NOT condition holds
- Union these two sets

    UPDATE R SET A WHERE C;
      -> (SELECT A FROM R AS OF t WHERE C)
         UNION ALL
         (SELECT * FROM R AS OF t WHERE (C) IS NOT TRUE)

READ COMMITTED
- A statement of a transaction T sees committed changes by concurrent transactions
- For a given update we need to combine
  - tuples produced by previous statements of the same transaction
  - tuples produced by transactions that committed before the update
- Observations
  - Once a transaction T modifies a tuple t, no other transaction can access t until T commits
  - Let ui be the update executed at time x of T that first modifies t
    - ui will read the latest version committed before x
    - If we know ui then updates of T before x do not have to look at t
  - Consider the database version 1 time unit (C-1) before commit of T
    - This contains all the tuple versions seen by the first update of T updating each individual tuple
    - Let t be a tuple version in this version and its start time is y
    - We know that updates from T which executed before y cannot have updated t
    - We can use version C-1 as input for reenactment as long as we hide tuple version t at y from a reenactment of an update executed at x with x < y
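Returning to the "Types of Update Operations" rules: the delete and update rules use `(C) IS NOT TRUE` for the complement of the condition, and SQL's three-valued logic is why a plain `NOT C` would not do. The following sketch shows the difference for a delete, using Python's `sqlite3` module. The table, its rows, and the copy `R_asof` (standing in for `R AS OF t`) are invented for illustration, and SQLite 3.23 or later is assumed for `IS NOT TRUE`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE R (id INTEGER, balance INTEGER);
    INSERT INTO R VALUES (1, 50), (2, 5000), (3, NULL);
    CREATE TABLE R_asof AS SELECT * FROM R;   -- stand-in for "R AS OF t"
    DELETE FROM R WHERE balance > 1000;       -- the delete to reenact
""")

after_delete = conn.execute("SELECT * FROM R ORDER BY id").fetchall()

# Reenactment rule from the slide: keep tuples where C is not true.
reenacted = conn.execute(
    "SELECT * FROM R_asof WHERE (balance > 1000) IS NOT TRUE ORDER BY id"
).fetchall()

# Naive variant: NOT C evaluates to NULL for the row with a NULL balance.
naive = conn.execute(
    "SELECT * FROM R_asof WHERE NOT (balance > 1000) ORDER BY id"
).fetchall()

print(after_delete)  # the table after the real delete
print(reenacted)     # same rows
print(naive)         # loses the row whose balance is NULL
```

The delete leaves the NULL-balance row in place because its condition is unknown rather than true, and only the `IS NOT TRUE` form keeps that row in the reenactment result.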

READ COMMITTED

    u1 AS (SELECT CASE WHEN Balance