
    Tables As a Paradigm for Querying and Restructuring* (Extended Abstract)

    Marc Gyssens† (University of Limburg), Laks V.S. Lakshmanan‡ (Concordia University), Iyer N. Subramanian‡ (Concordia University)

    Abstract

    Tables are one of the most natural representations of real-life data. Previous table-based data models (such as relational, nested relational, and complex objects models) capture only a limited variety of real-life tables. In this paper, we study the foundations of tabular representations of data. We propose the tabular database model for handling a broad class of natural data representations and develop tabular algebra as a language for querying and restructuring tabular data. We show that the tabular algebra is complete for a very general class of transformations and show that several languages designed for very different purposes can naturally be embedded into the tabular model. We also demonstrate the applicability of our model as a theoretical foundation for on-line analytical processing (OLAP), an emerging technology for complementing the robust data management and transaction processing of DBMS with powerful tools for data analysis.

    * This research was supported in part by grants from the Natural Sciences and Engineering Research Council of Canada and the Fonds Pour Formation De Chercheurs Et L'Aide A La Recherche of Quebec.

    † Dept. WNI, University of Limburg, B-3590 Diepenbeek, Belgium. E-mail: gyssens@charlie.luc.ac.be.

    ‡ Dept. of Computer Science, Concordia University, Montreal, Quebec, Canada. E-mail: {laks, subbu}@cs.concordia.ca.

    Permission to make digital/hard copies of all or part of this material for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copyright is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires specific permission and/or fee. PODS '96, Montreal Quebec Canada. © 1996 ACM 0-89791-781-2/96/06...$3.50

    1 Introduction

    Tables are one of the most natural ways in which real-life data can be represented. Indeed, the success and popularity of the relational model (see [15]) is a testimony to this. The relational model, however, only accounts for a very limited variety of tables possible. Real-world tables may have names for their columns (like relations) and rows (unlike relations), and these names need not be distinct (unlike in relations). To a limited extent, the nested relational and complex-object models (see [15]) mitigate the limitations of the relational model by allowing nesting and promoting structure sharing. These models, however, still fail to exploit the full power of tables. Figure 1 shows several databases as sets of tables representing the same (sales) data in a variety of ways, and illustrates the points made above.

    For now, we only consider the tables and parts of tables outlined in bold. The database SalesInfo1 is a relational representation of the sales data. The databases SalesInfo2-SalesInfo4 fall outside the traditional relational model in that (some) rows have names just like columns. Compared to SalesInfo1, SalesInfo2 shows the sales data organized per region, thus facilitating a quick comparison of the performance of each part in various regions. Notice that the column names (usually called attributes) in this example are not all distinct. The database SalesInfo3 shows the sales data for each combination of a part and a region. Here, row and column names are actually data! Notice that, unlike in the relational model, the width of the table Sales in both SalesInfo2 and SalesInfo3 is not fixed, but depends on the particular instance. Finally, in SalesInfo4, there is a separate table for each region. All tables in this database have the same name; their number depends on the particular instance.

    In summary, tables, as opposed to relations, offer a symmetry between rows and columns and the latitude that row and column names may occur multiply or may even be absent. Exploiting this symmetry and flexibility allows for a much broader class of natural data representations than captured by the traditional relational model.

    Having more liberal tabular representations available for databases is interesting not only from a static but also from a dynamic point of view. It has been pointed out (e.g., see [7, 8]) that many applications can significantly benefit from the integration of database systems


    (whose strength is efficient and robust on-line transaction processing (OLTP) and handling large volumes of data) with analytical tools like spreadsheets (which offer strong on-line analytical processing (OLAP) capabilities). Indeed, spreadsheets model data in the form of tables (somewhat more liberally than in the relational model) and have several powerful analytical functions built into them. Examples include row and column arithmetic, generalized aggregation on arbitrary blocks of values drawn from tables, and the ability to invoke external functions. It has been pointed out [7, 8] that an integration of relational database systems and spreadsheets will combine their complementary strengths in OLTP and OLAP, respectively, leading to a powerful environment for data processing. Such an integration calls for a powerful model and language that supports convenient restructuring of data between various tabular representations.

    In this paper, we propose the tabular database model as a general-purpose model allowing a very broad class of natural data representations, including those covered by the relational model and spreadsheets. Intuitively, a tabular database is a set of tables. Each table has a name, called the table name, which appears as the top-most left-most entry in it (see the tables of the databases SalesInfo2-SalesInfo4 in Figure 1). The other entries appearing in the top-most row are column attributes, and the other entries in the left-most column are row attributes. The remaining entries in a table can be thought of as data entries. Notwithstanding, data can also occur in the attribute positions (see the table Sales of the database SalesInfo3 in Figure 1). Both row and column attributes are optional. Whenever a certain entry is not applicable, we indicate this by the special symbol ⊥, thought of as the inapplicable null (as shown in the tables of the databases SalesInfo2 and SalesInfo3 in Figure 1).

    We also study the problems of querying and of restructuring among different tabular representations of similar data. This problem has applications in schema restructuring, view maintenance, integration of heterogeneous databases, interoperability, and integration of database systems with analytical tools such as spreadsheets, all currently active areas of research.

    The main contributions of the paper are the following. (1) We propose the tabular database model, which is very simple and yet powerful in allowing a very broad class of natural data representations. (2) We propose an algebraic language, called the tabular algebra, for querying and restructuring tabular data. (3) We develop notions of genericity and completeness that in some sense capture the class of all meaningful queries and all conceivable forms of restructuring on tabular data representation. We also prove the surprising result that our algebra above, which is essentially based on four very simple and natural forms of restructuring, is complete w.r.t. our criteria.

    We also compare our model and language with existing ones and bring out their power and generality. Among other things, we show the following. (4) The graph-based object-oriented data model GOOD recently proposed by Gyssens et al. [9, 4, 3] can be embedded within the tabular database model. In particular, every GOOD query can be expressed in the tabular algebra. This observation also provides a means to embed other models encompassed by GOOD, such as the nested and complex-object models, in the tabular model. (5) The syntactic higher-order logic-based model of SchemaLog recently proposed by Lakshmanan et al. [11, 12] can also be embedded within the tabular model. In particular, every query or restructuring transformation expressible in SchemaLog (without function symbols) can also be expressed in tabular algebra. (6) The tabular algebra can serve as a fundamental query and restructuring language for OLAP-based information systems. To our knowledge, the theoretical foundations for OLAP systems have not been clearly developed in the OLAP literature, and the tabular algebra is the first fundamental querying and restructuring language to be proposed for such systems.

    Before concluding this section, let us illustrate the flexibility and power of tables compared to relations. Revisit the tables representing sales data, shown in Figure 1. Suppose we wish to include summary data in these databases. Such data can come from, e.g., OLAP tools. The summary could include sales totals per part and per region and grand-total sales. In the relational representation (the database SalesInfo1), we are forced to store such information in separate relations. By contrast, we can easily absorb such summary data in the tables of SalesInfo2-SalesInfo4 as shown in regular outline in Figure 1. It has been pointed out that tabular representations are also useful in the context of data mining ([16]).

    Finally, as an illustration of the power of the tabular algebra, we mention that it is possible to restructure the data from any of the representations SalesInfo2-SalesInfo4 in Figure 1 to any other.

    For lack of space, this extremely terse extended abstract suppresses many of the details and all proofs. All of these can be found in the full paper ([5]).

    2 The Tabular Database Model

    In this section, we present the tabular database model. Thereto, we distinguish two sorts of symbols: $\mathcal{N}$ (called names) and $\mathcal{V}$ (called values). Names can be thought of as a generalization of relation and attribute names.


    SalesInfo1

    Sales
    Part    Region  Sold
    nuts    east    50
    nuts    west    60
    nuts    south   40
    screws  west    50
    screws  north   60
    screws  south   50
    bolts   east    70
    bolts   north   40

    TotalPartSales
    Part    Total
    nuts    150
    screws  160
    bolts   110

    TotalRegionSales
    Region  Total
    east    120
    west    110
    north   100
    south   90

    GrandTotal
    Total
    420

    SalesInfo2

    Sales   Part    Sold  Sold  Sold   Sold   Sold
    Region  ⊥       east  west  north  south  Total
    ⊥       nuts    50    60    ⊥      40     150
    ⊥       screws  ⊥     50    60     50     160
    ⊥       bolts   70    ⊥     40     ⊥      110
    Total   ⊥       120   110   100    90     420

    SalesInfo3

    Sales   nuts  screws  bolts  Total
    east    50    ⊥       70     120
    west    60    50      ⊥      110
    north   ⊥     60      40     100
    south   40    50      ⊥      90
    Total   150   160     110    420

    SalesInfo4

    Sales   Part    Sold
    Region  east    east
    ⊥       nuts    50
    ⊥       bolts   70
    Total   ⊥       120

    ...

    Sales   Part    Sold
    Region  Total   Total
    ⊥       nuts    150
    ⊥       screws  160
    ⊥       bolts   110
    Total   ⊥       420

    Figure 1: Some examples of tabular databases. The summary data (the Total rows, columns, and table, and the relations TotalPartSales, TotalRegionSales, and GrandTotal) is drawn in regular outline; everything else is outlined in bold.
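The relationship between the relational layout of SalesInfo1 and the cross-tabulated layout of SalesInfo3 can be made concrete with a small sketch. It is plain Python rather than tabular algebra, it covers only the bold (non-summary) part of the table, and the names `BOT`, `sales_info1`, and `cross_tab` are illustrative assumptions that do not appear in the paper.

```python
# Hypothetical sketch: BOT, sales_info1, and cross_tab are illustrative names.
BOT = None  # stands for the inapplicable null

# The relation Sales of SalesInfo1 as (Part, Region, Sold) tuples.
sales_info1 = [
    ("nuts", "east", 50), ("nuts", "west", 60), ("nuts", "south", 40),
    ("screws", "west", 50), ("screws", "north", 60), ("screws", "south", 50),
    ("bolts", "east", 70), ("bolts", "north", 40),
]


def cross_tab(rows):
    """Lay out (part, region, sold) tuples as a SalesInfo3-style matrix."""
    parts, regions = [], []
    for part, region, _ in rows:          # keep first-seen order
        if part not in parts:
            parts.append(part)
        if region not in regions:
            regions.append(region)
    sold = {(p, r): s for p, r, s in rows}
    table = [["Sales"] + parts]           # row 0: table name + column attributes
    for r in regions:                     # one row per region, named by the region
        table.append([r] + [sold.get((p, r), BOT) for p in parts])
    return table


sales_info3 = cross_tab(sales_info1)
for row in sales_info3:
    print(row)
```

The width of the result depends on how many distinct parts occur in the input, which is the instance-dependent width noted in the introduction.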


    To allow a broad class of data representations, we allow names to occur also in those positions in tables normally thought of as data entry positions. Similarly, we allow values to occur also in attribute positions. Our operations will be allowed to distinguish individual names while, for genericity reasons [6], they will not be allowed to distinguish individual values. In concrete examples, we shall distinguish names from values by writing names in typewriter font. As is the case in real-life tables, our tables need not have entries for every row and column combination. To deal with this possible absence of entries, we introduce the (inapplicable) null value ⊥. The set of all symbols is then $\mathcal{S} = \mathcal{N} \cup \mathcal{V} \cup \{\bot\}$.

    The presence of ⊥ requires an adapted notion of equality. Let $A$ and $B$ be two sets of symbols, i.e., $A, B \subseteq \mathcal{S}$. We say that $A$ is weakly contained in $B$, denoted $A \sqsubseteq B$, if $A \setminus \{\bot\} \subseteq B \setminus \{\bot\}$. We say that $A$ and $B$ are weakly equal, denoted $A \approx B$, if $A \sqsubseteq B$ and $B \sqsubseteq A$.

    A tabular database, now, is a set of tables. In this way, a tabular database can be thought of as a three-dimensional table. Formally, a table is a total mapping from the Cartesian product of two initial segments of the natural numbers into $\mathcal{S}$. Hence we can think of a table as a matrix. If $\tau$ is a table with row numbers $0, \ldots, m$ and column numbers $0, \ldots, n$, then $\tau$ is called a table of width $n$ and height $m$. The width and height of $\tau$ are denoted $width(\tau)$ and $height(\tau)$, respectively. For $I$, a finite sequence over $\{0, \ldots, m\}$, and $J$, a finite sequence over $\{0, \ldots, n\}$, $\tau_I^J$ denotes the subtable of $\tau$ formed by the rows and columns indicated. In particular, for $0 \le i \le height(\tau)$ and $0 \le j \le width(\tau)$, $\tau_i$ denotes the $i$th row, $\tau^j$ the $j$th column, and $\tau_i^j$ the entry $\tau(i, j)$. The sequence $(i+1)..height(\tau)$ will be abbreviated as $>i$ and the sequence $(j+1)..width(\tau)$ as $>j$. (The index position will disambiguate between the two possibilities.) In particular, $>0$ will be further abbreviated to $>$.

    $\tau_0^0$   $\tau_0^>$
    $\tau_>^0$   $\tau_>^>$

    Figure 2: Diagrammatic representation of a table.

    Using this notation in a block diagram, four regions can be distinguished in a table $\tau$, as shown in Figure 2. The entry $\tau_0^0$ is called the table name, the entries $\tau_0^>$ are called the column attributes, the entries $\tau_>^0$ are called the row attributes, and the entries $\tau_>^>$ are called the data entries.
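The definitions of this section translate almost directly into code. The following is a minimal sketch, assuming a table is stored as a list of rows and ⊥ is represented by `None`; the function names (`weakly_contained`, `regions`, and so on) are illustrative and not part of the paper.

```python
BOT = None  # hypothetical stand-in for the inapplicable null


def weakly_contained(a, b):
    """A is weakly contained in B: A minus {BOT} is a subset of B minus {BOT}."""
    return (set(a) - {BOT}) <= (set(b) - {BOT})


def weakly_equal(a, b):
    """Weak equality is weak containment in both directions."""
    return weakly_contained(a, b) and weakly_contained(b, a)


def height(table):
    """Rows are numbered 0..m, so the height m is one less than the row count."""
    return len(table) - 1


def width(table):
    """Columns are numbered 0..n, so the width n is one less than the column count."""
    return len(table[0]) - 1


def regions(table):
    """Split a table into the four regions of Figure 2."""
    return {
        "table_name": table[0][0],
        "column_attributes": table[0][1:],
        "row_attributes": [row[0] for row in table[1:]],
        "data_entries": [row[1:] for row in table[1:]],
    }


# The bold part of the table Sales in SalesInfo2.
sales = [
    ["Sales",  "Part",   "Sold", "Sold", "Sold",  "Sold"],
    ["Region", BOT,      "east", "west", "north", "south"],
    [BOT,      "nuts",   50,     60,     BOT,     40],
    [BOT,      "screws", BOT,    50,     60,      50],
    [BOT,      "bolts",  70,     BOT,    40,      BOT],
]

print(height(sales), width(sales))
print(regions(sales)["row_attributes"])
```

Note that the column attribute `Sold` occurs four times and that `Region` is a row attribute whose row holds data, both of which a relation could not express.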

    The databases SalesInfo2-SalesInfo4 in Figure 1 are examples of tabular databases satisfying the definitions given here. They will be used as a running example throughout this paper.

    The possibility of multiple occurrences of column and row attributes requires appropriate terminology. If $\tau$ is a table, $\tau_i$ a row of $\tau$, and $a \in \mathcal{S}$ some symbol, then $\tau_i(a)$ is the set of all data entries appearing in those columns named $a$, i.e., $\tau_i(a) = \{\tau_i^j \mid j > 0 \text{ and } \tau_0^j = a\}$. Whenever $\rho$ and $\sigma$ are (not necessarily distinct) tables, and $\rho_i$ and $\sigma_k$ are rows respectively of $\rho$ and $\sigma$, then $\rho_i$ is said to be subsumed by $\sigma_k$, denoted $\rho_i \preceq \sigma_k$, if, for each column attribute $a$ in $\rho$ or in $\sigma$, $\rho_i(a) \sqsubseteq \sigma_k(a)$, i.e., the set $\rho_i(a)$ is weakly contained in the set $\sigma_k(a)$. Finally, the rows $\rho_i$ and $\sigma_k$ are said to subsume each other, denoted $\rho_i \equiv \sigma_k$, if $\rho_i \preceq \sigma_k$ and $\sigma_k \preceq \rho_i$. Similar terminology can be developed for columns.

    3 The Tabular Algebra

    In this section, we describe the tabular algebra (TA), partly informally, due to space restrictions. Formal counterparts of informal descriptions can be found in the full paper [5]. The tabular algebra consists of assignment statements of the form T ← ⟨operation⟩ ⟨parameter list⟩ (⟨argument list⟩), with T a table name parameter, augmented with an iteration construct. The precise meaning of the parameters will be clarified in Section 3.6. For now, they can be considered as table names, column attributes, or sets of column attributes, respectively. The argument list is a sequence of table name parameters. Each time an assignment statement as above is invoked, the operation is executed on each sequence of tables in the database whose names match the table name parameters in the argument list.
The resulting tables are named T.

    The effect of the tabular algebra operations will sometimes be explained verbally and sometimes by means of a diagrammatic representation in the style of Figure 2. If an entry in a box of such a diagram contains position index parameters, these parameters run over all applicable values, subject to a condition in a condition box, and the corresponding entries are repeated accordingly. The range of the repetition is indicated with dashed boxes (e.g., see Figure 3). If a box contains a single entry (e.g., T or ⊥), that entry is supposed to be repeated as often as is required to match the adjacent boxes, both horizontally and vertically.

    Whichever way we choose below to explain a particular operation, we assume that $\rho$ and $\sigma$ are tables with names R and S, respectively. We then describe the effect of applying an assignment statement with that operation and argument(s) R (and S in case of a binary operation) on the tables $\rho$ and $\sigma$. For convenience, we refer to the top row of each table as its attribute row, and all other rows as data rows. A similar convention will also be used for columns.

    Figure 3: Diagrammatic representation of the effect of union, difference, and Cartesian product.

    3.1 Traditional Operations

    A first set of operations are adaptations of the traditional relational algebra operations to our setting: union, difference, Cartesian product, renaming, projection, and selection. Figure 3 shows the effect of T ← R ∪ S, T ← R \ S, and T ← R × S on the tables $\rho$ and $\sigma$. Notice that the operation is carried out on all (or all pairs of, as appropriate) tables with names R and S and the result is named T. Notice also that union and difference are defined in such a way that they always exist. Intersection is defined in terms of difference in the usual way. The classical versions of these operations can be simulated by using their tabular versions together with the redundancy removal operations, as discussed in Section 3.4. The effect of T ← RENAME$_{B \leftarrow A}$(R), T ← SELECT$_{A=B}$(R), and T ← PROJECT$_{\mathcal{A}}$(R), with $A$ and $B$ attribute parameters and $\mathcal{A}$ an attribute-set parameter, is also defined in the usual way. We merely remark that weak equality is used instead of classical equality in the definition of selection.

    3.2 Restructuring Operations

    We consider four restructuring operations: grouping, merging, splitting, and collapsing. Grouping and merging (respectively splitting and collapsing) can be seen as inverses of each other. Since the formal definitions are rather involved, we limit ourselves to giving an informal description on an example here. The formal definition can be found in the full paper [5].

    The syntax of a grouping assignment statement is T ← GROUP by $\mathcal{A}$ on $\mathcal{B}$ (R), with $\mathcal{A}$ and $\mathcal{B}$ attribute-set parameters. To see its effect, consider the grouping assignment statement Sales ← GROUP by Region on Sold (Sales) applied to the table in Figure 4, top, the obvious counterpart in the tabular model of the table Sales in the relational database SalesInfo1 of Figure 1. The resulting table, in Figure 4, bottom, is obtained as follows. (1) Its attribute row is obtained by first extracting from the attribute row of the original table the attributes different from both Region and Sold (only Part in our example), and then adding this together with as many copies of Sold as there are data rows in the original table. (2) Next, the column headed by Region is added as the first data row of the new table. (3) Finally, the data rows from top (after projecting out the Region entries) are added to the table bottom, as follows. Consider row $i$ in top. The Sold entry of this row is added under the $i$th occurrence of the Sold column in bottom, on row $i$. The remaining entries of row $i$ just added to bottom are filled up with ⊥s.

    The resulting table can be seen as a very uneconomical representation of (the bold part of) the Sales table in the tabular database SalesInfo2 in Figure 1. It is actually this latter table we had in mind as the result of grouping when we conceived this operation. To keep the definitions simple, we eventually chose the former, and defined additional operations (see Section 3.4) to obtain the latter table.

    The syntax of a merging assignment statement is T ← MERGE on $\mathcal{B}$ by $\mathcal{A}$ (R), with $\mathcal{A}$ and $\mathcal{B}$ attribute-set parameters. Applying the merging assignment statement Sales ← MERGE on Sold by Region (Sales) to the table in the tabular database SalesInfo2 in Figure 1 yields the table in Figure 5, which can be seen as an uneconomical representation of the table in Figure 4, top. This is obtained by essentially reversing the steps involved in computing the grouping operation. (The redundancy in the table of Figure 5 can be removed by selecting out the tuples with Sold entry ⊥, an operation that can be simulated using projection, transposition (Section 3.3), and difference.) Applying the above merging assignment statement to the table in Figure 4, bottom, yields a representation of the table top that is even more uneconomical. Finally, it must be emphasized that merging is defined on all tables, not just those that can be thought of as having resulted from a grouping. Also, any number of rows (columns) may be named $\mathcal{A}$ ($\mathcal{B}$) in the definition of MERGE on $\mathcal{B}$ by $\mathcal{A}$ (R).

    The syntax of a splitting assignment statement is T ← SPLIT on $\mathcal{A}$ (R), with $\mathcal{A}$ an attribute-set parameter.


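    Sales   Part    Region  Sold
    ⊥       nuts    east    50
    ⊥       nuts    west    60
    ⊥       nuts    south   40
    ⊥       screws  west    50
    ⊥       screws  north   60
    ⊥       screws  south   50
    ⊥       bolts   east    70
    ⊥       bolts   north   40

    Sales   Part    Sold   Sold   Sold   Sold   Sold   Sold   Sold   Sold
    Region  ⊥       east   west   south  west   north  south  east   north
    ⊥       nuts    50     ⊥      ⊥      ⊥      ⊥      ⊥      ⊥      ⊥
    ⊥       nuts    ⊥      60     ⊥      ⊥      ⊥      ⊥      ⊥      ⊥
    ⊥       nuts    ⊥      ⊥      40     ⊥      ⊥      ⊥      ⊥      ⊥
    ⊥       screws  ⊥      ⊥      ⊥      50     ⊥      ⊥      ⊥      ⊥
    ⊥       screws  ⊥      ⊥      ⊥      ⊥      60     ⊥      ⊥      ⊥
    ⊥       screws  ⊥      ⊥      ⊥      ⊥      ⊥      50     ⊥      ⊥
    ⊥       bolts   ⊥      ⊥      ⊥      ⊥      ⊥      ⊥      70     ⊥
    ⊥       bolts   ⊥      ⊥      ⊥      ⊥      ⊥      ⊥      ⊥      40

    Figure 4: Effect of Sales ← GROUP by Region on Sold (Sales).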
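The three informal steps of the grouping example can be sketched as follows. This is an illustration only: it assumes a single `by` attribute and a single `on` attribute, each occurring exactly once in the attribute row, whereas the operation in the full paper takes attribute sets. The names `group` and `BOT` are hypothetical.

```python
BOT = None  # hypothetical stand-in for the inapplicable null


def group(table, by, on):
    """GROUP by `by` on `on`, following steps (1)-(3) of the example.

    Simplifying assumption: `by` and `on` each name exactly one column.
    """
    attrs = table[0]
    b, o = attrs.index(by), attrs.index(on)
    keep = [j for j in range(1, len(attrs)) if j not in (b, o)]
    data = table[1:]
    n = len(data)
    # (1) Attribute row: the remaining attributes plus one copy of `on` per data row.
    result = [[attrs[0]] + [attrs[j] for j in keep] + [on] * n]
    # (2) The `by` column becomes the first data row, with `by` as its row name.
    result.append([by] + [BOT] * len(keep) + [row[b] for row in data])
    # (3) Row i keeps its other entries; its `on` entry goes under the i-th copy.
    for i, row in enumerate(data):
        spread = [BOT] * n
        spread[i] = row[o]
        result.append([row[0]] + [row[j] for j in keep] + spread)
    return result


# The table of Figure 4, top.
sales = [
    ["Sales", "Part", "Region", "Sold"],
    [BOT, "nuts", "east", 50], [BOT, "nuts", "west", 60],
    [BOT, "nuts", "south", 40], [BOT, "screws", "west", 50],
    [BOT, "screws", "north", 60], [BOT, "screws", "south", 50],
    [BOT, "bolts", "east", 70], [BOT, "bolts", "north", 40],
]

grouped = group(sales, by="Region", on="Sold")
for row in grouped:
    print(row)
```

The output has one `Sold` column per input row, which is why the text calls it an uneconomical representation until clean-up and purge are applied.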

    To see the effect of splitting, consider the splitting assignment statement Sales ← SPLIT on Region (Sales) applied to the table in Figure 4, top. The operation results in a set of tables named Sales, all of which have the same attribute row, obtained by removing Region from the attribute row of the original table. In the set, there is one table for each Region entry in the original table. The first data row of each of these tables has (the literal constant) Region as the row name, and the Region entry of the original table to which this table is associated in all other positions of this row. E.g., the table associated with east has (Region, east, east) as the first data row. The remaining data rows are projections of data rows of the original table with a matching Region entry. E.g., all rows in the original table with Region = east go into the table associated with east. The resulting set of tables is the bold part of the tabular database SalesInfo4 in Figure 1.
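The splitting example can be sketched in the same style. As before, this is a simplified illustration that assumes a single splitting attribute occurring once in the attribute row; `split` and `BOT` are hypothetical names.

```python
BOT = None  # hypothetical stand-in for the inapplicable null


def split(table, on):
    """SPLIT on `on`: one result table per distinct entry in the `on` column.

    Simplifying assumption: `on` names exactly one column.
    """
    attrs = table[0]
    c = attrs.index(on)
    keep = [j for j in range(1, len(attrs)) if j != c]
    result = {}
    for row in table[1:]:
        key = row[c]
        if key not in result:
            # Shared attribute row (without `on`), then a first data row named
            # `on` that repeats the entry this table is associated with.
            header = [attrs[0]] + [attrs[j] for j in keep]
            result[key] = [header, [on] + [key] * len(keep)]
        # Project the matching data row onto the remaining columns.
        result[key].append([row[0]] + [row[j] for j in keep])
    return list(result.values())


# The table of Figure 4, top.
sales = [
    ["Sales", "Part", "Region", "Sold"],
    [BOT, "nuts", "east", 50], [BOT, "nuts", "west", 60],
    [BOT, "nuts", "south", 40], [BOT, "screws", "west", 50],
    [BOT, "screws", "north", 60], [BOT, "screws", "south", 50],
    [BOT, "bolts", "east", 70], [BOT, "bolts", "north", 40],
]

tables = split(sales, on="Region")
for t in tables:
    print(t)
```

All result tables carry the same name, and their number depends on how many regions occur in the input, matching the description of SalesInfo4.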

    The syntax of a collapsing assignment statement is T ← COLLAPSE by $\mathcal{A}$ (R), with $\mathcal{A}$ an attribute-set parameter. Its effect is that all tables named R are first merged on $\mathcal{A}$ by all the attributes of their scheme. Then, their union is taken in the sense of Section 3.1. Applying Sales ← COLLAPSE by Region (Sales) to the tables outlined in bold in the tabular database SalesInfo4 in Figure 1 results in another (uneconomical) representation of the table in Figure 4, top, from which top can be obtained by applying the redundancy removal operations defined in Section 3.4.

    3.3 Transposition

    The tabular algebra contains two transposition operators: transposing (in sensu stricto) and switching. The effect of T ← TRANSPOSE(R) is that, for each table $\rho$ named R, a new table is created by transposing $\rho$ as a matrix and renaming the result T. Hence, column attributes become row attributes, and vice versa. The effect of T ← SWITCH$_V$(R), with $V$ an entry parameter, is that, for each table $\rho$ named R, a new table is created. If $V$ has a unique occurrence in $\rho$, say $\rho_i^j = V$, then the new table is obtained by swapping rows $0$ and $i$, and columns $0$ and $j$, and renaming the result as T; if $V$ does not have a unique occurrence in $\rho$, then the new table is obtained by simply renaming $\rho$ as T.

    For each of the operations defined in the tabular algebra, it is now possible to express in the tabular algebra a dual operation obtained by interchanging the roles of rows and columns. Using this technique and switching, it is possible to express a constant selection T ← $\sigma_{A='V'}$(R), with $A$ an attribute parameter and $V$ an entry parameter.
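Both transposition operators are easy to picture on a matrix. The sketch below is one possible reading, not the paper's formal definition: `switch` searches the whole matrix for the entry and leaves out the final renaming step, since the extended abstract does not spell out how renaming interacts with the swapped name position. The function names are hypothetical.

```python
def transpose(table, new_name):
    """Transpose the table as a matrix and rename the result."""
    result = [list(column) for column in zip(*table)]
    result[0][0] = new_name
    return result


def switch(table, v):
    """Swap rows 0 and i and columns 0 and j if v occurs exactly once, at (i, j).

    Assumption: occurrences are searched in the whole matrix; renaming is omitted.
    """
    hits = [(i, j) for i, row in enumerate(table)
            for j, entry in enumerate(row) if entry == v]
    result = [list(row) for row in table]
    if len(hits) == 1:
        i, j = hits[0]
        result[0], result[i] = result[i], result[0]
        for row in result:
            row[0], row[j] = row[j], row[0]
    return result


r = [
    ["R", "a", "b"],
    ["x", 1,   2],
    ["y", 3,   4],
]

print(transpose(r, "T"))
print(switch(r, 4))
print(switch(r, 5))  # no occurrence: the table is returned unchanged
```

After transposing, the former row attributes `x` and `y` sit in the attribute row, which is the interchange of roles that the dual operations rely on.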

    Sales   Part    Region  Sold
    ⊥       nuts    east    50
    ⊥       nuts    west    60
    ⊥       nuts    north   ⊥
    ⊥       nuts    south   40
    ⊥       screws  east    ⊥
    ⊥       screws  west    50
    ⊥       screws  north   60
    ⊥       screws  south   50
    ⊥       bolts   east    70
    ⊥       bolts   west    ⊥
    ⊥       bolts   north   40
    ⊥       bolts   south   ⊥

    Figure 5: Effect of Sales ← MERGE on Sold by Region (Sales).

    3.4 Removal of Redundancy

    The tabular algebra also contains an operation for redundancy removal: clean-up. The effect of T ← CLEAN-UP by $\mathcal{A}$ on $\mathcal{B}$ (R), with $\mathcal{A}$ and $\mathcal{B}$ attribute-set parameters, applied to a table $\rho$ with name R, is the following. For each $\mathcal{A}$-subtuple of any tuple in $\rho_>$, do the following. If all tuples $\rho_i$, $i > 0$, with the same $\mathcal{A}$-subtuple and with the row attribute $\rho_i^0$ in $\mathcal{B}$, are subsumed by a common tuple, then replace all these tuples with the least such common tuple (where "least" refers to information content). Otherwise, retain the original tuples. The result is named T. The statement Sales ← CLEAN-UP by Part on ⊥ (Sales), applied to the table in Figure 4, bottom, results in a table in which the relevant information on nuts, screws, and bolts is grouped in one row for each.

    The dual operation of clean-up (in the sense of Section 3.3) is called purge. If the statement Sales ← PURGE on Sold by Region (Sales) is applied to the table resulting from the above clean-up operation, then (the bold part of) the table Sales in SalesInfo2 is obtained. Thus, clean-up can be seen to be a generalization of duplicate (row) elimination, while purge is its dual. Classical union of (the tables representing) two union-compatible relations can be obtained by taking tabular union, followed by applications of purge to eliminate redundant columns, and then clean-up to eliminate duplicate rows.
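Clean-up rests on the row subsumption relation defined in Section 2. The following sketch implements only that subsumption test, not the clean-up operation itself, and uses a three-region excerpt of the grouped table for brevity. The names `entries`, `subsumed`, and `BOT` are illustrative assumptions.

```python
BOT = None  # hypothetical stand-in for the inapplicable null


def entries(table, i, a):
    """The set of data entries of row i in the columns named a."""
    return {table[i][j] for j in range(1, len(table[0])) if table[0][j] == a}


def weakly_contained(x, y):
    """Containment that ignores the null."""
    return (x - {BOT}) <= (y - {BOT})


def subsumed(p, i, s, k):
    """Row i of table p is subsumed by row k of table s."""
    attributes = set(p[0][1:]) | set(s[0][1:])
    return all(weakly_contained(entries(p, i, a), entries(s, k, a))
               for a in attributes)


# Excerpt of a grouped table: three rows for nuts, one Sold entry each.
grouped = [
    ["Sales",  "Part", "Sold", "Sold", "Sold"],
    ["Region", BOT,    "east", "west", "south"],
    [BOT,      "nuts", 50,     BOT,    BOT],
    [BOT,      "nuts", BOT,    60,     BOT],
    [BOT,      "nuts", BOT,    BOT,    40],
]

# The single row that a clean-up by Part would be expected to leave for nuts.
cleaned = [
    ["Sales",  "Part", "Sold", "Sold", "Sold"],
    ["Region", BOT,    "east", "west", "south"],
    [BOT,      "nuts", 50,     60,     40],
]

print(all(subsumed(grouped, i, cleaned, 2) for i in (2, 3, 4)))
print(subsumed(cleaned, 2, grouped, 2))
```

The combined row subsumes each of the three sparse rows but not the other way round, which is the sense in which it carries more information.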

    3.5 Tagging Operations and Iteration

    In order to achieve completeness, we introduce into the tabular model the possibility to create new values as well as an iteration construct. Both features are inspired by their counterparts in the relational language FO + new + while described in [3].

    The tuple tagging statement T ← TUPLENEW$_A$(R), with $A$ an attribute parameter, when applied to a table $\rho$ with name R, adds a new column to $\rho$ with column attribute name $A$ containing a distinct new value (chosen non-deterministically from $\mathcal{S}$) for each tuple of $\rho_>$. The result is named T. The set tagging statement T ← SETNEW$_A$(R), with $A$ an attribute parameter, when applied to a table $\rho$ with name R, adds a new entry to the attribute row $\rho_0$ with name $A$. The other rows of the new table are obtained by consecutively listing all non-empty subsets of $\rho_>$. Each subset corresponds to a distinct new value in the newly added column named $A$. Finally, the resulting table is called T.

    Finally, a while program has the form while R ≠ ∅ do P, with P a tabular algebra program (see below) and R a table name parameter. The semantics of such a while program is that, to each combination of tables whose names correspond to the table name parameters in the while program, the tabular program P is applied as long as the table corresponding to R contains a non-empty set of data rows.
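The two tagging statements can be sketched as below. The paper chooses new values non-deterministically; this sketch instead draws them from a deterministic generator, `fresh_values`, which is an assumption made so that the example is reproducible. Renaming the result is left out, and all function names are hypothetical.

```python
from itertools import combinations, count


def fresh_values(prefix):
    """Assumed source of new values: prefix0, prefix1, ..."""
    for n in count():
        yield f"{prefix}{n}"


def tuple_new(table, a, fresh):
    """Add a column named `a` holding a distinct new value for each data row."""
    result = [list(table[0]) + [a]]
    for row in table[1:]:
        result.append(list(row) + [next(fresh)])
    return result


def set_new(table, a, fresh):
    """List every non-empty subset of the data rows, tagged by one new value each."""
    data = table[1:]
    result = [list(table[0]) + [a]]
    for size in range(1, len(data) + 1):
        for subset in combinations(data, size):
            tag = next(fresh)  # one new value per subset
            result.extend(list(row) + [tag] for row in subset)
    return result


r = [
    ["R",  "A"],
    [None, 1],
    [None, 2],
]

tagged = tuple_new(r, "Id", fresh_values("t"))
subsets = set_new(r, "Sid", fresh_values("s"))
print(tagged)
print(subsets)
```

With two data rows, `set_new` lists three subsets and therefore four tagged rows; the output grows exponentially with the number of data rows, as any enumeration of subsets must.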

    3.6 Definition of Programs in Tabular Algebra

    A tabular algebra program consists of a sequence of tabular algebra statements and while statements as defined in the previous paragraphs.

    Recall that the assignment statements in TA are of the form T ← ⟨operation⟩ ⟨parameter list⟩ (⟨argument list⟩). We now elaborate on the syntax and semantics of the parameters that may occur in the parameter list and the argument list in the right-hand side of an assignment statement. The general syntax of ⟨parameter⟩ consists of a positive list, optionally accompanied by a negative list:

    positive list: ∅ | ⊥ | * | ⟨name⟩ {, ⟨name⟩} | (⟨parameter⟩, ⟨parameter⟩)
    negative list: ⊥ | ⟨name⟩ | ⟨name⟩ {, ⟨name⟩} | (⟨parameter⟩, ⟨parameter⟩)

    A parameter represents an entry or a set of entries, consisting of the interpretations of the items in the positive list that are not interpretations of items in the negative list. A star, possibly subscripted for distinction, is a wild card. A pair of parameters defines entries in the table under consideration by specifying column and row attribute entries.

    The parameter in the left-hand side of an assignment statement (resp., in the condition of a while program) may either be a name, or may be a wild card occurring in the right-hand side of the assignment statement (resp., the body of the while program).

    The semantics of a tabular algebra program is now as follows. All statements are executed consecutively, starting from an input database. This database is augmented during the computation. Each statement is executed for all combinations of table names in the interpretation of the corresponding parameters. If the statement is an assignment statement whose right-hand side contains a wild card or if it is a while program whose condition contains a wild card, then that wild card should be interpreted as the corresponding name in the combination of table names under consideration. In each computation, a parameter representing a single column attribute should have a singleton set as interpretation, otherwise the effect of the statement is undefined. A parameter representing a set of column attributes is interpreted in the obvious way. As is typical in such languages, programs in tabular algebra may produce "scratch" tables during computation. As a result, the names of output tables should be specified as part of the program when simulating transformations, discussed below.

    4 Main Results

    In this section, we present the main results of our work. First, we present the completeness results of TA. Then we discuss the embedding of SchemaLog, a syntactic higher-order logical model, in the tabular model. Finally, we discuss the applicability of our model for OLAP-based information systems.

    4.1 Completeness of TA

    First, we develop some formal notions for use in stating and explaining our results. If $D$ is a tabular database, we call any finite set $N \subseteq \mathcal{N}$ containing all the names of tables in $D$ a scheme for $D$. We denote by $inst(N)$ the set of all tabular databases for which $N$ is a scheme. For a tabular database $D$, $|D|$ will denote the set of symbols occurring in $D$. Following Van den Bussche et al. [3], we define various morphisms on databases as follows.

    Two tabular databases $D$ and $D'$ are called isomorphic if there exists a bijection $\phi : |D| \to |D'|$ (called an isomorphism from $D$ to $D'$) such that (i) $\phi$ is the identity on the names in $|D|$; (ii) $\phi$ is the identity on ⊥; and (iii) $\phi(D)$, where $\phi$ is extended to tabular databases in an obvious way, and $D'$ are identical up to permutations of the non-attribute rows and permutations of the non-attribute columns of the tables under consideration. If $M \subseteq |D|$, an isomorphism from $D$ to $D'$ is called an $M$-isomorphism if it is the identity on $M$. An automorphism of $D$ is an isomorphism from $D$ to itself. We denote by $Aut(D)$ the automorphism group of $D$.

    The following notion of transformation as a database mapping expressing a restructuring operation or a query is inspired by Chandra and Harel [6], Abiteboul and Kanellakis [2], and Andries, Gyssens, Paredaens, Van den Bussche, and Van Gucht [17, 3]:

    Let $N \subseteq \mathcal{N}$. A transformation $Q$ is a recursively enumerable relation $Q \subseteq inst(N) \times inst(N)$ such that (i) $Q$ is invariant under every permutation of $\mathcal{S}$ that is the identity on $\mathcal{N} \cup \{\bot\}$; (ii) $Q$ is invariant under permutations of non-attribute rows and non-attribute columns in tables; (iii) $Q(D, D')$ implies $|D| \subseteq |D'|$; (iv) $Q(D, D'_1)$ and $Q(D, D'_2)$ imply that $D'_1$ and $D'_2$ are $|D|$-isomorphic; and (v) $Q(D, D')$ implies the existence of a group homomorphism $h : Aut(D) \to Aut(D')$ such that, for every $\phi$ in $Aut(D)$, $h(\phi)$ is an extension of $\phi$.

    Condition (i) is known as genericity and formalizes the intuition that a transformation should not distinguish between non-null values that are not names; condition (ii) says that the order of rows and columns in a table is irrelevant with respect to its meaning; condition (iii) says that the set of database entries can only grow, even if entries no longer occur in a particular table; condition (iv) is known as determinacy and formalizes that transformations are only non-deterministic in the particular choices of new values (created by tagging operations); and, finally, condition (v), known as constructivity, formalizes that new values have to be related to the original values in a certain manner.

    The definition of transformation in the tabular model is very close to the definition of transformation in the relational model for which the language FO + while + new was shown to be complete [3]. To show completeness for the tabular algebra for the notion of transformation given above, we use a reduction argument. Therefore, we first note the following:

    Theorem 4.1 The language FO + while + new can be simulated within the tabular algebra.
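
    To make the "identical up to permutations" clause of the isomorphism definition (and condition (ii) on transformations) concrete, here is a minimal Python sketch. It assumes, purely for illustration, that a table is a rectangular list of lists whose cell [0][0] holds the table name, whose row 0 holds the column attributes, and whose column 0 holds the row attributes, with None standing in for the null value. The function names are hypothetical and the brute-force search is only suitable for tiny tables.

```python
from itertools import permutations

def apply_phi(table, phi):
    """Apply a value renaming phi (a dict; unmapped entries stay fixed)."""
    return [[phi.get(entry, entry) for entry in row] for row in table]

def same_up_to_permutation(t1, t2):
    """True if t2 arises from t1 by permuting non-attribute rows and
    non-attribute columns; row 0 and column 0 keep their positions.
    Assumes both tables are non-empty and rectangular."""
    if len(t1) != len(t2) or any(len(r) != len(t1[0]) for r in t1 + t2):
        return False
    n_rows, n_cols = len(t1), len(t1[0])
    for row_perm in permutations(range(1, n_rows)):
        rows = [t1[0]] + [t1[i] for i in row_perm]
        for col_perm in permutations(range(1, n_cols)):
            candidate = [[row[0]] + [row[j] for j in col_perm] for row in rows]
            if candidate == t2:
                return True
    return False

# Hypothetical example data (not taken from the paper's figures).
sales = [["Sales", "Q1", "Q2"],
         ["nuts", 50, 60],
         ["bolts", 70, None]]
shuffled = [["Sales", "Q2", "Q1"],
            ["bolts", None, 70],
            ["nuts", 60, 50]]
changed = [["Sales", "Q1", "Q2"],
           ["nuts", 60, 50],
           ["bolts", 70, None]]

# phi renames values only; names and the null value are left untouched.
phi = {50: 5, 60: 6, 70: 7}

print(same_up_to_permutation(sales, shuffled))
print(same_up_to_permutation(sales, changed))
print(same_up_to_permutation(apply_phi(sales, phi), apply_phi(shuffled, phi)))
```

    The first and third checks succeed because only the order of data rows and data columns differs; the second fails because two values were swapped within a single row, which no row or column permutation can produce.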

    The critical notion in the reduction argument mentioned above is that of a canonical representation of a tabular database. It also gives rise to a novel proof technique, which is used to show that various related models and languages can be embedded within the tabular model and algebra.

    Let $D$ be a tabular database over a scheme $N$. A canonical representation of $D$ is a relational database $R$ over the relational database scheme Rep = {Data(Tbl, Row, Col, Val), Map(Id, Entry)} with the functional dependencies Id $\to$ Entry and Tbl, Row, Col $\to$ Val, and satisfying the following property: there exists a table in $D$ with table name $t$, row attribute $r$, column attribute $c$, and entry $v$ at the intersection of that row and column if and only if there exist $id_1$, $id_2$, $id_3$, and $id_4$ for which $(id_1, t)$, $(id_2, r)$, $(id_3, c)$, and $(id_4, v)$ are in Map and $(id_1, id_2, id_3, id_4)$ is in Data. Intuitively, a canonical representation of a database assigns a unique id to every occurrence of an entry in the database. The relation Map associates to these ids the entries in the corresponding occurrences, and the relation Data associates each occurrence to the occurrences of the corresponding row attribute, column attribute, and table name. Even though tables in tabular databases may have variable width, the canonical representation always encodes the information into a (relational) database with relations of fixed width. Canonical representations are unique up to the particular choice of occurrence identifiers, justifying the phrase "the canonical representation".

    We say that a program $P$ computes a transformation $Q$ if, whenever $Q(D, D')$, there exists $D''$ such that (i) $(D, D'')$ is an input-output pair of $P$ and (ii) $Q(D, D'')$. Observe that, whenever a program $P$ computes a transformation $Q$, we have that $Q(D, D')$ for every input-output pair $(D, D')$ of $P$.

    Lemma 4.2 Let $N$ be a set of names. There exists a program $P_{Rep}$ in the tabular algebra, only dependent upon $N$, such that for every tabular database $D$ with scheme $N$, $P_{Rep}(D)$ yields (the natural representation in the tabular model of) the canonical representation of $D$.
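
    The encoding that the program of Lemma 4.2 produces can be illustrated outside the tabular algebra. The following Python sketch builds the relations Data and Map for a small database; it is not the tabular algebra program itself. It assumes a table is a rectangular list of lists with the table name in cell [0][0], column attributes in row 0, and row attributes in column 0, with None for the null value. The function name and the choice of consecutive integers as occurrence ids are illustrative assumptions.

```python
from itertools import count

def canonical_representation(tables):
    """Return (data, mapping) for a list of tables.

    mapping plays the role of Map(Id, Entry): one fresh id per occurrence.
    data plays the role of Data(Tbl, Row, Col, Val): for every data cell,
    the ids of the table name, row attribute, column attribute, and cell."""
    ids = count(1)
    data, mapping = set(), set()
    for grid in tables:
        # Assign a fresh id to every occurrence, in row-major order.
        occ = [[next(ids) for _ in row] for row in grid]
        for i, row in enumerate(grid):
            for j, entry in enumerate(row):
                mapping.add((occ[i][j], entry))
        for i in range(1, len(grid)):
            for j in range(1, len(grid[0])):
                data.add((occ[0][0], occ[i][0], occ[0][j], occ[i][j]))
    return data, mapping

# Hypothetical example table (not taken from the paper's figures).
sales = [["Sales", "Q1", "Q2"],
         ["nuts", 50, 60],
         ["bolts", 70, None]]

data, mapping = canonical_representation([sales])
for fact in sorted(data):
    print("Data", fact)
```

    Both relations have fixed width regardless of how many columns the input tables have, which is the point made in the text about variable-width tables.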

    Lemma 4.3 Let $N$ be a set of names. There exists a program $P_{Rep}^{-1}$ in the tabular algebra, only dependent upon $N$, such that for (the natural representation in the tabular model of) an instance $R$ over the relational database scheme Rep in which all named entries belong to $N$, $R = \mathrm{Rep}(P_{Rep}^{-1}(R))$.
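
    The inverse direction of Lemma 4.3 can be sketched in the same spirit: given instances of Data and Map, rebuild the tables. This Python fragment is an illustration under assumptions, not the tabular algebra program of the lemma. It assumes integer occurrence ids, the list-of-lists table layout (name in cell [0][0], column attributes in row 0, row attributes in column 0), and None for the null value. Rows and columns are emitted in id order, which is harmless because their order carries no meaning in the tabular model.

```python
def decode(data, mapping):
    """Rebuild tables from Data(Tbl, Row, Col, Val) and Map(Id, Entry)."""
    entry = dict(mapping)  # well defined thanks to the dependency Id -> Entry
    layout = {}
    for tbl, row, col, val in sorted(data):
        rows, cols, cells = layout.setdefault(tbl, ([], [], {}))
        if row not in rows:
            rows.append(row)
        if col not in cols:
            cols.append(col)
        cells[(row, col)] = val
    tables = []
    for tbl, (rows, cols, cells) in layout.items():
        header = [entry[tbl]] + [entry[c] for c in cols]
        body = [
            [entry[r]] + [entry[cells[(r, c)]] if (r, c) in cells else None
                          for c in cols]
            for r in rows
        ]
        tables.append([header] + body)
    return tables

# A hand-written instance over the scheme Rep (hypothetical example data).
data = {(1, 4, 2, 5), (1, 4, 3, 6), (1, 7, 2, 8), (1, 7, 3, 9)}
mapping = {(1, "Sales"), (2, "Q1"), (3, "Q2"), (4, "nuts"), (5, 50),
           (6, 60), (7, "bolts"), (8, 70), (9, None)}

for row in decode(data, mapping)[0]:
    print(row)
```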

    Although we cannot elaborate on the proof here, it is remarkable that the rather small set of restructuring operations in the tabular algebra, which were devised for restructuring purposes and not for completeness purposes, is sufficient to establish the above results.

    We now state our main result:

    Theorem 4.4 The tabular algebra programs compute precisely the transformations.

    The main idea behind the proof is as follows. Let $Q$ be a transformation such that $Q(D, D')$. Let $\mathrm{Rep}(D)$ and $\mathrm{Rep}(D')$ be canonical representations. By the above lemmas, we know we can compute $D$ from (a tabular representation of) $\mathrm{Rep}(D)$ up to permutations of rows and columns, independent of the particular database $D$. Let $Q_1$ be the corresponding transformation. Similarly, let $Q_2$ be the transformation computing (a tabular representation of) $\mathrm{Rep}(D')$ from $D'$, independent of the particular database $D'$. Now the composition that applies $Q_1$, then $Q$, then $Q_2$, considered as a relational database mapping, is constructive in the sense of [3] and can therefore be expressed by a program $P'$ in FO + while + new. Let $P$ be the corresponding program in the tabular algebra, which exists by Theorem 4.1. Then $P_{Rep}$, followed by $P$, followed by $P_{Rep}^{-1}$, computes $Q$.

    For the complete details of the proof, we refer to [5].

    The proof sketch of our completeness results also yields a normal form for programs and transformations, by going via the canonical representations. It goes without saying, however, that this is not the way to proceed in practice. The tabular algebra is sufficiently rich to compute most of the transformations occurring in practice in a much more direct way. By the same token, not all operations introduced here are necessary to achieve completeness. The point, however, is that we would like tabular algebra to be rich so that useful transformations can be expressed directly at a high level. We anticipate that the traditional, restructuring, and redundancy removal operations, together with transposition, would be sufficient for most useful transformations arising in practice.
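
    The reduction in the proof sketch can be summarized as a chain of three steps. The display below is only a reading aid. The primes and subscripts are assumed notation: $P_{Rep}$ and $P_{Rep}^{-1}$ stand for the programs of Lemmas 4.2 and 4.3, and $P$ for the tabular algebra program that simulates the FO + while + new program for the induced relational mapping.

```latex
% Requires amsmath for \xrightarrow; the symbol names are assumptions.
\[
  D \;\xrightarrow{\;P_{Rep}\;}\; \mathrm{Rep}(D)
    \;\xrightarrow{\;P\;}\; \mathrm{Rep}(D')
    \;\xrightarrow{\;P_{Rep}^{-1}\;}\; D'
  \qquad \text{where } Q(D, D').
\]
```

    Read left to right, the input is first encoded into its canonical representation, then transformed at the relational level, and finally decoded back into a tabular database.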

    4.2 Embedding SchemaLog into the Tabular Model

    Lakshmanan et al. [11, 12] proposed a higher-order logic called SchemaLog, and more recently an extension to SQL called SchemaSQL [13], inspired by SchemaLog, for facilitating interoperability in a federation of databases. The SchemaLog data model is essentially the relational model, with the following differences: (i) tuple ids and relation and attribute names are first-class citizens in the SchemaLog data model; and (ii) variable-width relations (e.g., see Figure 1) are possible in the SchemaLog data model. Interestingly, it has been demonstrated that the language of SchemaLog possesses (querying and) restructuring capabilities. These observations suggest that the SchemaLog model has features similar to the tabular model and that a comparison between these models would be worthwhile.

    As noted above, SchemaLog was proposed in the context of multi-database interoperability. Our primary concern, however, is restructuring and querying of individual databases. To facilitate a comparison, we therefore consider a stripped-down version of SchemaLog appropriate for interaction with individual databases. For convenience, we refer to this language as SchemaLog$_d$. Atomic formulas in SchemaLog$_d$ are of the form Rel[Tid: Attr $\to$ Value], where Rel, Tid, Attr, and Value are constants or variables; in addition, there are atoms formed using the standard built-in predicates and programming predicates (see [11]).

    Clearly, every SchemaLog relation can be readily represented as a table in the tabular model. We were able to prove the following result:

    Theorem 4.5 For every program $P$ in SchemaLog$_d$ there is an equivalent program in the tabular algebra.

    Due to space restrictions, we refer to [5] for the proof. It may be argued that SchemaLog$_d$ programs essentially express transformations, so Theorem 4.5 is really a corollary of Theorem 4.4. However, such an argument offers little insight into the way such transformations can be simulated in tabular algebra, as the resulting program would be too low-level. Our proof in [5] essentially gives a procedure for obtaining the equivalent TA program at a high level.

    We note that it is a simple matter to extend the tabular model and algebra in a way that accounts for a federation of (tabular) databases. Such an extended language would trivially subsume SchemaLog (without function symbols). Notice that even though function symbols are not directly supported in the tabular model, nothing is lost in terms of expressive power, because of the completeness result in the previous theorem.
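
    The remark at the start of this subsection, that every SchemaLog relation can be represented as a table, can be illustrated with a short Python sketch. The encoding below is one plausible choice and is not claimed to be the construction used in [5]. It assumes that ground atoms of the form Rel[Tid: Attr → Value] are given as 4-tuples, that tuple ids become row attributes and attribute names become column attributes, and that None stands in for the null value in cells where a tuple has no value for an attribute. The function name is hypothetical.

```python
def atoms_to_table(rel, atoms):
    """Build a table (list of lists) for relation `rel` from ground atoms.

    Each atom is a tuple (rel, tid, attr, value). Cell [0][0] holds the
    relation name, row 0 the attribute names, column 0 the tuple ids."""
    tids, attrs, cells = [], [], {}
    for r, tid, attr, value in atoms:
        if r != rel:
            continue  # atom belongs to another relation
        if tid not in tids:
            tids.append(tid)
        if attr not in attrs:
            attrs.append(attr)
        cells[(tid, attr)] = value
    header = [rel] + attrs
    body = [[tid] + [cells.get((tid, a)) for a in attrs] for tid in tids]
    return [header] + body

# Hypothetical atoms; tuple t2 has no "dept" value (variable width).
atoms = [
    ("emp", "t1", "name", "ann"),
    ("emp", "t1", "dept", "cs"),
    ("emp", "t2", "name", "bob"),
    ("dept", "t3", "head", "ann"),
]

for row in atoms_to_table("emp", atoms):
    print(row)
```

    Because relation names, attribute names, and tuple ids all end up as ordinary table entries, they can be queried and restructured like any other data, which mirrors their first-class status in SchemaLog.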

    4.3 The Tabular Model as a Fundamental Basis for the OLAP Model

    The relational model, while supporting efficient data management and robust on-line transaction processing (OLTP), provides little support for on-line analysis of data [7, 8]. To overcome this deficiency, Codd has recently proposed [7, 14] a data model called OLAP (for on-line analytical processing). Some of the major highlights of the OLAP model are the following: (i) whereas the relational model organizes data along one dimension (i.e., as a set of tuples), the OLAP model allows data to be stored in the form of (n-dimensional) matrices; and (ii) compared to the relational model, the OLAP model permits the efficient computation and storage of summary information on the data. Both features (the first for n = 2) can also be found in the tabular model and, in fact, are already illustrated by the examples in Figure 1.

    Some of the drawbacks of the OLAP technology as it stands today are the following: (i) unlike the relational model, the OLAP model has no stable theoretical foundation and many concepts therein are used rather loosely (e.g., see [7, 8, 14]); (ii) no languages comparable to relational algebra or calculus have been developed, and whatever operations are referred to in the literature have no clear definition. Some loose proposals for SQL-like languages do exist, but are ad hoc and have no formal basis; and, finally, (iii) the integration with relational technology relies on ad hoc procedures for converting between the two models rather than on any fundamental principles.

    The tabular model and language, studied for two dimensions in this paper, can be easily generalized to n dimensions. At a conceptual level, the tabular model is more general than both the relational and OLAP models and can serve as a common ground between them. At an implementation level, OLAP technology can be used as an efficient implementation of the physical scheme associated with a tabular database. Because of the natural fit between (2- or n-dimensional) tables and OLAP matrices, tabular algebra can be used as a fundamental querying and restructuring language for OLAP technology. Tabular algebra, being a complete language, provides a mechanism for restructuring OLAP matrices in all meaningful ways, including relational representations.

    5 Summary and Future Research

    We proposed the tabular database model, which is simple but powerful in allowing a very broad class of natural data representations. We developed tabular algebra and showed that it is generic and complete for querying and restructuring of tables. Well-known data models such as GOOD and (a large fragment of) SchemaLog can be embedded within the tabular model. The tabular model and algebra, while offering enough expressive power to express all transformations, also provide simple and intuitive operations for expressing transformations in high-level terms. Tabular algebra (2- and n-dimensional versions) can serve as a complete querying and restructuring language for OLAP, a technology with great application potential and of considerable interest to the database community, which has, until recently, lacked clear foundations.

    Tabular algebra is presently being implemented on top of Microsoft Access and Excel, providing a seamless integration of relations and spreadsheets. In ongoing work, we are developing additional derived operations so that more transformations can be expressed at a high level in tabular algebra. Query (and program) optimization is an important issue. In the direction of OLAP, tabular algebra covers only the aspect of restructuring. We are presently working on operations corresponding to classification and summarization, two other important functionalities for OLAP. We also intend to extend our implementation above to cover all OLAP functionalities.


    References

    [1] ACM. ACM Computing Surveys, volume 22, Sept 1990. Special issue on HDBS.

    [2] S. Abiteboul and P. Kanellakis. Object identity as a query language primitive. In J. Clifford, B. Lindsay, and D. Maier, editors, Proceedings of the 1989 ACM SIGMOD International Conference on the Management of Data, volume 18:2 of SIGMOD Record, pages 159-173. ACM Press, 1989. Full version to appear in Journal of the ACM.

    [3] J. Van den Bussche, D. Van Gucht, M. Andries, and M. Gyssens. On the completeness of object-creating database transformation languages, 1994. Manuscript. Preliminary version appeared in FOCS'92 and PODS'90.

    [4] J. Van den Bussche, D. Van Gucht, M. Andries, and M. Gyssens. On the completeness of object-creating query languages. 33rd Symposium on Foundations of Computer Science, 1992.

    [5] M. Gyssens, L. V. S. Lakshmanan, and I. N. Subramanian. Tables As a Paradigm for Querying and Restructuring. Technical report, Concordia University, Montreal, Nov 1995.

    [6] A. K. Chandra and D. Harel. Computable queries for relational data bases. Journal of Computer and System Sciences, 21:156-178, 1980.

    [7] E. F. Codd, S. B. Codd, and C. T. Salley. Providing OLAP (on-line analytical processing) to user-analysts: An IT mandate, 1995. White paper. URL: http://www.arborsoft.com/papers/coddTOC.html.

    [8] R. Finkelstein. Understanding the need for on-line analytical servers, 1995. White paper. URL: http://www.arborsoft.com/papers/finkTOC.html.

    [9] M. Gyssens, J. Paredaens, and D. Van Gucht. A graph-oriented object database model. In ACM Symp. Principles of Database Systems, pages 417-424, 1990.

    [10] A. R. Hurson, M. W. Bright, and S. Pakzad. Multidatabase Systems: An Advanced Solution for Global Information Sharing. IEEE Computer Society, Los Alamitos, CA, 1994. Collection of papers.

    [11] L. V. S. Lakshmanan, F. Sadri, and I. N. Subramanian. On the logical foundations of schema integration and evolution in heterogeneous database systems. In Proc. 3rd International Conference on Deductive and Object-Oriented Databases (DOOD'93). Springer-Verlag, LNCS 760, December 1993.

    [12] L. V. S. Lakshmanan, F. Sadri, and I. N. Subramanian. Logic and Algebraic Languages for Interoperability in Multi-database Systems. Technical report, Concordia University, Montreal, March 1995. (Accepted to Journal of Logic Programming, February 1996.)

    [13] L. V. S. Lakshmanan, F. Sadri, and I. N. Subramanian. SchemaSQL - A Language for Querying and Restructuring Multidatabase Systems. Technical report, Concordia University, Montreal, February 1996. (Submitted for publication.)

    [14] Pilot Software. An introduction to OLAP, 1995. URL: http://www.pilotsw.com//pilotO13.htm.

    [15] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, Reading, MA, 1995.

    [16] J. Han. Personal communication, Sept 1995.

    [17] M. Andries and J. Paredaens. On instance-completeness of database query languages involving object creation. Journal of Computer and System Sciences. To appear. See also: A language for generic graph-transformations, Lecture Notes in Computer Science vol. 570, pp. 63-74.
