[IEEE 22nd International Conference on Data Engineering (ICDE'06) - Atlanta, GA, USA (2006.04.3-2006.04.7)] 22nd International Conference on Data Engineering (ICDE'06) - Making Designer Schemas with Colors

  • Making Designer Schemas with Colors

    Nuwee Wiwatwattana (U of Michigan), H. V. Jagadish (U of Michigan), Laks V. S. Lakshmanan (U of British Columbia), Divesh Srivastava (AT&T Labs-Research)

    nuwee@eecs.umich.edu jag@eecs.umich.edu laks@cs.ubc.ca divesh@research.att.com

    Abstract

    XML schema design has two opposing goals: elimination of update anomalies requires that the schema be as normalized as possible; yet higher query performance and simpler query expression are often obtained through the use of schemas that permit redundancy. In this paper, we show that the recently proposed MCT data model, which extends XML by adding colors, can be used to address this dichotomy effectively. Specifically, we formalize the intuition of anomaly avoidance in MCT using notions of node normal and edge normal forms, and the goal of efficient query processing using notions of association recoverability and direct recoverability. We develop algorithms for transforming design specifications given as ER diagrams into MCT schemas that are in a node or edge normal form and satisfy association or direct recoverability. Experimental results using a wide variety of ER diagrams validate the benefits of our design methodology.

    1 Motivation

    As XML has evolved from a document markup language to a widely-used format for exchange of structured and semistructured data, managing large amounts of XML data has become increasingly important. Since schema gives meaning to data, and determines the validity of queries and updates against the data, recently XML schema design issues have been investigated [2, 3, 15]. Fundamentally, XML schema design has two opposing goals: elimination of update anomalies requires that the schema be as normalized as possible; yet simpler query specification and higher query performance may both usually be obtained through the use of schemas that permit redundancy.¹

    Consider, for example, the ER diagram of the TPC-W benchmark in Figure 1, where attributes are suppressed but can be readily imagined. There is a straightforward way of transforming such an ER diagram to an XML schema (see Figure 2), where entity types from the ER diagram (see Figure 1) are made children of the schema root, relationship types are made children of one of their participating entity types, and all remaining associations are captured through id/idrefs attribute values (indicated in the figure using directed edges). In such a normalized XML schema, update anomalies are avoided at the expense of poor query performance. Queries like Q1: "list the orders placed by customers having addresses in Japan" require an XQuery expression involving multiple (id/idrefs) value-based joins, and can be expensive to evaluate. One might argue that this schema design is needlessly shallow, as it fails to leverage the nested (tree) structure permitted by XML. Does the tradeoff between opposing design goals still exist when we exploit XML's nesting?

    Supported in part by NSF under Grants No. IIS-0219513 and IIS-0438909.

    ¹ Ideally, schema design should be completely divorced from physical implementation and hence should be separated from performance concerns. Realistically, physical implementation mimics the given schema, but seeks higher performance only through adding indices, materialized views, and other such auxiliary structures. So the choice of schema has a huge impact on performance.

    Figure 1. ER diagram of TPC-W benchmark. Attributes suppressed for brevity. [Node labels: author, item, order_line, order, customer, address, country, credit_card_transaction, write, occur_in, make, has, in, shipping, billing, associate.]

    Proceedings of the 22nd International Conference on Data Engineering (ICDE06) 8-7695-2570-9/06 $20.00 2006 IEEE

    Figure 2. Shallow XML schema from TPC-W.

    Figure 3. Anomaly-free XML schema from TPC-W.

    To appreciate this issue, consider the XML schema of Figure 3. In an XML database that conforms to this schema, an order element would be nested under the element of the customer who made the order, which would be nested under the customer's address element, which would be nested under the corresponding country element. Given the 1:m relationships inherent between relevant entity types in the ER diagram of Figure 1, no data would be duplicated in this XML database, and hence update anomalies would be avoided. Further, our previous query Q1 has both a simple expression:²

    /country[@name = "Japan"]//order

    and an efficient evaluation strategy using structural joins, which have been shown to be much more efficient than value-based joins [1, 7].
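To make the structural form of Q1 concrete, here is a minimal Python sketch using the XPath subset of the standard `xml.etree.ElementTree` module. The tiny instance document, its `id` values, and the omission of intermediate relationship elements (such as `has` or `make`) are assumptions made for illustration; they are not taken from the paper's Figure 3.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance that follows the nesting described for Figure 3:
# country > address > customer > order (relationship elements omitted).
DOC = """
<root>
  <country name="Japan">
    <address id="a1">
      <customer id="c1">
        <order id="o1"/>
        <order id="o2"/>
      </customer>
    </address>
  </country>
  <country name="Canada">
    <address id="a2">
      <customer id="c2">
        <order id="o3"/>
      </customer>
    </address>
  </country>
</root>
"""

root = ET.fromstring(DOC)

# Q1 as a purely structural path: a predicate on country, then a descendant step.
q1 = [o.get("id") for o in root.findall("./country[@name='Japan']//order")]
print(q1)
```

No id/idref comparison is needed here: the nesting itself encodes which orders belong to customers with addresses in Japan.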

    The single-tree structure of XML, however, imposes many limitations. For example, in the XML schema of Figure 3, the billing association between order and address is encoded as attribute values (indicated using a directed edge in the figure) in corresponding elements. This makes many other queries more cumbersome to express and expensive to evaluate. Consider Q2: "list the orders placed with billing addresses in Japan":

    for $o in //order,
        $a in /country[@name = "Japan"]//address
    where $o/billing/@bill_address_idref = $a/@id
    return $o
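The value-based join in Q2 can be mimicked in Python as follows. This is only a sketch under assumptions: the instance document and its `id` values are invented for illustration, and the `billing` element with a `bill_address_idref` attribute follows the query above rather than a full reading of Figure 3.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance: each order carries a billing child whose
# bill_address_idref attribute points at an address id.
DOC = """
<root>
  <country name="Japan">
    <address id="a1">
      <customer id="c1">
        <order id="o1"><billing bill_address_idref="a2"/></order>
      </customer>
    </address>
  </country>
  <country name="Canada">
    <address id="a2">
      <customer id="c2">
        <order id="o2"><billing bill_address_idref="a1"/></order>
      </customer>
    </address>
  </country>
</root>
"""

root = ET.fromstring(DOC)

# Structural part: ids of all addresses nested under Japan.
japan_address_ids = {
    a.get("id") for a in root.findall("./country[@name='Japan']//address")
}

# Value-based part: compare each order's idref against those ids.
q2 = [
    o.get("id")
    for o in root.iter("order")
    if any(b.get("bill_address_idref") in japan_address_ids
           for b in o.findall("billing"))
]
print(q2)
```

In this instance the order nested under the Japanese address is billed to Canada and vice versa, so the structural nesting alone gives the wrong answer for Q2; the attribute comparison is what recovers the billing association.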

    A different XML schema design to the one in Figure 3, also without update anomalies, could have nested an order under the address based on the billing address association. This would have simplified the expression of Q2, but at the expense of Q1.

    ² We use XPath and XQuery for stating XML queries.

    Figure 4. Deep XML schema from TPC-W with data redundancy. The XML tree will be obtained by traversing this graph from the root, and permitting multiple occurrences of elements.

    It is, of course, feasible to design XML schemas that allow easy expression, using XPath and XQuery, of a large variety of queries, as well as their efficient evaluation, but this is at the expense of extreme data redundancy. Figure 4 is an example of such a schema: many queries are captured using XPath, but there can be a great deal of redundancy in the representation of various types of address, country, item, and author elements. These examples suggest that the desirable goal of a schema that is beneficial for both updates and queries might be elusive for the XML model. It is clear that the single-tree structure of XML does not suffice for meeting this goal. But what about XML-like models, such as the recently proposed MCT (multi-colored trees) logical model for XML databases [13], which extends the XML model by supporting multiple tree structures (each in a different color) overlaid on the same set of XML data elements? In this paper, we investigate the following foundational question for XML/MCT databases:

    Given an ER diagram, is it possible to design MCT schemas where (i) update anomalies can be avoided, and (ii) all associations explicitly encoded in the ER diagram can be expressed using structural (XPath-like) expressions?

    We show, surprisingly, that this is indeed possible, and we propose a novel schema design methodology for achieving the twin goals of update anomaly avoidance and ease of query expression plus evaluation efficiency, making use of the MCT data model. Our starting point is the design specification, expressed in the form of an ER diagram. In particular, for the TPC-W ER diagram of Figure 1, Figure 5 shows an MCT schema that achieves this goal. We make the following contributions in this paper:

    • We formally define the types of associations between data items encoded in an ER diagram that need to be expressed using structural (XPath-like) expressions.

    • We define a number of desirable properties for the target logical MCT schema to satisfy:

        - association recoverability: no explicit value-based comparisons should be needed to recover any associations between data items encoded in the ER diagram, and direct recoverability: an aggressive version of association recoverability, which says that certain binary associations need to be captured by a single XPath axis step; these formalize the goal of expression ease and efficient evaluation of queries.

        - node normalization: no entity or relationship should be present more than once in any colored tree in an MCT/XML database, and edge normalization: no association should be present in more than one color; these formalize the goal of update anomaly avoidance.

        - instance independence: the number of colors should be independent of the database instance, and color frugality: using few colors reduces the cost/overhead that comes with maintaining multiple colors.

    • We develop algorithms for translating an ER diagram into an MCT schema satisfying the various desirable properties. Not all combinations of properties are achievable at once. We formally show which combinations are feasible, and show that our algorithms achieve those goals.

    • Experimental results using a wide variety of ER diagrams validate the benefits of our design methodology.

    Figure 5. An MCT schema for the ER graph of the TPC-W schema in Figure 1. Nodes/edges shown repeated per color for clarity, but are represented/stored once. [Colored trees shown: GREEN, BLUE, RED, PURPLE, ORANGE.]

    2 Preliminaries

    2.1 ER Diagram and ER Graph

    The entity-relationship (ER) model, first introduced by Chen [9], is widely used for conceptual modeling in database design. Over time, many different flavors of ER have emerged. For concreteness, the version we reference in this paper is from Elmasri and Navathe [10].

    Specifically, we consider simplified ER diagrams, that contain only entity types, binary relationship types between distinct entity or relationship types, and atomic attributes. Arbitrary ER diagrams can be translated into such simplified ER diagrams by applying simple transformations [20].

    For our translation purposes, we will find it convenient to regard a simplified ER diagram as an undirected graph, called the ER graph, with one node corresponding to each entity and each relationship type, and an edge between a pair of nodes whenever they are adjacent in the ER diagram. Node labels and edge labels in the ER graph carry the desired semantic information from the ER diagram.
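As an illustration of the ER graph construction, the following Python sketch turns a small fragment of a simplified ER diagram into an undirected adjacency structure. The fragment (customer, address, country, order with the relationship types has, in, make) is an assumption pieced together from the running example, and the function name `build_er_graph` is hypothetical; cardinality labels are left out.

```python
from collections import defaultdict

# Assumed fragment of the TPC-W ER diagram: each binary relationship type
# is listed with its two participating types.
ENTITIES = {"customer", "address", "country", "order"}
RELATIONSHIPS = {
    "has": ("customer", "address"),
    "in": ("address", "country"),
    "make": ("customer", "order"),
}


def build_er_graph(entities, relationships):
    """Return an undirected graph with one node per entity and relationship type."""
    graph = defaultdict(set)
    for entity in entities:
        graph[entity]  # make sure isolated entity types still get a node
    for rel, participants in relationships.items():
        for participant in participants:
            # A relationship node is adjacent to each of its participants.
            graph[rel].add(participant)
            graph[participant].add(rel)
    return dict(graph)


er_graph = build_er_graph(ENTITIES, RELATIONSHIPS)
for node in sorted(er_graph):
    print(node, "->", sorted(er_graph[node]))
```

Relationship types become nodes in their own right, so two entity types are never directly adjacent; they are always connected through a relationship node.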

    2.2 MCT: Data Model and Query Language

    The multi-colored trees (MCT) logical data model [13] is an evolutionary extension of the XML data model of [11], and enhances it in two significant ways:

    • Each data node has an additional property, referred to as a color, and nodes can have one or more colors from a finite set of colors.

    • An MCT database consists of one or more colored trees $T_1, \ldots, T_k$, where $k$ is the number of colors.

    Essentially, a single colored tree is just like an XML tree. Allowing for multiple colored trees permits richer semantic structure to be added over the individual nodes in the database. In an MCT database, a node belongs to exactly one rooted tree for each of its colors.

    MCT databases can be naturally queried using extensions of XPath [5] and XQuery [6]. In particular, each axis step in a path expression needs to be augmented with a color, identifying the colored tree in which the navigation is desired. We refer to this as the multi-colored version of XPath. This suffices for the purpose of this paper. For more details, see [13].
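The idea of a color-augmented axis step can be sketched with a small in-memory structure. Everything below is hypothetical: the class `MCTNode`, the function `descendants`, the color names, and the toy instance are illustrative assumptions, not the data structures or syntax of [13].

```python
class MCTNode:
    """A data node that may take part in several colored trees."""

    def __init__(self, label, name):
        self.label = label
        self.name = name
        self.children = {}  # color -> ordered list of children in that tree
        self.parent = {}    # color -> parent in that tree

    def add_child(self, color, child):
        # A node belongs to exactly one rooted tree per color,
        # so it can have at most one parent in each color.
        if color in child.parent:
            raise ValueError(f"{child.name} already has a {color} parent")
        self.children.setdefault(color, []).append(child)
        child.parent[color] = self


def descendants(node, color, label):
    """Colored descendant step: follow only edges of the given color."""
    found = []
    stack = list(reversed(node.children.get(color, [])))
    while stack:
        current = stack.pop()
        if current.label == label:
            found.append(current)
        stack.extend(reversed(current.children.get(color, [])))
    return found


# Toy instance: the same nodes, two overlaid trees.
japan = MCTNode("country", "Japan")
a1 = MCTNode("address", "a1")
c1 = MCTNode("customer", "c1")
o1 = MCTNode("order", "o1")
o2 = MCTNode("order", "o2")

# Assumed "blue" tree: nesting by the customer who made the order.
japan.add_child("blue", a1)
a1.add_child("blue", c1)
c1.add_child("blue", o1)
c1.add_child("blue", o2)

# Assumed "red" tree: nesting by billing address.
japan.add_child("red", a1)
a1.add_child("red", o2)

q1 = [n.name for n in descendants(japan, "blue", "order")]
q2 = [n.name for n in descendants(japan, "red", "order")]
print(q1, q2)
```

Both Q1 and Q2 become single structural steps here; the color passed to the step selects which overlaid tree is navigated, and each node object is stored once regardless of how many trees it appears in.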

    Figure 6. An Example Association Graph. [Nodes: country, customer, item. Edge labels: has.address.in, make.order.orderline.]

    2.3 MCT: Schema

    Informally, a multi-colored XML schema is a set of XML schemas, one for each color, along with possible inter-color integrity constraints (ICICs). Formally, an MCT schema is a tuple $(N, k, E_1, \ldots, E_k, I)$, where:

    • $N$ is a set of labeled nodes as in an XML schema,

    • $k$ is the number of colors,

    • each $E_i$ is a set of edges that define an ordered labeled graph on $N$, and

    • $I$ is a set of ICICs, where each ICIC is a tuple of edges of the form $(e_{i_1}, \ldots, e_{i_m})$, where $m \le k$ (the number of colors), $e_{i_j} \in E_{i_j}$, $i_j \ne i_l$ whenever $j \ne l$, where each edge $e_{i_j}$ is between the same pair of nodes, say $u, v$, but in distinct colors.

    Intuitively, an ICIC says that in any valid database instance either the edge between the nodes $u$ and $v$ must be present in all colors $i_1, \ldots, i_m$, or it must be absent in all colors. For example, in Figure 5, there is an edge between nodes order and shipping in each of the colors blue, red, purple, and green. An ICIC on the corresponding tuple of edges signifies that the association between an order element and a corresponding shipping element should either be reflected in all four colors or in none at all. Otherwise, there would be an inconsistency between the information encoded in these colored trees. In contrast, there is just one edge between the nodes make and order, which is present in blue. Accordingly, for this edge there is no associated inter-color integrity constraint.
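The all-or-none reading of an ICIC can be checked mechanically on an instance. The sketch below is an assumption-laden illustration: the function `icic_violations`, the representation of instance edges as unordered pairs of element ids, and the sample data are all hypothetical; only the four color names and the order/shipping example come from the text above.

```python
def pair(u, v):
    """Unordered instance edge between two element ids."""
    return frozenset((u, v))


def icic_violations(instance_edges, icic_colors):
    """Return {edge: missing colors} for edges present in some but not all colors.

    instance_edges maps a color to the set of instance edges that realize
    one schema edge (for example order-shipping) in that colored tree.
    """
    required = set(icic_colors)
    present_in = {}
    for color in required:
        for edge in instance_edges.get(color, set()):
            present_in.setdefault(edge, set()).add(color)
    return {
        edge: required - colors
        for edge, colors in present_in.items()
        if colors != required
    }


# Hypothetical instance edges for the order-shipping schema edge.
edges = {
    "blue": {pair("o1", "s1"), pair("o2", "s2")},
    "red": {pair("o1", "s1"), pair("o2", "s2")},
    "purple": {pair("o1", "s1"), pair("o2", "s2")},
    "green": {pair("o1", "s1")},  # o2-s2 is missing here
}

violations = icic_violations(edges, ["blue", "red", "purple", "green"])
print(violations)
```

An edge absent from every listed color never shows up in `present_in`, so it is trivially consistent; only partially present edges are reported.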

    3 MCT: Desirable Properties, Problem

    3.1 Associations, Association Recoverability

    A key goal in good schema design is ease of query expression and efficiency of evaluation. But this raises the question: which queries should be expressible easily and evaluated efficiently? In this section, we make this precise, based on the notions of associations, association recoverability, and direct recoverability.

    Since we are interested in good MCT database schema design, we need to start with an initial design specification, for which we use the time-tested ER model. In the ER model, a pair of entity or relationship types is said to be associated, if there is a path between them in the ER graph. More generally, an association is any connected subgraph of the transitive closure of the ER graph. Intuitively, an association graph defines a semantically meaningful tuple of entity/relationship types, with association graph edge labels capturing the path traversed in the ER graph. For example, Figure 6 shows an association graph from (the ER graph associated with) Figure 1. The edge (customer, country) is labeled has.address.in, identifying the ER graph path corresponding to this edge....
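The edge label has.address.in can be derived from a path in the ER graph. The following Python sketch uses breadth-first search over an assumed fragment of the ER graph; picking the shortest path is a simplifying assumption (the text only requires that some path exists), and `association_label` is a hypothetical helper name.

```python
from collections import deque

# Assumed fragment of the ER graph (entity and relationship types as nodes).
ER_GRAPH = {
    "customer": {"has", "make"},
    "has": {"customer", "address"},
    "address": {"has", "in"},
    "in": {"address", "country"},
    "country": {"in"},
    "make": {"customer", "order"},
    "order": {"make"},
}


def association_label(graph, source, target):
    """Label an association edge with the interior nodes of a shortest path.

    Returns None when source and target are not associated (no path).
    """
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for neighbor in sorted(graph[node]):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    if target not in parents:
        return None
    # Walk back from target to source, then drop both endpoints.
    path = []
    node = target
    while node is not None:
        path.append(node)
        node = parents[node]
    path.reverse()
    return ".".join(path[1:-1])


label = association_label(ER_GRAPH, "customer", "country")
print(label)
```

When several paths connect the same pair of types, each would give a different association edge; a fuller treatment would enumerate paths rather than stop at the first one found.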
