IEEE 22nd International Conference on Data Engineering (ICDE'06), Atlanta, GA, USA (2006.04.3-2006.04.7): Making Designer Schemas with Colors


  • Making Designer Schemas with Colors

    Nuwee Wiwatwattana (U of Michigan), nuwee@eecs.umich.edu
    H. V. Jagadish (U of Michigan), jag@eecs.umich.edu
    Laks V. S. Lakshmanan (U of British Columbia), laks@cs.ubc.ca
    Divesh Srivastava (AT&T Labs-Research), divesh@research.att.com

    Abstract

    XML schema design has two opposing goals: elimination of update anomalies requires that the schema be as normalized as possible; yet higher query performance and simpler query expression are often obtained through the use of schemas that permit redundancy. In this paper, we show that the recently proposed MCT data model, which extends XML by adding colors, can be used to address this dichotomy effectively. Specifically, we formalize the intuition of anomaly avoidance in MCT using notions of node normal and edge normal forms, and the goal of efficient query processing using notions of association recoverability and direct recoverability. We develop algorithms for transforming design specifications given as ER diagrams into MCT schemas that are in a node or edge normal form and satisfy association or direct recoverability. Experimental results using a wide variety of ER diagrams validate the benefits of our design methodology.

    1 Motivation

    As XML has evolved from a document markup language to a widely-used format for exchange of structured and semistructured data, managing large amounts of XML data has become increasingly important. Since schema gives meaning to data, and determines the validity of queries and updates against the data, recently XML schema design issues have been investigated [2, 3, 15]. Fundamentally, XML schema design has two opposing goals: elimination of update anomalies requires that the schema be as normalized as possible; yet simpler query specification and higher query performance may both usually be obtained through the use of schemas that permit redundancy.¹

    Supported in part by NSF under Grants No. IIS-0219513 and IIS-0438909.

    ¹ Ideally, schema design should be completely divorced from physical implementation and hence should be separated from performance concerns. Realistically, physical implementation mimics the given schema, but seeks higher performance only through adding indices, materialized views, and other such auxiliary structures. So the choice of schema has a huge impact on performance.

    [Figure 1. ER diagram of TPC-W benchmark. Attributes suppressed for brevity. Diagram not reproduced; its labels are author, item, order_line, order, customer, address, country, credit_card_transaction, write, occur_in, make, has, in, associate, shipping, and billing, with 1 and m cardinalities on the edges.]

    Consider, for example, the ER diagram of the TPC-W benchmark in Figure 1, where attributes are suppressed but can be readily imagined. There is a straightforward way of transforming such an ER diagram to an XML schema (see Figure 2), where entity types from the ER diagram (see Figure 1) are made children of the schema root, relationship types are made children of one of their participating entity types, and all remaining associations are captured through id/idrefs attribute values (indicated in the figure using directed edges). In such a normalized XML schema, update anomalies are avoided at the expense of poor query performance. Queries like Q1: "list the orders placed by customers having addresses in Japan" require an XQuery expression involving multiple (id/idrefs) value-based joins, and can be expensive to evaluate. One might argue that this schema design is needlessly shallow, as it fails to leverage the nested (tree) structure permitted by XML. Does the tradeoff between opposing design goals still exist when we exploit XML's nesting?
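The cost of the shallow design can be illustrated with a minimal sketch in Python. The sample document, the placement of each relationship element (`in` under address, `has` under customer, `make` under order), and the function name are assumptions made for illustration; they are not taken from Figure 2. Answering Q1 takes a chain of value-based lookups, one per id/idref hop.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance of a shallow schema: every entity is a child of the
# root, and associations are id/idref attribute values (names are assumed).
SHALLOW = """
<root>
  <country id="c1" name="Japan"/>
  <country id="c2" name="Canada"/>
  <address id="a1"><in country_idref="c1"/></address>
  <address id="a2"><in country_idref="c2"/></address>
  <customer id="u1"><has address_idref="a1"/></customer>
  <customer id="u2"><has address_idref="a2"/></customer>
  <order id="o1"><make customer_idref="u1"/></order>
  <order id="o2"><make customer_idref="u2"/></order>
  <order id="o3"><make customer_idref="u1"/></order>
</root>
"""

def orders_by_customer_country(root, country_name):
    """Q1 over the shallow document: three value-based joins on idrefs."""
    # Select the ids of countries with the requested name.
    country_ids = {c.get("id") for c in root.findall("country")
                   if c.get("name") == country_name}
    # Join 1: address -> country via country_idref.
    address_ids = {a.get("id") for a in root.findall("address")
                   if a.find("in").get("country_idref") in country_ids}
    # Join 2: customer -> address via address_idref.
    customer_ids = {u.get("id") for u in root.findall("customer")
                    if u.find("has").get("address_idref") in address_ids}
    # Join 3: order -> customer via customer_idref.
    return [o.get("id") for o in root.findall("order")
            if o.find("make").get("customer_idref") in customer_ids]

root = ET.fromstring(SHALLOW)
print(orders_by_customer_country(root, "Japan"))  # ['o1', 'o3']
```

Each set comprehension stands in for one join; a real XQuery processor would evaluate them differently, but the number of value comparisons grows with every hop in the same way.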

    Proceedings of the 22nd International Conference on Data Engineering (ICDE'06) 8-7695-2570-9/06 $20.00 © 2006 IEEE

    [Figure 2. Shallow XML schema from TPC-W. Diagram not reproduced; associations are carried by idref attributes, including country_idref, address_idref, customer_idref, author_idref, item_idref, transaction_idref, ship_address_idref, and bill_address_idref.]

    [Figure 3. Anomaly-free XML schema from TPC-W. Diagram not reproduced; the remaining idref attributes are country_idref, bill_address_idref, ship_address_idref, and item_idref.]

    To appreciate this issue, consider the XML schema of Figure 3. In an XML database that conforms to this schema, an order element would be nested under the element of the customer who made the order, which would be nested under the customer's address element, which would be nested under the corresponding country element. Given the 1:m relationships inherent between relevant entity types in the ER diagram of Figure 1, no data would be duplicated in this XML database, and hence update anomalies would be avoided. Further, our previous query Q1 has both a simple expression:²

    /country[@name = "Japan"]//order

    and an efficient evaluation strategy using structural joins, which have been shown to be much more efficient than value-based joins [1, 7].

    The single-tree structure of XML, however, imposes many limitations. For example, in the XML schema of Figure 3, the billing association between order and address is encoded as attribute values (indicated using a directed edge in the figure) in corresponding elements. This makes many other queries more cumbersome to express and expensive to evaluate. Consider Q2: "list the orders placed with billing addresses in Japan":

    for $o in //order,
        $a in /country[@name = "Japan"]//address
    where $o/billing/@bill_address_idref = $a/@id
    return $o
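The contrast between Q1 and Q2 on a nested design can be sketched in Python as follows. The sample document and the function names `q1` and `q2` are hypothetical, and the paths are evaluated relative to an assumed `root` element rather than as absolute XPath. Q1 is answered by a single path expression, while Q2 still needs a value comparison on `bill_address_idref`.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance in the spirit of the nested schema: orders sit under
# customers, customers under addresses, addresses under countries. The billing
# association is still an idref attribute value.
NESTED = """
<root>
  <country name="Japan">
    <address id="a1">
      <customer id="u1">
        <order id="o1"><billing bill_address_idref="a2"/></order>
        <order id="o3"><billing bill_address_idref="a1"/></order>
      </customer>
    </address>
  </country>
  <country name="Canada">
    <address id="a2">
      <customer id="u2">
        <order id="o2"><billing bill_address_idref="a1"/></order>
      </customer>
    </address>
  </country>
</root>
"""

def q1(root, country):
    """Orders placed by customers with an address in `country`: purely structural."""
    path = f"./country[@name='{country}']//order"
    return [o.get("id") for o in root.findall(path)]

def q2(root, country):
    """Orders billed to an address in `country`: needs a value-based join."""
    path = f"./country[@name='{country}']//address"
    address_ids = {a.get("id") for a in root.findall(path)}
    return [o.get("id") for o in root.iter("order")
            if o.find("billing").get("bill_address_idref") in address_ids]

root = ET.fromstring(NESTED)
print(q1(root, "Japan"))  # ['o1', 'o3']
print(q2(root, "Japan"))  # ['o3', 'o2']
```

In this sample data, order o2 is nested under Canada through its customer yet billed to a Japanese address, so only the idref comparison finds it.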

    A different XML schema design from the one in Figure 3, also without update anomalies, could have nested an order under the address based on the billing address association. This would have simplified the expression of Q2, but at the expense of Q1.

    ² We use XPath and XQuery for stating XML queries.

    [Figure 4. Deep XML schema from TPC-W with data redundancy. The XML tree will be obtained by traversing this graph from the root, and permitting multiple occurrences of elements. Diagram not reproduced.]

    It is, of course, feasible to design XML schemas that allow easy expression, using XPath and XQuery, of a large variety of queries, as well as their efficient evaluation, but this is at the expense of extreme data redundancy. Figure 4 is an example of such a schema: many queries are captured using XPath, but there can be a great deal of redundancy in the representation of various types of address, country, item, and author elements. These examples suggest the desirable goal of a schema that is beneficial for both updates and queries might be elusive for the XML model. It is clear that the single-tree structure of XML does not suffice for meeting this goal. But what about XML-like models, such as the recently proposed MCT (multi-colored trees) logical model for XML databases [13], which extends the XML model by supporting multiple tree structures (each in a different color) overlaid on the same set of XML data elements? In this paper, we investigate the following foundational question for XML/MCT databases:

        Given an ER diagram, is it possible to design MCT schemas where (i) update anomalies can be avoided, and (ii) all associations explicitly encoded in the ER diagram can be expressed using structural (XPath-like) expressions?
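To make the idea of several trees overlaid on one set of elements concrete, here is a minimal Python sketch. It is not the MCT definition of [13] and not the schema of Figure 5: the class `ColoredForest`, the color names "red" and "blue", and the choice of hierarchies are all assumptions for illustration. Each node is stored once and has at most one parent per color, so both Q1 and Q2 become ancestor-descendant lookups, each in its own color.

```python
from collections import defaultdict

class ColoredForest:
    """Toy store: one shared node set, one parent map per color (assumed design)."""

    def __init__(self):
        self.label = {}                  # node id -> element label
        self.parent = defaultdict(dict)  # color -> {child id: parent id}

    def add_node(self, node_id, label):
        self.label[node_id] = label

    def add_edge(self, color, parent_id, child_id):
        # Within one color the structure is a tree: a single parent per node.
        if child_id in self.parent[color]:
            raise ValueError(f"{child_id} already has a {color} parent")
        self.parent[color][child_id] = parent_id

    def ancestors(self, color, node_id):
        parents = self.parent[color]
        while node_id in parents:
            node_id = parents[node_id]
            yield node_id

    def descendants_with_label(self, color, ancestor_id, label):
        # Structural lookup: follow edges of one color only.
        return sorted(n for n, lab in self.label.items()
                      if lab == label and ancestor_id in self.ancestors(color, n))

forest = ColoredForest()
for node_id, label in [("japan", "country"), ("canada", "country"),
                       ("a1", "address"), ("a2", "address"),
                       ("u1", "customer"), ("u2", "customer"),
                       ("o1", "order"), ("o2", "order"), ("o3", "order")]:
    forest.add_node(node_id, label)

# Hypothetical "red" tree: country > address > customer > order.
for p, c in [("japan", "a1"), ("canada", "a2"), ("a1", "u1"), ("a2", "u2"),
             ("u1", "o1"), ("u1", "o3"), ("u2", "o2")]:
    forest.add_edge("red", p, c)

# Hypothetical "blue" tree: country > address > order, by billing address.
for p, c in [("japan", "a1"), ("canada", "a2"),
             ("a2", "o1"), ("a1", "o3"), ("a1", "o2")]:
    forest.add_edge("blue", p, c)

print(forest.descendants_with_label("red", "japan", "order"))   # ['o1', 'o3']
print(forest.descendants_with_label("blue", "japan", "order"))  # ['o2', 'o3']
```

No order element is duplicated in this sketch; the same nine nodes participate in both hierarchies, which is the property the question above asks about.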

    We show, surprisingly, that this is indeed possible, and we propose a novel schema design methodology for achieving the twin goals of update anomaly avoidance and ease of query expression plus evaluation efficiency, making use of the MCT data model. Our starting point is the design specification, expressed in the form of an ER diagram. In particular, for the TPC-W ER diagram of Figure 1, Figure 5 shows an MCT schema that achieves this goal. We make the following contributions in this paper:

    We formally define the types of associations between


    [Figure 5 (MCT schema for the TPC-W ER diagram; diagram not reproduced). Color labels in the diagram: GREEN, BLUE, RED, PURPLE, ORANGE.]
