
Page 1: Integrating XML and Relational Database Systems

World Wide Web: Internet and Web Information Systems, 7, 343–384, 2004. © 2004 Kluwer Academic Publishers. Manufactured in The Netherlands.

Integrating XML and Relational Database Systems

GERTI KAPPEL [email protected]
Institute of Software Technology and Interactive Systems, Business Informatics Group (BIG), Vienna University of Technology, Favoritenstraße 9-11/188, A-1040 Wien, Austria

ELISABETH KAPSAMMER and WERNER RETSCHITZEGGER {ek,wr}@ifs.uni-linz.ac.at
Institute of Applied Computer Science, Department of Information Systems (IFS), University of Linz, Altenbergerstraße 69, A-4040 Linz, Austria

Abstract

Relational databases are increasingly employed to store the content of a web site. At the same time, XML is fast emerging as the dominant standard at the hypertext level of web site management, describing pages and links between them. Thus, the integration of XML with relational database systems to enable the storage, retrieval, and update of XML documents is of major importance. Data model heterogeneity and schema heterogeneity, however, make this a challenging task. In this respect, the contribution of this paper is threefold. First, a comparison of concepts available in XML schema specification languages and relational database systems is provided. Second, basic kinds of mappings between XML concepts and relational concepts are presented and reasonable mappings in terms of mapping patterns are determined. Third, design alternatives for integrating XML and relational database systems are examined and X-Ray, a generic approach for integrating XML with relational database systems, is proposed. Finally, an in-depth evaluation of related approaches illustrates the current state of the art with respect to the design goals of X-Ray.

Keywords: relational database system, XML, heterogeneity, meta schema, generic integration

1. Introduction

Web-based information systems no longer aim at purely providing read-only access to their content, which is simply represented in terms of web pages stored in the web server’s directory. Nowadays, not least due to new requirements emerging from several application areas such as electronic commerce, the employment of databases (DB) to store the content of a web site turns out to be worthwhile [27,50]. This makes it easy to handle both retrieval and update of large amounts of data in a consistent way on a large distributed scale [23]. Besides using databases at the content level, the Extensible Markup Language (XML) [68] is fast emerging as the dominant standard for representing the hypertext level of a web site, i.e., the logical composition of web pages and the navigation among them [1,15,58,66]. Since there is no blending with layout aspects, multidelivery is supported, meaning that one and the same hypertext can be easily rendered according to, e.g., different devices [43]. Furthermore, XML has become the first choice for data exchange between different organizations. Because of the increasing importance of XML and database systems (DBS), their integration with respect to storage, retrieval, and update is a major need [16,66].


Regarding the kind of data model used by the DBS as basis for integration, one can distinguish three alternatives [10,28]. First, special-purpose or native DBS are particularly tailored to store, retrieve, and update XML documents by using XML itself as underlying data model. Examples thereof are research prototypes such as Rufus [62], Lore [30], Strudel [27], and Natix [37] as well as commercial systems such as Tamino [57] and Infonyte [35]. Second, because of their rich object-oriented data modeling capabilities, object-oriented DBS such as Poet [49] as well as object-relational DBS such as DB2 [6], Oracle [32], and SQLServer [54] are well-suited for storing XML documents. Third, concerning the latter systems, there is also the possibility to use their relational data model for integration purposes. This alternative is especially motivated by the fact that currently, a significant amount of data is stored in pre-existing relational databases¹ and will continue to be used by existing applications in the future [29]. There is an increasing demand to publish (parts of) these existing relational data as XML documents according to existing (standardized) XML schemata in terms of a document type definition (DTD) [68] or by using the more powerful XML Schema language [69]², or, vice versa, to store XML documents in existing databases.

Concerning the integration with relational DBS (RDBS), there exist three basic alternatives. The most straightforward approach would be to store XML documents as a whole within a single database attribute, using a simple CLOB attribute (cf., e.g., [22]) or a dedicated XML data type (cf., e.g., DB2 [6], Oracle [32], or the forthcoming SQL/XML-standard [24]). Another possibility would be to decompose XML documents in some way (“shredding”), e.g., into a graph structure, and store them in appropriate database tables (cf., e.g., [28]). In both cases, the XML schema specification in form of tags and attributes is stored together with the content of the XML document as database values. The advantage is that the DB schema is independent of the structure of the XML documents, thus allowing XML documents of arbitrary structure to be stored. One disadvantage of the first alternative is that, concerning transaction management, it is no longer possible to lock only parts of a document for update purposes. A drawback of the second alternative is that a query, for example, has to reconstruct the schema (possibly involving several joins) before being able to access the “real” data needed, thus making query formulation cumbersome and decreasing performance. Finally, neither approach allows integration with already existing relational schemata. To prevent these deficiencies, the third integration alternative is to map the structure of XML documents to a corresponding relational schema wherein XML documents are stored according to the mapping (cf., e.g., [11,22,59]). Only this approach allows existing relational schemata to be reused and thus is further investigated in the paper³.
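The three storage alternatives described above can be contrasted in a minimal sketch. It assumes Python with the standard sqlite3 and xml.etree modules; the table names Document, Edge, and Village, the column layout, and the sample document are illustrative choices of this example and are not taken from the cited systems. The sketch shows that the generic edge table needs self-joins to answer a question that the mapped schema answers with a single-table query.

```python
import sqlite3
import xml.etree.ElementTree as ET

DOC = ("<village><name>Innsbruck</name><country>Tyrol</country>"
       "<accommodation>Hotel Post</accommodation></village>")


def store_as_clob(conn, doc):
    # Alternative 1: the whole document is one opaque attribute value.
    conn.execute("CREATE TABLE Document (id INTEGER PRIMARY KEY, content TEXT)")
    conn.execute("INSERT INTO Document (content) VALUES (?)", (doc,))


def shred_to_edges(conn, doc):
    # Alternative 2: generic graph storage; tag names become data values.
    conn.execute("CREATE TABLE Edge (node INTEGER PRIMARY KEY, "
                 "parent INTEGER, tag TEXT, value TEXT)")
    next_id = 0

    def visit(elem, parent):
        nonlocal next_id
        next_id += 1
        node = next_id
        text = (elem.text or "").strip() or None
        conn.execute("INSERT INTO Edge VALUES (?, ?, ?, ?)",
                     (node, parent, elem.tag, text))
        for child in elem:
            visit(child, node)

    visit(ET.fromstring(doc), None)


def store_mapped(conn, doc):
    # Alternative 3: document structure mapped onto a domain-specific schema
    # (accommodation elements are left out to keep the sketch short).
    conn.execute("CREATE TABLE Village (name TEXT PRIMARY KEY, country TEXT)")
    root = ET.fromstring(doc)
    conn.execute("INSERT INTO Village VALUES (?, ?)",
                 (root.findtext("name"), root.findtext("country")))


# "Which country is Innsbruck in?" against the edge table and the mapped table.
EDGE_QUERY = """
    SELECT c.value
    FROM Edge v
    JOIN Edge n ON n.parent = v.node AND n.tag = 'name'
    JOIN Edge c ON c.parent = v.node AND c.tag = 'country'
    WHERE v.tag = 'village' AND n.value = 'Innsbruck'
"""
MAPPED_QUERY = "SELECT country FROM Village WHERE name = 'Innsbruck'"

conn = sqlite3.connect(":memory:")
store_as_clob(conn, DOC)
shred_to_edges(conn, DOC)
store_mapped(conn, DOC)
edge_result = conn.execute(EDGE_QUERY).fetchone()[0]
mapped_result = conn.execute(MAPPED_QUERY).fetchone()[0]
```

Both queries return the same value, but only the third variant could be pointed at a relational schema that already exists.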

One major challenge of this schema-to-schema mapping approach is the existence of data model heterogeneity and schema heterogeneity⁴. Data model heterogeneity refers to the fact that there are fundamental differences between concepts provided by XML and those provided by RDBS, which have to be considered when defining a certain mapping. These differences concern, e.g., structuring, typing, and identification issues, relationships, default declarations, and the order of stored instances (cf. Section 2). Schema heterogeneity means that, even if the XML schema specification and the relational schema represent the same part of the universe of discourse, the design of both may be different, due to different design goals or simply because they have been developed independently of each other without having integration in mind. Consider for example business-to-business electronic commerce, where a supplier wants to store the product catalogue of another supplier, represented in XML, within an already existing relational database. In this scenario, the autonomy of both the XML schema specification and the relational schema should be preserved in that neither of them has to be changed.

This paper deals with these different forms of heterogeneity, bringing together previous work done in this area [39–41]. First, an in-depth analysis of data model heterogeneity is provided by comparing concepts available in RDBS and XML schema specification languages, comprising XML DTD and XML Schema (cf. Section 2). Second, basic kinds of mappings between XML concepts and relational concepts are presented and reasonable mappings in terms of mapping patterns are determined to mediate between the different structuring mechanisms supported by XML and RDBS (cf. Section 3). Third, design alternatives for integrating XML and relational database systems are examined (cf. Section 4) and X-Ray, a generic approach for integrating XML with relational database systems, is proposed (cf. Section 5). An extensive evaluation of existing approaches closely related to X-Ray is presented in Section 6. Finally, Section 7 concludes the paper with a summary and gives an outlook on future work.

2. Comparison of concepts – XML versus RDBS

This section is dedicated to an in-depth investigation of the similarities and differences between XML concepts and RDBS concepts at different levels of abstraction. It focuses on six different aspects of the data models, comprising structuring and typing mechanisms, uniqueness of names, null values and default values, identification, relationships, and order. Concerning XML, DTD concepts are used as a starting point; XML Schema concepts are considered as far as they differ from or go beyond those of DTDs. It has to be noted that we do not consider every XML Schema concept in full detail but rather try to give an overview of those concepts closely related to DTDs and RDBS. Concerning RDBS, a distinction is made, as far as necessary, between concepts as defined by the original relational model and their realization by the SQL standard [5]. This section provides the basis for analyzing different mapping possibilities as done in Section 3. Further, it represents the prerequisite for discussing design goals and design decisions as done in Section 4 and for developing a meta schema in order to bridge the heterogeneous concepts as described in Section 5.

2.1. Levels of abstraction

For discussing and comparing the basic concepts of XML from a database point of view, it is important to keep in mind that they belong to different levels of abstraction as illustrated in Figure 1. These levels comprise the data model level (i.e., the concepts provided for defining the structure of data), the schema level (i.e., the utilization of data model concepts for structuring certain domain data), and the instance level (i.e., a concrete XML document or relational database being the instance of a certain schema).

Figure 1. Concepts at different levels of abstraction.

Regarding the data model concepts provided by RDBS and XML, there are fundamental differences leading to data model heterogeneity, which complicate the integration of both paradigms. Heterogeneity with respect to data models is mainly due to the different purposes RDBS and XML have been developed for. The aim of RDBS is to store large amounts of data, enabling efficient access and ensuring their consistency [5]. In contrast, XML is intended to serve as a format for structuring and exchanging hypertext documents [1,66]. Data model heterogeneity is discussed in the following subsections, focusing on typing mechanisms, null values and default values, identification, relationships, and order of instances. Since, from a database perspective, many concepts supported by DTDs are insufficient for schema definition, e.g., the typing mechanisms, XML is often referred to as a mere data format having no appropriate data model. Consequently, there have been strong efforts to supplement DTDs by means of the richer XML schema specification language XML Schema [69]. In contrast to DTDs, XML Schema is expressed by means of XML itself. Extensions include a richer set of primitive data types as well as a mechanism to enable inheritance. Although the use of XML itself as language to specify various schemata allows existing XML tools to be reused for schema validation, the use of XML Schema is far more complex than simply using DTD syntax.

Analogous to heterogeneity at the data model level, there may also be heterogeneity at the schema level. This is very likely since the schema of an XML document is allowed to be irregular, implicit, partial, incomplete, not always known ahead of time, and may change frequently and without notice, which demonstrates the close resemblance to semi-structured data [68]. If XML documents are based on an explicit schema specification, applications are able to validate the documents’ structure with respect to this specification by means of an XML parser [19]. A DTD as well as an XML Schema specification can be stored within a separate file referenced from within the XML documents by means of a URI (Uniform Resource Identifier) [7,68]. In addition, a DTD can be included directly within an XML document. Since explicit schema specifications are optional, it is not clear at this time if more XML documents will be governed by such schemata or if more documents will exist without them [66]. These are fundamental differences to RDBS, where the existence of an a priori schema, which is stored directly within the database, is mandatory, and the validity of tuples with respect to this schema is checked by the system before inserting them into the database.

Finally, at the instance level, we consider a certain XML document and a certain relational database, respectively. XML documents are self-describing, meaning that parts of the schema definition in terms of tags are replicated within each XML document, no matter whether the schema is defined explicitly or not. This is in contrast to RDBS, where the schema exists only once for the whole database. Storing the schema along with the data, as XML does, provides flexibility with respect to both integrating heterogeneous sources and changes to the structure. However, the replicated schema information implies space cost for storage, time cost for retrieval, and the danger of inconsistencies in case of schema updates [19].

2.2. Structuring and typing mechanisms

The basic mechanisms used to specify the structure of XML documents and relational schemata are element types and attributes for XML as well as relations and attributes for RDBS, respectively (cf. Figure 1). Concerning element types, it is useful for further discussions to categorize them along two dimensions (cf. Table 1). The first dimension depicts whether the element type contains an atomic domain or not, whereas the second dimension denotes whether the element type contains a composite domain or not. This distinction results in four different kinds of element types⁵. It has to be emphasized that this classification is applicable to both DTDs and XML Schema. RDBS, however, allow domains to be specified for attributes only, not for relations. Let’s consider each of these kinds of element types as well as XML and database attributes in more detail.

Element types that contain an atomic domain only are called atomic element types. Concerning DTD element types, there is only one possible predefined atomic domain, namely #PCDATA. The predefined atomic domains for DTD attributes comprise a string type called CDATA (which is different from #PCDATA in that if #PCDATA-values contain tags, these tags are interpreted by an XML parser), an enumeration type, and some special types including, e.g., ID and IDREF(S) (cf. Sections 2.5 and 2.6).

In contrast to DTDs, XML Schema provides a large range of predefined atomic domains for both element types and attributes. These predefined atomic domains are comparable to those present in RDBS and include some special ones like anyURI to represent URIs [7] and QName (qualified name) to specify a name that may have a namespace prefix. XML Schema allows atomic domains to be used as a basis to derive user-defined atomic domains by specifying appropriate extensions or constraints, e.g., length restrictions or enumeration restrictions, which is similar to the object-oriented concept of sub-classing.

Table 1. Kinds of element types.

Kind of Element Type (ET)            Atomic domain   Composite domain
Atomic ET                            ✓               ×
Composite ET with element content    ×               ✓
Composite ET with mixed content      ✓               ✓
Empty ET                             ×               ×
Legend: ✓ – contains, × – does not contain.

Besides atomic domains, element types may be associated with a composite domain; such element types are further on called composite element types. A composite element type contains other element types, called component element types, which are used to build arbitrarily deep part-of hierarchies by means of nesting. For each XML document, it is required that all component element types are rooted in a single element type. This is in contrast to RDBS, where part-of hierarchies cannot be realized by means of nesting since relations consist of atomic-valued attributes only. However, part-of hierarchies can be expressed in RDBS by means of foreign key constraints (cf. Section 2.6). Since composite element types may have an atomic domain in addition to a composite domain, they are further distinguished into composite element types with mixed content and composite element types with element content (cf. Table 1). Concerning the latter, it has to be specified whether component element types occur in a sequence or as a choice, meaning that they are mutually exclusive.

Considering the definition of composite element types, there is a significant difference between DTDs and XML Schema. In contrast to DTDs, XML Schema separates the definition of a composite element type from the declaration of its composite domain specifying the component element types. This separation allows a domain to be reused for different composite element types, i.e., they can share the same domain. Figure 2 illustrates the definition of the composite element type village and the specification of the composite domain villageInfo. The keyword complexType denotes the declaration of this composite domain. Note that when defining a schema specification using XML Schema, namespaces have to be used to distinguish between elements and data types provided by XML Schema and elements and data types defined by the particular schema specification. In this example, this is done by applying the prefix “acc” when utilizing the complex domain villageInfo as type for the element village.

Similar to atomic domains, composite domains can be derived from each other, which is not supported by DTDs and RDBS. Element types that neither have an atomic domain nor a composite domain are called empty element types. An element type can also be declared to have ANY content, meaning that there is no restriction concerning the kind of element type. In XML Schema, any is more powerful in that it can be restricted to element types of a certain namespace and can also be used for attributes (anyAttribute). Finally, each element type, no matter if it is an atomic, composite, or empty element type, may contain XML attributes. Table 1 summarizes the different kinds of element types by denoting their characteristics (for examples, see Sections 3.1 and 3.2).

Concerning the instance level, XML documents contain elements, each of them marked by a start tag and an end tag in terms of the name of a specific element type. The element may contain component elements expressed by nested tags as well as attributes. Both elements and attributes are allowed to contain values; therefore we distinguish between element values and attribute values. Attribute names and their values are placed within the start tag, whereas element values occur between start tag and end tag. Consequently, schema information of an explicit schema specification is replicated within XML documents in that each element and each attribute value is annotated with the corresponding element type name and attribute name, respectively. The instance level of an RDBS is considerably simpler, since values exclusively belong to attributes, which are in turn composed into tuples.

DTD:

    <!ELEMENT village (name, country, accommodation*)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT country (#PCDATA)>
    <!ELEMENT accommodation (#PCDATA)>

XML Schema:

    <complexType name="villageInfo">
      <sequence>
        <element name="name" type="string"
                 minOccurs="1" maxOccurs="1"/>
        <element name="country" type="string"
                 minOccurs="1" maxOccurs="1"/>
        <element name="accommodation" type="string"
                 minOccurs="0" maxOccurs="unbounded"/>
      </sequence>
    </complexType>
    <element name="village" type="acc:villageInfo"
             minOccurs="0" maxOccurs="unbounded"/>

XML Document:

    <village>
      <name>Innsbruck</name>
      <country>Tyrol</country>
      <accommodation>Hotel Post</accommodation>
      <accommodation>Hotel Admiral</accommodation>
      <accommodation>Hotel Anker</accommodation>
    </village>

Figure 2. Composite element type with element content.

2.3. Uniqueness of names

The name of a relation is required to be unique within the whole relational schema, similar to the name of an XML element type being unique throughout the DTD. By means of so-called namespaces [67], XML allows element types to have the same name by using different namespace prefixes. Namespaces, however, are not further considered in this paper. XML Schema is more flexible in this respect since the name of an XML element type has to be unique within a so-called symbol space only. A symbol space is, among others, associated with each composite domain defined by a user. Thus, the same name may appear in composite element types being defined on the basis of different composite domains without conflict [69]. For example, two composite domains may both contain an element type with name address but with different domains, without conflict. The name of an XML attribute defined within a DTD or an XML Schema has to be unique within its element type, again similar to an RDBS attribute’s name, which has to be unique within its relation.


2.4. Null values and default values

Similar to RDBS, XML allows null values⁶ as well as default values to be expressed. In RDBS, the concept of null values is defined for attributes only. XML, however, supports null values for both attributes and elements. In DTDs, default values may be applied to XML attributes only, whereas XML Schema supports default values for XML element types, too.

Concerning XML attributes, the so-called default declaration within a DTD requires one of the following constraints to be specified for each attribute:

• #REQUIRED, meaning that a value is required in the sense of NOT NULL of RDBS.
• #IMPLIED, denoting the optional nature of an attribute value, expressed by the omission of NOT NULL in RDBS. Note that in case there is no value provided for such an XML attribute at the instance level, the attribute name is omitted within the XML document, too.
• #FIXED <ConstValue>, defining a constant value which is not possible in RDBS.
• <DefaultValue>, specifying a default value analogous to the DEFAULT clause in RDBS.
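The correspondence listed above can be sketched as a small translation function. This is a minimal illustration in Python with sqlite3, assuming hypothetical attribute names (category, phone, currency, stars) and a helper name column_for that do not appear in the paper. Since the list notes that #FIXED has no direct relational counterpart, the sketch approximates it with NOT NULL, DEFAULT, and a CHECK constraint, which is an assumption of this example rather than the mapping proposed by the authors.

```python
import sqlite3


def sql_literal(value):
    # Quote a string as an SQL literal, doubling embedded single quotes.
    return "'" + value.replace("'", "''") + "'"


def column_for(name, declaration, sql_type="TEXT"):
    """Translate a DTD default declaration into a column definition."""
    decl = declaration.strip()
    if decl == "#REQUIRED":
        return f"{name} {sql_type} NOT NULL"
    if decl == "#IMPLIED":
        return f"{name} {sql_type}"  # nullable column, no default
    if decl.startswith("#FIXED"):
        lit = sql_literal(decl[len("#FIXED"):].strip().strip("\"'"))
        # Approximation only: no direct relational counterpart exists.
        return f"{name} {sql_type} NOT NULL DEFAULT {lit} CHECK ({name} = {lit})"
    return f"{name} {sql_type} DEFAULT {sql_literal(decl.strip(chr(34) + chr(39)))}"


columns = [
    column_for("category", "#REQUIRED"),
    column_for("phone", "#IMPLIED"),
    column_for("currency", '#FIXED "EUR"'),
    column_for("stars", '"3"'),
]
conn = sqlite3.connect(":memory:")
conn.execute(f"CREATE TABLE Accommodation ({', '.join(columns)})")
conn.execute("INSERT INTO Accommodation (category) VALUES ('B')")
row = conn.execute(
    "SELECT category, phone, currency, stars FROM Accommodation").fetchone()
```

Inserting only the required attribute leaves the optional column NULL and fills the other two from their defaults.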

In XML Schema, there is an additional constraint for attributes named prohibited, which allows an inherited attribute to be masked for the actual element type. These constraints can be expressed by the attribute use (possible values: required, optional, which corresponds to #IMPLIED in DTDs, and prohibited) and the attributes default and fixed, storing a default value and a fixed value, respectively. Concerning an element, whether it may be omitted or not is specified within both DTDs and XML Schema by means of cardinality constraints. The cardinality specifies how often an element of a certain element type occurs as component element of its composite element. Since element types may be components of more than one composite element type, each of their occurrences as component element type can exhibit a different cardinality. The cardinality symbols for DTDs are ‘?’ (zero or 1), ‘*’ (zero or more), ‘+’ (1 or more) and no symbol (exactly 1). In XML Schema, the cardinality can be specified in more detail by using the attributes minOccurs and maxOccurs (cf. Table 2).
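The correspondence between DTD cardinality symbols and minOccurs/maxOccurs can be written down as a lookup in both directions. The following Python sketch is illustrative only: the function names are hypothetical, and None is used here as a stand-in for the XML Schema value unbounded.

```python
from typing import Optional, Tuple

# DTD symbol -> (minOccurs, maxOccurs); None stands for "unbounded".
DTD_TO_OCCURS = {
    "": (1, 1),      # no symbol: exactly one
    "?": (0, 1),     # zero or one
    "*": (0, None),  # zero or more
    "+": (1, None),  # one or more
}


def dtd_to_occurs(symbol: str) -> Tuple[int, Optional[int]]:
    """Return the (minOccurs, maxOccurs) pair for a DTD cardinality symbol."""
    return DTD_TO_OCCURS[symbol]


def occurs_to_dtd(min_occurs: int = 1,
                  max_occurs: Optional[int] = 1) -> Optional[str]:
    """Return the DTD symbol for a cardinality, or None if no symbol exists."""
    for symbol, bounds in DTD_TO_OCCURS.items():
        if bounds == (min_occurs, max_occurs):
            return symbol
    return None
```

A cardinality such as three to five has no single DTD symbol, so occurs_to_dtd(3, 5) yields None, while the defaults of minOccurs and maxOccurs map to the empty symbol.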

Table 2. Comparison of concepts: Cardinality.

Cardinality                                  UML       DTD                  XML Schema minOccurs   XML Schema maxOccurs
Zero or one                                  0..1      ?                    0                      1 (default)
Exactly one                                  1         default, no symbol   1 (default)            1 (default)
Zero or more                                 0..*, *   *                    0                      unbounded
One or more                                  1..*      +                    1 (default)            unbounded
Arbitrary cardinality, e.g., three to five   3..5      not supported        3                      5

It has to be emphasized that there is a semantic difference between a start tag being directly followed by an end tag and start tag and end tag being omitted altogether from the XML document. The former matches one of three different specifications within DTDs and XML Schema:

(1) An element is specified as an empty element type.
(2) An element is specified as an atomic element type, whose value is an empty string.
(3) An element is specified as a composite element type, but within the particular XML document, no component elements exist.

In contrast to that, the omission of tags indicates a null value in the sense of RDBS. Using XML Schema, an alternative would be to set the special attribute xsi:nil to the value true. Note that there is no corresponding mechanism for XML attributes. XML Schema also provides a boolean attribute nillable, indicating whether an element is allowed to have neither text content nor element content, despite having a specification requiring content. In such a case the element must have an attribute xsi:nil with the value true.
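The distinction between an omitted element, an empty element, and an element marked with xsi:nil can be sketched as follows. The example assumes Python's xml.etree.ElementTree; the function name element_value and the element names mayor and zip are hypothetical, and the sketch only covers atomic element types.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes namespaced attributes in {namespace-uri}local form.
XSI_NIL = "{http://www.w3.org/2001/XMLSchema-instance}nil"

DOC = """
<village xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <name>Innsbruck</name>
  <country></country>
  <mayor xsi:nil="true"/>
</village>
"""


def element_value(parent, tag):
    """Return the value to store for a component element of `parent`.

    None models an RDBS null value, "" models an empty string.
    """
    child = parent.find(tag)
    if child is None:
        return None            # tags omitted: null value
    if child.get(XSI_NIL) in ("true", "1"):
        return None            # explicitly nil: null value
    return child.text or ""    # start tag directly followed by end tag: ""


village = ET.fromstring(DOC)
values = {tag: element_value(village, tag)
          for tag in ("name", "country", "mayor", "zip")}
```

Here country is present but empty and therefore yields an empty string, whereas mayor (nil) and zip (omitted) both yield None.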

2.5. Identification

In RDBS, the unique identification of tuples is done by means of a primary key, which may be composed of one or more attributes of the corresponding relation (cf. Table 3). In DTDs, only a single attribute of an element type can be designated as identifying attribute by means of the special attribute type ID, which may in turn contain a string value (cf. Figure 3).

Table 3. Comparison of concepts: Identification.

                          RDBS                        DTD                         XML Schema
Concept                   Primary key                 ID attribute type           Key (in addition to DTD concept)
Composite key             Yes – one or more           No – single attribute of    Yes – one or more XML attributes or
                          attributes of a relation    an element type             atomic element types
Scope of identification   Unique identification of    Unique identification of    Unique identification of element within a
                          tuple within relation       element within document     scope identified by an XPath expression
Optional key              Yes                         Yes                         Yes
Equality & identity       No distinction              No distinction              No distinction

Figure 3. Exemplary identification in XML.


In addition to the DTD concept ID, XML Schema allows not just attributes, but also element types of an arbitrary atomic domain and combinations thereof to serve as keys. The scope of identification in RDBS is a single relation, i.e., the value of the primary key uniquely identifies each tuple within a relation. In DTDs, the scope of identification is broader in the sense that the value of an ID attribute is unique within the whole XML document. This allows the unique identification of an element not only with respect to other elements of the same element type but rather across all elements of any element type. XML Schema allows the scope of each key to be specified by means of an XPath [70] expression (cf. Figure 3, attribute xpath of element selector). Another XPath expression denotes the element types and/or attributes serving as key (attribute xpath of element field).

In DTDs and XML Schema, element types are not required to contain an ID attribute or a key, respectively. This is similar to RDBS products, where relations need not contain a primary key. Note that this is in contrast to the theory of the relational model, where primary keys are mandatory for each relation. Concerning DTDs, even if an element type has an attribute of type ID, its usage may be optional by defining it as #IMPLIED. In contrast, keys in XML Schema must always be non-nillable. Since the identification of both tuples in RDBS and elements in XML is value-based, it is not possible to distinguish between equality and identity as it is possible in the object-oriented data model [14] and in the XQuery 1.0/XPath 2.0 data model [72]. As can be seen, whereas keys in RDBS and in XML Schema are very similar, keys and attributes of type ID are rather heterogeneous concepts.
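The difference in scope between an ID attribute and a primary key can be made concrete with a small check. The Python sketch below is an illustration under assumptions: ElementTree does not read DTD attribute types, so the attribute named id is simply assumed to be declared with type ID, and the element types tourism, village, and hotel as well as the function name duplicate_ids are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET
from collections import Counter

DOC = '<tourism><village id="v1"/><hotel id="h1"/><hotel id="v1"/></tourism>'


def duplicate_ids(xml_text, id_attribute="id"):
    """Return ID values used more than once anywhere in the document."""
    root = ET.fromstring(xml_text)
    counts = Counter(elem.get(id_attribute) for elem in root.iter()
                     if elem.get(id_attribute) is not None)
    return sorted(value for value, n in counts.items() if n > 1)


# Document-wide scope: "v1" on a village and on a hotel is a clash.
clashes = duplicate_ids(DOC)

# Relation-wide scope: the same key value in two relations is accepted.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Village (id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE Hotel (id TEXT PRIMARY KEY)")
conn.execute("INSERT INTO Village VALUES ('v1')")
conn.execute("INSERT INTO Hotel VALUES ('v1')")
hotel_rows = conn.execute("SELECT COUNT(*) FROM Hotel").fetchone()[0]
```

A mapping from ID attributes to primary keys therefore has to decide how the document-wide uniqueness is preserved or relaxed on the relational side.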

2.6. Relationships

In RDBS, relationships can be expressed between relations by means of foreign keys, i.e., arbitrary attributes that refer to the primary key of the same relation or of another relation. The number of tuples which may participate in a relationship can be constrained by defining the foreign key as NOT NULL and/or UNIQUE. With this, different cardinalities can be supported as illustrated in Table 4. DTDs allow two alternative ways for specifying relationships between element types, namely IDREF(S) attributes and component element types (cf. Section 2.2). Attributes of type IDREF(S) represent some kind of foreign key referencing attributes of type ID. The distinction between IDREF attributes and IDREFS attributes concerns their cardinality, in that the former are single-valued and the latter are multi-valued. In contrast to RDBS, where the participating tuples are constrained to the participating relation, using IDREF(S) the participating elements cannot be constrained to be of a certain element type.

Table 4. Comparison of concepts: Relationships.

XML Schema supports the concept of so-called keyref, which is similar to the RDBS concept of foreign keys, meaning that a certain element/attribute combination refers to the corresponding element/attribute combination building the key. In contrast to DTDs, the participants of a relationship are typed by the element type containing the key. Regarding relationships which are realized by specifying component element types, the cardinality of component element types can be an arbitrary value, as already mentioned in Section 2.4. Further, the participating elements may not only be component elements, as required when using DTDs, but may also be elements derived from these component elements.
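How NOT NULL and UNIQUE on a foreign key constrain participation on the relational side can be tried out with a short sketch. It assumes Python's sqlite3 module with foreign key enforcement switched on; the relations Village, Accommodation, and TouristOffice and the helper names are hypothetical and serve only to show the mechanism.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only on request
    conn.executescript("""
        CREATE TABLE Village (id INTEGER PRIMARY KEY, name TEXT);
        -- NOT NULL: every accommodation must belong to a village;
        -- without UNIQUE, many accommodations may share one village.
        CREATE TABLE Accommodation (
            id INTEGER PRIMARY KEY,
            village_id INTEGER NOT NULL REFERENCES Village(id));
        -- NOT NULL plus UNIQUE: at most one tourist office per village.
        CREATE TABLE TouristOffice (
            id INTEGER PRIMARY KEY,
            village_id INTEGER NOT NULL UNIQUE REFERENCES Village(id));
        INSERT INTO Village VALUES (1, 'Innsbruck');
    """)
    return conn


def rejected(conn, sql):
    """Return True if the statement violates an integrity constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


conn = make_db()
second_accommodation_ok = (
    not rejected(conn, "INSERT INTO Accommodation VALUES (1, 1)")
    and not rejected(conn, "INSERT INTO Accommodation VALUES (2, 1)"))
null_reference_rejected = rejected(
    conn, "INSERT INTO Accommodation VALUES (3, NULL)")
first_office_ok = not rejected(conn, "INSERT INTO TouristOffice VALUES (1, 1)")
second_office_rejected = rejected(
    conn, "INSERT INTO TouristOffice VALUES (2, 1)")
```

Unlike an IDREF attribute, each foreign key here names the relation it refers to, so a reference to a tuple of another relation cannot be stored by accident.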

2.7. Order

In contrast to relations and tuples in RDBS, the element types and elements of an XML document adhere to both an explicit and implicit order. The order of element types can be explicitly defined within a DTD by using the sequence operator ‘,’ whereas XML Schema uses the element type sequence. The example shown in Figure 4 specifies that an element of type village comprises the following three component element types in the specified order, i.e., name has to occur first, then country, and then accommodation.

At the instance level, the order of concrete elements is defined implicitly by the position of elements within the XML document (cf. also Figure 4). Note that this implicit order must not contradict the explicit order defined by the corresponding DTD. In our example, the order of the particular accommodation elements is given at the instance level by occurring at a certain position within an element of type village. It is important to be aware that elements occurring as component elements of different composite elements do not always have to exhibit the same order.

For example, elements of type accommodation being component elements of type village may show a different order than as component elements of type owner. An implicit order not only concerns elements of the same element type but also elements of different element types, as is depicted in Figure 5.

Figure 4. Explicit and implicit order.


Figure 5. Implicit order between elements of different element types.

Figure 6. Unordered component element types.

In addition to these concepts, XML Schema allows the explicit definition of an unordered occurrence of component element types by means of the element type all (cf. Figure 6). Note that when using all, the cardinality constraints are restricted in that minOccurs may only have the values 0 and 1, and maxOccurs may only have the value 1.

3. Patterns for mapping XML and RDBS

After having analyzed differences between XML concepts and RDBS concepts, let us consider the possibilities for mapping a DTD to a relational schema. This section proposes some basic mapping possibilities and determines on their basis which kind of mapping is reasonable in a certain situation, thus yielding so-called mapping patterns (cf. footnote 7). These mapping patterns are universally applicable and have been used as a basis for designing a meta schema for representing mapping knowledge in our X-Ray approach (cf. Section 5).

3.1. Basic kinds of mappings

A straightforward way would be to map each element type to a relation and each XML attribute to an attribute of the respective relation (cf. Figure 7). Due to data model heterogeneity and schema heterogeneity, however, such a one-to-one mapping is neither always possible nor desirable. For example, in the presence of deep element nesting, directly mapping elements to tuples of different relations would lead to excessive fragmentation of the document over various relations, thus decreasing performance.

When considering the structuring mechanisms of XML and RDBS as discussed in Section 2.2, three basic kinds of mappings at the data model level may be distinguished (cf. Figure 8):

Figure 7. Straightforward mapping of XML concepts to relational concepts.

Figure 8. Basic kinds of mappings.

(1) ET_R. An element type (ET) is mapped to a relation (R), referred to below as its base relation. Note that several element types can be mapped to one base relation. An example of an ET_R mapping is the mapping of element type accommodation to relation Accommodation in Figure 8.

(2) ET_A. An element type is mapped to a relational attribute (A), whereby the relation of the attribute represents the base relation of the element type. Note that several element types can be mapped to the attributes of one base relation. An example of an ET_A mapping is the mapping of element type name to attribute Name of relation Accommodation in Figure 8.

(3) A_A. An XML attribute is mapped to a relational attribute whose relation represents the base relation of the XML attribute. Again, several XML attributes can be mapped to the attributes of one base relation. The mapping of XML attribute id to attribute AccID of relation Accommodation in Figure 8 gives an example of an A_A mapping.

It has to be emphasized that both element types and attributes can be mapped to a single base relation and a single attribute only. Another point is that ET_A and A_A mappings also determine the instance level, in that database values are mapped to XML values. Thus, it makes sense that ET_R mappings occur together with ET_A and A_A mappings.
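The three basic kinds of mappings and the single-target restriction can be captured in a small data structure. The following is a minimal sketch under stated assumptions: the names `MappingKind`, `Mapping`, and `MappingRegistry` are hypothetical and do not reflect the X-Ray meta schema; as a simplification, the registry allows at most one mapping per XML concept and identifies concepts by their plain name.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class MappingKind(Enum):
    ET_R = "element type -> relation"
    ET_A = "element type -> relational attribute"
    A_A = "XML attribute -> relational attribute"


@dataclass(frozen=True)
class Mapping:
    kind: MappingKind
    xml_concept: str                    # element type or XML attribute name
    base_relation: str
    db_attribute: Optional[str] = None  # stays None for ET_R mappings


class MappingRegistry:
    """Hypothetical container; allows at most one mapping per XML concept."""

    def __init__(self):
        self._by_concept = {}

    def add(self, mapping):
        if mapping.xml_concept in self._by_concept:
            raise ValueError(f"{mapping.xml_concept} is already mapped")
        needs_attribute = mapping.kind is not MappingKind.ET_R
        if needs_attribute != (mapping.db_attribute is not None):
            raise ValueError("ET_R takes no attribute; ET_A and A_A need one")
        self._by_concept[mapping.xml_concept] = mapping

    def base_relation_of(self, xml_concept):
        mapping = self._by_concept.get(xml_concept)
        return mapping.base_relation if mapping else None


# The three example mappings named in the text for Figure 8.
registry = MappingRegistry()
registry.add(Mapping(MappingKind.ET_R, "accommodation", "Accommodation"))
registry.add(Mapping(MappingKind.ET_A, "name", "Accommodation", "Name"))
registry.add(Mapping(MappingKind.A_A, "id", "Accommodation", "AccID"))
```

Several XML concepts may share one base relation, as the three registered mappings do, while a second mapping for the same concept is rejected.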

Furthermore, it is not mandatory that all element types and attributes of a DTD as well as all relations and attributes of a relational schema have a mapping. An example on the relational side could be a foreign key that serves for establishing a relationship but might not be relevant within the XML document and therefore requires no mapping. An example on the XML side would be an empty element type that occurs exactly once at a certain position within the XML document and does not require any mapping either.

Figure 9. Exemplary mappings.

The examples demonstrate that the omission of mappings is conceivable not only when DTD and relational schema have been developed independently of each other, but also if one has been derived from the other. However, if the cardinality of a relationship or an element type, respectively, or the default declaration of an attribute requires the existence of a corresponding instance, a proper mapping is mandatory.

The three basic kinds of mappings introduced above can be further refined with respect to the determination of an element type’s base relation. First, if an element type is to be mapped, one has to consider the first of its ancestor element types that is mapped to a relation or an attribute, thus having a base relation. This base relation constitutes the parent base relation of the XML element type to be mapped and is a candidate for being its base relation, too. If none of its ancestor element types is mapped, an arbitrary relation can be chosen as base relation. Concerning the example in Figure 9 (cf. also the more comprehensive example given in Figure 10), the element types address, street, and country all have the same parent base relation, namely Accommodation, which represents the base relation of the ancestor element type accommodation. Note that, aiming at an intuitive presentation, Figure 9 depicts mappings between XML element types and relations in terms of a UML class diagram [52]. To be able to distinguish between element types and relations, they are depicted as instances of the corresponding meta class ElementType and Relation, respectively.

Second, if an XML attribute is to be mapped, its element type has to be considered first. If the attribute’s element type is not mapped, its ancestor element types have to be considered as done for element types discussed above. Again, the relation which the first of these ancestor element types is mapped to represents the parent base relation of the XML attribute, thus being a candidate for being its base relation, too. The parent base relation also constitutes the base relation if the XML element type or the attribute, respectively, can be mapped to the relation or one of its attributes, which is referred to below as direct mapping. For an example, refer to the element type street in Figure 9, which is directly mapped to an attribute of its parent base relation Accommodation. Otherwise, a proper base relation may be one of those relations reachable from the parent base relation via foreign key relationships, which is referred to below as indirect mapping. For an example, consider the element type country, which is indirectly mapped to an attribute of relation Country reachable from its parent base relation Accommodation. Indirect mapping is reasonable if the relational attribute that should be the mapping target is factored out from the parent base relation, e.g., due to normalization or because of vertical partitioning. Note that element type address is used to group address data and thus has no relational counterpart and no base relation at all.

Figure 10. Exemplary DTD, relational schema, and UML class diagram.
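The determination of the parent base relation and the distinction between direct and indirect mapping can be sketched as follows. This is an illustrative sketch, not the X-Ray implementation: the dictionaries `PARENT`, `BASE_RELATION`, and `FOREIGN_KEYS` as well as both function names are assumptions, filled with the part of the running example described in the text. Reachability is followed along the listed foreign key links only.

```python
from collections import deque

# child element type -> parent element type (fragment of the running example)
PARENT = {
    "accommodation": "accommodations",
    "address": "accommodation",
    "street": "address",
    "village": "address",
    "country": "address",
}
# element types that already have a base relation
BASE_RELATION = {"accommodation": "Accommodation"}
# relation -> relations directly connected via foreign key relationships
FOREIGN_KEYS = {"Accommodation": {"Village"}, "Village": {"Country"}}


def parent_base_relation(element_type):
    """Base relation of the first mapped ancestor, or None if there is none."""
    ancestor = PARENT.get(element_type)
    while ancestor is not None:
        if ancestor in BASE_RELATION:
            return BASE_RELATION[ancestor]
        ancestor = PARENT.get(ancestor)
    return None


def mapping_mode(element_type, target_relation):
    """Classify a candidate base relation as direct, indirect, or unreachable."""
    parent = parent_base_relation(element_type)
    if parent is None:
        return "unconstrained"      # an arbitrary relation may be chosen
    if parent == target_relation:
        return "direct"
    seen, queue = {parent}, deque([parent])
    while queue:                    # breadth-first search over foreign keys
        relation = queue.popleft()
        for neighbour in FOREIGN_KEYS.get(relation, ()):
            if neighbour == target_relation:
                return "indirect"
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return None                     # not reachable: no reasonable mapping
```

With these inputs, street is classified as direct for Accommodation and country as indirect for Country, matching the examples given for Figure 9.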

Both direct and indirect mapping are applicable to the three basic mapping possibilities introduced above, thus resulting in ET_Rdirect/indirect, ET_Adirect/indirect, and A_Adirect/indirect mappings. Furthermore, the possibility of a direct mapping always implies the possibility of an indirect mapping due to vertical partitioning. This differentiation is also made in [59], where it is proposed to inline as many sub elements as possible to reduce fragmentation (direct mapping) and to keep multi-valued elements and elements involved in recursive associations in separate tables (indirect mapping).

3.2. Reasonable mappings

After introducing the basic kinds of mappings, this section discusses reasonable mappings. Reasonable mappings may serve as mapping patterns when mapping XML concepts to RDBS concepts. If one tries to map two existing schemata to each other, mapping patterns can be used to facilitate this mapping process at a syntactical level by analyzing the structure of both schemata and proposing potential mappings as well as preventing others because of syntactical conflicts. This focus differs from approaches supporting default mapping rules to derive one schema from the other (cf., e.g., [9,32,54,59]).

The determining factors can be categorized into characteristics of the XML element type (cf. Section 3.2.1) and characteristics of the XML attribute (cf. Section 3.2.2). In order to illustrate the subsequent investigations, Figure 10 provides a comprehensive running example building on the ones given in the previous section. The example is intended to show as many mapping possibilities as possible (cf. footnote 8). Figure 10 shows the running example in terms of a DTD and in terms of a relational schema. The latter is depicted with a table structure and as a UML class diagram, which better visualizes relationships. Concerning the relational schema, primary keys are formatted in bold face and underlined, and foreign keys are depicted using italic type.

Even this small example shows that data model heterogeneity and schema heterogeneity prevent a simple one-to-one mapping. Regarding the DTD illustrated in Figure 10, there is a single root element type accommodations having no relational counterpart. Its component element type accommodation contains various element types, which have either relational attributes or relations as counterparts. The different cardinalities specified for each of these element types correspond to those defined on the relational side. Regarding the composite element type address and its atomic component element types street, village, and country, it can be seen that the relational schema does not contain a relation Address with attributes Street, Village, and Country. Moreover, there is no counterpart for address in the relational schema at all, and its component element types correspond to attributes located in three different relations connected by ‘*:1’ relationships, namely attribute Street of relation Accommodation, attribute Name of relation Village, and attribute Name of relation Country. Having three relations instead of one is the consequence of the normalization process.

The element type accommodation as well as some of its component element types contain attributes. One of these attributes, namely state, has the fixed value ‘Austria’ and therefore lacks a relational counterpart. The attribute kind is restricted to an enumeration of two values. The composite element type description has mixed content, comprising the atomic element type rating, meaning that elements of this type may occur several times mixed with atomic data in any order within an XML document. Note that the attributes RatingOrder of the two classes ActualRating and RatingDescription are not mapped to any XML concept. They express an absolute order over both rating descriptions and actual ratings with respect to a certain accommodation. This is not necessary on the XML side, since the order is implicitly defined by the position of the elements within the XML document.

3.2.1. Element type characteristics. As already mentioned, choosing a certain mapping is based on characteristics of the element type to be mapped. As illustrated in Figure 11, these decisive characteristics can be categorized into three orthogonal dimensions comprising the kind of element type, whether it contains attributes, and its cardinality. Note that if the element type has been declared to have ANY content (cf. Section 2.2), a reasonable mapping cannot be determined in advance. Depending on the combination of these characteristics, certain reasonable mappings can be determined as shown in Table 5. In the following, these mappings are discussed by means of the running example.

First, we consider composite element types with element content. Mapping this kind of element type is influenced neither by cardinality nor by whether it contains any attributes. Since there are no values associated with elements of this type, the only reasonable mapping possibility is ET_R. Depending on whether the element type can be mapped to its parent base relation or not, ET_Rdirect or ET_Rindirect mapping can be used. In fact, the lack of any mapping would not result in a loss of information, since elements of this type contain no values which could be stored in the database.

Figure 11. Orthogonal dimensions characterizing XML element types.

Table 5. Reasonable mappings of XML element types.

Kind of element type                Contains attributes  Cardinality   Reasonable mapping
Composite ET with element content   No influence         No influence  ET_Rdirect/indirect; No mapping
Atomic ET                           No influence         ?, 1          ET_Adirect/indirect
Atomic ET                           No influence         +, *          ET_Aindirect
Empty ET                            No                   1             No mapping
Empty ET                            Yes                  1             ET_Rdirect/indirect; No mapping
Empty ET                            No influence         ?             ET_Adirect
Empty ET                            No influence         *, +          ET_Aindirect
Composite ET with mixed content     No influence         No influence  ET_Aindirect

Concerning our running example, whereas the root element type accommodations does not require any mapping, the element type accommodation is mapped to the relation Accommodation (ET_R mapping). Since accommodation does not have a parent base relation, we do not distinguish between a direct and an indirect mapping in this case.

Next, let us consider the mapping of an atomic element type. The reasonable mappings of such element types depend on the cardinality only and are not influenced by the existence of XML attributes. Since atomic element types contain values, they always require a mapping to relational attributes, i.e., an ET_A mapping. In case of cardinality ‘?’ and ‘1’, an ET_Adirect mapping is possible, since no more than one element may occur. However, an ET_Aindirect mapping may also be necessary when the relational attribute which the atomic element type should be mapped to is not part of the parent base relation. In case of cardinality ‘*’ and ‘+’, ET_Aindirect mapping is required due to normalization.

Concerning our running example, the simplest case is represented by element type name, which has cardinality ‘1’ and is mapped to attribute Name of base relation Accommodation, representing an ET_Adirect mapping. Accommodation is mapped to element type accommodation, the direct ancestor element type of element type name, i.e., the base relation and the parent base relation are the same. This kind of mapping also applies to element type street. In this case the parent element type address has no mapping, and the ancestor element type accommodation is mapped to the relation that contains the relational counterpart Street. The element types village and country require ET_Aindirect mappings, since their relational counterparts are stored in base relations different from the parent base relation Accommodation due to normalization. The relational counterparts are attribute Name of base relation Village and attribute Name of base relation Country, respectively. This kind of mapping is possible since Accommodation and Village, as well as Village and Country, are directly connected via foreign key relationships. Element type email has cardinality ‘*’, requiring an ET_Aindirect mapping, and is therefore mapped to attribute Email of relation EmailAddress. The same holds true for element type rating, with the difference that the parent base relation Accommodation and the base relation PossibleRating containing an attribute Rating are indirectly connected via the relation ActualRating, thereby explicitly demonstrating schema heterogeneity. Another example of schema heterogeneity is given by the empty element type pool, which is mapped to the relational attribute Name of relation Pool storing the names of pools.
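The effect of direct and indirect ET_A mappings on document composition can be illustrated with a small in-memory database. The following sketch is not taken from X-Ray: the foreign key columns VillageID and CountryID, the key values, the sample data, and the function `compose_accommodation` are assumptions made for illustration, since Figure 10 is not reproduced here. The element types name and street are read from the parent base relation, whereas village and country are reached via joins along the foreign keys.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Country (CountryID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Village (VillageID INTEGER PRIMARY KEY, Name TEXT,
                      CountryID INTEGER REFERENCES Country(CountryID));
CREATE TABLE Accommodation (AccID TEXT PRIMARY KEY, Name TEXT, Street TEXT,
                            VillageID INTEGER REFERENCES Village(VillageID));
INSERT INTO Country VALUES (1, 'Austria');
INSERT INTO Village VALUES (10, 'Hallstatt', 1);
INSERT INTO Accommodation VALUES ('acc100', 'Seehotel', 'Seestrasse 1', 10);
""")


def compose_accommodation(acc_id):
    """Build one accommodation element from three normalized relations."""
    row = conn.execute(
        """
        SELECT a.AccID, a.Name, a.Street, v.Name, c.Name
        FROM Accommodation a
        JOIN Village v ON v.VillageID = a.VillageID   -- indirect: village
        JOIN Country c ON c.CountryID = v.CountryID   -- indirect: country
        WHERE a.AccID = ?
        """,
        (acc_id,),
    ).fetchone()
    if row is None:
        return None
    acc_key, name, street, village, country = row
    acc = ET.Element("accommodation", {"id": acc_key})    # A_A: id -> AccID
    ET.SubElement(acc, "name").text = name                # ET_A direct
    address = ET.SubElement(acc, "address")               # grouping only, unmapped
    ET.SubElement(address, "street").text = street        # ET_A direct
    ET.SubElement(address, "village").text = village      # ET_A indirect
    ET.SubElement(address, "country").text = country      # ET_A indirect
    return acc


xml_text = ET.tostring(compose_accommodation("acc100"), encoding="unicode")
```

The address element is created without any database access, which reflects that it merely groups address data and has no relational counterpart.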

Regarding empty element types with a cardinality of ‘1’, no mapping is required regardless of whether there are attributes, since a corresponding element occurs exactly once without carrying any value. However, if there are attributes, it makes sense to employ a direct or indirect ET_R mapping, since the base relation could then serve as the base relation for the attributes. In case of any other cardinality, the existence of attributes does not influence the reasonable mappings. An ET_A mapping is required in any case. It depends on the particular cardinality whether a direct or indirect mapping is reasonable.

Referring to our example, the empty element types facilities without attributes and sauna including a single attribute represent the simplest case, both having a cardinality of one and thus requiring no mapping. The attribute available of element type sauna is mapped to the relational attribute Sauna of the parent base relation of the element type sauna, namely Accommodation. The optional empty element type acceptsCreditCard contains no attributes and is mapped directly to the relational attribute AcceptsCreditCard of its parent base relation Accommodation. Finally, the empty element types phone and pool, having a cardinality of ‘+’ and ‘*’, respectively, are mapped via ET_Aindirect to the relational attribute Number of the relation Phone and the relational attribute Name of the relation Pool, respectively.

Considering composite element types with mixed content, neither the existence of attributes nor the cardinality has any influence on the reasonable mappings. Since at the instance level several values may occur within a single element, an ET_Aindirect mapping is required. Our example contains one composite element type with mixed content, namely description, which is mapped to the attribute Description of the relation RatingDescription. The attributes RatingOrder of the two relations ActualRating and RatingDescription are, as already mentioned, not mapped to any XML concept, since they express an absolute order over both rating descriptions and actual ratings with respect to a certain accommodation.

It has to be noted that if one or several ancestor element types are not mapped and any of these ancestor element types has a cardinality of ‘*’, the next component element type being mapped can be mapped indirectly only.
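The decision rules of Table 5, together with the restriction for unmapped ancestors with cardinality ‘*’, can be written down as a lookup function. This is a sketch for illustration only: the function name, the string constants for the kinds of element types, and the parameter `unmapped_star_ancestor` are assumptions and not part of X-Ray.

```python
DIRECT = {"ET_Rdirect", "ET_Adirect"}


def reasonable_element_mappings(kind, has_attributes, cardinality,
                                unmapped_star_ancestor=False):
    """Return the reasonable mappings of Table 5 for one element type.

    kind: "composite_element", "atomic", "empty", or "mixed"
    cardinality: "1", "?", "*", or "+"
    """
    if cardinality not in ("1", "?", "*", "+"):
        raise ValueError(f"unknown cardinality: {cardinality}")
    if kind == "composite_element":
        result = ["ET_Rdirect", "ET_Rindirect", "no mapping"]
    elif kind == "atomic":
        if cardinality in ("1", "?"):
            result = ["ET_Adirect", "ET_Aindirect"]
        else:
            result = ["ET_Aindirect"]
    elif kind == "empty":
        if cardinality == "1":
            result = (["ET_Rdirect", "ET_Rindirect", "no mapping"]
                      if has_attributes else ["no mapping"])
        elif cardinality == "?":
            result = ["ET_Adirect"]
        else:
            result = ["ET_Aindirect"]
    elif kind == "mixed":
        result = ["ET_Aindirect"]
    else:
        raise ValueError(f"unknown kind of element type: {kind}")
    if unmapped_star_ancestor:
        # An unmapped ancestor with cardinality '*' rules out direct mappings.
        result = [m for m in result if m not in DIRECT]
    return result
```

For the running example, `reasonable_element_mappings("atomic", False, "*")` covers element type email, and `reasonable_element_mappings("empty", False, "?")` covers acceptsCreditCard.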

3.2.2. XML attribute characteristics. The mapping of XML attributes depends on two orthogonal dimensions comprising the multiplicity of the XML attribute, i.e., whether it is single-valued or multi-valued, and the default declaration (cf. Figure 12). Considering the different combinations of these two dimensions, three reasonable mapping alternatives may be identified, as shown in Table 6.

Figure 12. Orthogonal dimensions characterizing XML attributes.

Table 6. Reasonable mappings of XML attributes.

Multiplicity of XML attribute  Default declaration                 Reasonable mapping
No influence                   #FIXED                              No mapping
Single-valued                  #REQUIRED, #IMPLIED, Default Value  A_Adirect/indirect
Multi-valued                   #REQUIRED, #IMPLIED, Default Value  A_Aindirect

For XML attributes with default declaration being #FIXED, no mapping is necessary independent of the multiplicity of the XML attribute. In our example, the XML attribute state of the element type accommodation has the constant value Austria. Regarding XML attributes which are not specified to be #FIXED, it has to be distinguished whether they are single-valued like attributes of type CDATA or multi-valued like attributes of type IDREFS. Single-valued attributes can be directly mapped to relational attributes (A_Adirect) or may require indirect mapping due to normalization (A_Aindirect), whereas multi-valued attributes may be mapped indirectly (A_Aindirect) only. Considering attributes of type ID and IDREF(S), it seems conceivable to map them to primary key attributes and foreign key attributes, respectively, of the relational schema. Due to data model heterogeneity, however, this is not always feasible, since there are differences concerning scope and composite keys (cf. Sections 2.5 and 2.6).

In our example, directly mapped single-valued attributes comprise id and kind of element type accommodation, number of element type phone, and available of element type sauna. Single-valued attributes which have to be mapped indirectly are postalCode of element type address and yearOfFoundation of element type village. Multi-valued attributes are not part of our example. It has to be emphasized that, with one exception, the reasonable mappings of an attribute are independent of the kind of mapping of its element type. If the element type of the attribute is not mapped and any of its unmapped ancestor element types has a cardinality of ‘*’, the attribute can be mapped via A_Aindirect only.
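The rules of Table 6 and the exception just described can be summarized in the same style. Again, this is only a sketch: the function names and parameters are hypothetical. Besides IDREFS, which the text names, the set of multi-valued types below also lists NMTOKENS and ENTITIES, the other list-valued attribute types of XML 1.0 DTDs.

```python
# List-valued attribute types of XML 1.0 DTDs.
MULTI_VALUED_TYPES = {"IDREFS", "NMTOKENS", "ENTITIES"}


def is_multi_valued(attribute_type):
    """True for DTD attribute types that hold a list of values."""
    return attribute_type in MULTI_VALUED_TYPES


def reasonable_attribute_mappings(attribute_type, default_declaration,
                                  unmapped_star_ancestor=False):
    """Return the reasonable mappings of Table 6 for one XML attribute.

    default_declaration: "#FIXED", "#REQUIRED", "#IMPLIED", or "default"
    unmapped_star_ancestor: the element type of the attribute is unmapped
        and one of its unmapped ancestors has cardinality '*'
    """
    if default_declaration == "#FIXED":
        return ["no mapping"]
    if is_multi_valued(attribute_type) or unmapped_star_ancestor:
        return ["A_Aindirect"]
    return ["A_Adirect", "A_Aindirect"]
```

For instance, the fixed attribute state of the running example yields no mapping, while a single-valued CDATA attribute allows both A_Adirect and A_Aindirect.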

4. Design alternatives for integrating XML and RDBS

This section discusses design alternatives for integrating XML and RDBS, together with design goals and corresponding design decisions taken for our system X-Ray (cf. Section 5). As already mentioned, our focus is on integrating already existing schemata representing the universe of discourse. Therefore, design alternatives that store XML documents within a single attribute or decompose them according to their graph structure are not considered further. The design alternatives considered also constitute the basis for the discussion of related work in Section 6 and can be categorized into three dimensions, comprising the schemata which should be integrated, the mapping between these schemata, and the access to the stored data (cf. Figure 13).

Kind of schemata. Regarding the first dimension, one has to consider the kind of schemata at both the XML-side and the DB-side, offering a derived approach and a user-defined approach (cf. Figure 14). The derived approach requires that either the DB schema is derived from the XML schema according to certain pre-defined rules or vice versa. Concerning the derivation process, one can distinguish different degrees of automation, depending on whether derivation rules can be configured manually or not. The user-defined approach allows the DB schema to be developed independently of the XML schema and vice versa. This is typical for an electronic commerce scenario where part of the content stored within the database should be transferred to business partners according to a standardized schema, or where XML documents received should be stored within an existing DB. The mapping between the schemata is not derived on the basis of pre-defined rules but rather defined by a user, possibly with appropriate support by the system. Since a major design goal of X-Ray is to support existing schemata rather than to (semi-)automatically derive schemata from each other, the only feasible design decision is to adhere to the user-defined approach. To cope with the various heterogeneity problems arising when mapping two existing schemata, we provide mapping patterns for resolving data model and schema heterogeneities.

Figure 13. Design alternatives.

                                  Schema at DB-side
                                  Derived           User-defined
Schema at XML-side  Derived       Not applicable    Derived approach
                    User-defined  Derived approach  User-defined approach

Figure 14. Kind of schemata.

Representation of mapping knowledge. To perform the necessary transformations when inserting or retrieving XML documents or parts thereof, appropriate mapping knowledge has to be managed by the system in one way or another. Regarding its representation, one can distinguish between hard-coding mapping knowledge within applications or within a query, respectively, and reifying the mapping knowledge within a file or within a DB. Hard-coding mapping knowledge can be done either at runtime, meaning that the user issuing the request must have knowledge about the two schemata and the mapping in between, or already at initialization time, thereby ensuring mapping transparency for subsequent access. Hard-coding is a feasible solution in case of the above-mentioned derived approach, since the mapping knowledge required remains the same independent of changes to the initial schema. If changes to the mapping knowledge are likely, hard-coding is very inflexible, since such changes would require implementation efforts and possibly a recompilation of the system. According to [25], hard-coding mapping knowledge within a query may in addition result in very large and complex queries, decreasing flexibility and maintainability. In contrast, reification of mapping knowledge facilitates retrieval and maintainability, especially if, instead of plain files, a DB is used for storage purposes. Since an important design goal of X-Ray is to realize mapping transparency while enhancing maintainability, X-Ray stores mapping knowledge within a meta schema managed by an RDBS.
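The difference between hard-coded and reified mapping knowledge can be illustrated with a deliberately simplified table. The sketch below is an assumption for illustration and does not show the actual X-Ray meta schema: the table `Mapping`, its columns, the path-like concept names, and the functions `target_of` and `remap` are all hypothetical. The point is that a changed mapping becomes a data update instead of a code change.

```python
import sqlite3

meta = sqlite3.connect(":memory:")
meta.executescript("""
CREATE TABLE Mapping (
    xml_concept   TEXT PRIMARY KEY,
    kind          TEXT NOT NULL CHECK (kind IN ('ET_R', 'ET_A', 'A_A')),
    base_relation TEXT NOT NULL,
    db_attribute  TEXT            -- NULL for ET_R mappings
);
INSERT INTO Mapping VALUES
    ('accommodation',      'ET_R', 'Accommodation', NULL),
    ('accommodation/name', 'ET_A', 'Accommodation', 'Name'),
    ('accommodation/@id',  'A_A',  'Accommodation', 'AccID');
""")


def target_of(xml_concept):
    """Look up (base_relation, db_attribute) for an XML concept at runtime."""
    return meta.execute(
        "SELECT base_relation, db_attribute FROM Mapping WHERE xml_concept = ?",
        (xml_concept,),
    ).fetchone()


def remap(xml_concept, base_relation, db_attribute):
    """Change a stored mapping without touching application code."""
    meta.execute(
        "UPDATE Mapping SET base_relation = ?, db_attribute = ? "
        "WHERE xml_concept = ?",
        (base_relation, db_attribute, xml_concept),
    )
```

A composer or decomposer that calls `target_of` for every concept stays unchanged when `remap` is used, which is the maintainability argument made above for reification.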

Coupling with schemata. Coupling of mapping knowledge with the schemata involved in the integration can be either tight or loose. Tight coupling means that mapping knowledge is intermingled with either the XML schema or the DB schema, whereas loose coupling allows mapping knowledge to be stored separately. The main drawback of tight coupling is that it requires changing existing schemata, thus violating schema autonomy. Since schema autonomy is a crucial design goal of X-Ray, we adhere to the loose coupling approach. For this purpose, the above-mentioned meta schema reifies the schemata at both the XML-side and the DB-side.

Mapping cardinality. The criterion of mapping cardinality describes, on the one hand, the possibility of mapping a certain XML schema to multiple different schemata at the DB-side and, on the other hand, the opportunity of mapping a certain DB schema to multiple different schemata at the XML-side. Mapping a certain XML schema to multiple schemata at the DB-side would make it possible to provide a global XML view of relational data stored in several different databases, e.g., product databases from different subsidiaries. Vice versa, a certain DB schema could be used to supply several XML documents conforming to different schemata with relational data. For example, relational data could be published according to varying schemata given by different business partners. Since a design goal of X-Ray is to support multiple schemata, both options are supported by allowing multiple relationships between the reified schemata.

Access capability. The criterion of access capability describes whether the focus of a system is on storing XML documents or on publishing XML documents or other relational data out of a DB. X-Ray aims at providing a unified approach, therefore supporting both directions.

Access language. The languages used for storing and retrieving XML documents or parts thereof can be mainly divided into DB-centric (e.g., SQL or extensions thereof) and XML-centric (e.g., XQuery or XPath). There are also other possibilities, e.g., providing predefined functions. The goal of X-Ray in this respect is to support homogeneous access, meaning that both the query language and the result adhere to the same data model. Thus, X-Ray is based on the XML-centric approach, using an XML query language.

Access target. The access target reflects whether a request can be issued directly against the DB schema, against the XML schema, and/or against the mapping knowledge itself. Directly accessing the DB schema requires the user to have knowledge about two schemata belonging to different data models and the required mapping in between. If the access target is an XML schema, this schema constitutes an XML view over the relational schema, thus achieving DB schema transparency, meaning that the user is not concerned with the structure of the underlying DB schema. The necessary transformations and the finally required access to the relational schema are performed by the system automatically. If this view is virtual, retrieval and materialization are done for the actually accessed data only. Performance reasons, however, could also favor materializing the view or parts of it. As X-Ray aims at achieving DB schema transparency, an XML schema realizing a virtual XML view is used as access target. In addition, convenient access to mapping knowledge is facilitated, since the schemata to be mapped are reified together with the mappings in between within a meta schema.

5. Overview of X-Ray

This section is dedicated to an overview of the X-Ray approach. The main focus is on the meta schema component, which represents one of the most distinguishing characteristics with respect to other closely related approaches (cf. Section 6). It is not the aim of this section to provide an in-depth description of all X-Ray components.

5.1. Architecture

The overall architecture of X-Ray consists of three main components, namely the generic meta schema, the mapping knowledge editor, and the composer/decomposer component (cf. Figure 15).

Before X-Ray can be used for storing and retrieving XML documents, the mapping knowledge required for mapping a certain XML schema to a certain relational schema has to be specified in an initialization phase. To support this task, the X-Ray architecture provides a mapping knowledge editor. On the basis of the database schema and the DTD, the user may interactively specify the required mappings, guided by the proposed mapping patterns. As soon as the system is initialized with the mapping knowledge, which is stored within the meta schema repository, the user is able to transparently issue queries using Quilt [17] against a virtual XML view specified by a DTD. It is also possible to access the XML view and one or several XML documents, no matter where they are stored, by a single query. Quilt is a second-generation XML query language having major impact on XQuery [71], the XML query language proposed by the W3C, since it was developed by synthesizing concepts from several other XML query languages. Utilizing the mapping knowledge, the query is decomposed into corresponding SQL queries on the relational database. The result is used to compose XML documents out of flat relational data. The composer/decomposer component serves for storing and retrieving XML documents and therefore performs all necessary transformations based on the mapping knowledge stored in the meta schema. To realize the composition/decomposition task, Kweelt [55], a Java framework for querying XML based on Quilt, has been used and slightly extended. A prototype of X-Ray is operational and supports retrieval and storage of XML documents [31].

Figure 15. Architecture of X-Ray.

5.2. Meta schema

The insights gained in the previous sections concerning data model heterogeneity, schema heterogeneity, and the mapping possibilities between XML and relational schemata provide the basis for the design of the meta schema of X-Ray. The meta schema is the key mechanism for the genericity of X-Ray, allowing DTDs (cf. footnote 9) and relational schemata to be mapped. The meta schema consists of three components describing the relevant meta knowledge (cf. Figure 16). The DBSchema component is responsible for storing information about relational schemata that shall be mapped to DTDs to make their data available to XML documents or that shall be used to store XML documents. Analogously, the XMLDTD component stores schema information about XML documents as specified by means of DTDs. Finally, the XMLDBSchemaMapping component stores the mapping knowledge between DBSchema and XMLDTD. The goal of XMLDBSchemaMapping is to bridge both data model heterogeneity and schema heterogeneity in order to support a proper mapping.

In X-Ray, a database schema is not limited to being mapped to a single DTD but may be mapped to several DTDs and vice versa. This is reasonable since, due to presentation requirements, it may be necessary to represent a particular piece of information by several XML documents based on different DTDs. Likewise, if we assume several relational schemata storing data of the same domain, it may be required to represent these data by XML documents based on the same DTD. Concerning the storage of the meta knowledge itself, X-Ray comprises both a relational representation of the meta schema stored within a relational database and an object-oriented representation for main memory mapping. The latter is initialized with the content of the relational meta schema at the beginning of an X-Ray session, thus allowing an efficient composition and decomposition of XML documents at runtime. The object-oriented representation in terms of UML class diagrams is also used throughout this section to concisely and precisely depict the various meta schema components.

Figure 16. Components of the X-Ray meta schema.

5.2.1. Database schema component. Concerning the database schema component, it has to be emphasized that it is not necessary to store meta knowledge about the complete relational schema, but only about those relations and attributes being relevant for the mapping to the DTD. However, not only base relations and their attributes are relevant, but also non-base relations, which are the connecting relations between two base relations.

As illustrated in Figure 17, DBSchema contains at least one DBRelation, which consists of at least one DBAttribute. DBAttribute stores, among other things, its atomic domain and whether it represents a primary key attribute. DBRelation and DBAttribute are generalized to DBConcept. Relationships (DBRelationship) connect two relations and specify one or more join segments (DBJoinSegment) comprising the join attributes, i.e., primary key and foreign key attributes of two relations that realize the relationship. The relationship comprises more than one join segment if the primary key is composed of two or more attributes. If parts of an XML document are stored within different relations, information about the proper join paths (DBJoinPath) is necessary. A DBJoinPath consists of one or more relationships. It comprises more than one relationship if more than two relations have to be joined for composing or decomposing a particular part of an XML document. Note that there is no difference between relationships connecting different relations and those referring to one and the same relation, neither with respect to storing information within the meta schema about a recursive relationship or a recursive element type, respectively, nor concerning the mapping between them.
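The classes described above can be sketched as plain data classes. The class names follow the meta schema; the field names, the `to_sql` helper, and the example relations `author`, `writes`, and `book` are assumptions made for illustration and are not taken from Figure 17.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DBAttribute:
    name: str
    domain: str                    # atomic domain, e.g. "INTEGER"
    is_primary_key: bool = False

@dataclass
class DBRelation:
    name: str
    attributes: List[DBAttribute]  # at least one attribute

@dataclass
class DBJoinSegment:
    # One pair of join attributes (key on one side, foreign key on the other).
    left_attribute: str
    right_attribute: str

@dataclass
class DBRelationship:
    left: DBRelation
    right: DBRelation
    segments: List[DBJoinSegment]  # more than one for composite keys

@dataclass
class DBJoinPath:
    relationships: List[DBRelationship]

    def to_sql(self) -> str:
        """Render the join path as a SQL join (hypothetical helper)."""
        sql = f"SELECT * FROM {self.relationships[0].left.name}"
        for rel in self.relationships:
            cond = " AND ".join(
                f"{rel.left.name}.{s.left_attribute} = "
                f"{rel.right.name}.{s.right_attribute}"
                for s in rel.segments)
            sql += f" JOIN {rel.right.name} ON {cond}"
        return sql

# Two base relations linked through the connecting relation "writes".
author = DBRelation("author", [DBAttribute("id", "INTEGER", True)])
writes = DBRelation("writes", [DBAttribute("author_id", "INTEGER", True),
                               DBAttribute("isbn", "TEXT", True)])
book = DBRelation("book", [DBAttribute("isbn", "TEXT", True)])
path = DBJoinPath([
    DBRelationship(author, writes, [DBJoinSegment("id", "author_id")]),
    DBRelationship(writes, book, [DBJoinSegment("isbn", "isbn")]),
])
```

Here `path.to_sql()` yields a two-step join from `author` over `writes` to `book`. A recursive relationship (both ends referring to the same relation) would additionally need table aliases, which this sketch omits.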

Figure 17. Meta schema of the relational schema.

5.2.2. XML DTD component. Similar to the database schema component, it is not necessary to store meta knowledge about the complete DTD, but only about those parts being relevant for the mapping to the relational schema¹⁰. The meta knowledge specifies that a DTD (XMLDTD, cf. Figure 18) has a certain element type (XMLElemType) that serves as root. For element types with attributes, XMLAttribute stores information about their atomic domains and their default declaration. Similar to the database schema component, XMLElemType and XMLAttribute are generalized to XMLConcept. For enumeration attributes the possible values are stored within XMLAttValEnum. According to the distinction made in Section 2.2, XMLElemType is specialized into XMLAtomicET, XMLEmptyET, and XMLCompositeET. The latter is further specialized into XMLCompositeETMixedContent and XMLCompositeETElemContent.

Figure 18. Meta schema of the DTD.

The nesting structure of an XMLCompositeETElemContent is described by the package CompositionStructure (cf. Figure 19). For an XMLCompositeETMixedContent the nesting structure need not be represented in the meta schema since, as already mentioned, component element types are allowed to occur only in a choice with cardinality ‘*’.

Figure 19. Meta schema of the XML composition structure.

For component element types occurring in an XMLSequence or in an XMLChoice, the cardinality of the element type and, in case of a sequence, its position have to be stored. Furthermore, arbitrary combinations of sequences and choices can be described.
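A composition structure of this kind can be sketched as a small tree of sequences and choices. The names `XMLSequence` and `XMLChoice` follow the meta schema; `ElemRef`, the field names, the cardinality on whole groups, and the `content_model` function are assumptions for illustration. The position within a sequence is represented simply by list order.

```python
from dataclasses import dataclass

@dataclass
class ElemRef:
    """A component element type together with its cardinality."""
    name: str
    cardinality: str = ""   # one of "", "?", "*", "+"

@dataclass
class XMLSequence:
    members: list           # ElemRef, XMLSequence or XMLChoice; order = position
    cardinality: str = ""

@dataclass
class XMLChoice:
    members: list           # ElemRef, XMLSequence or XMLChoice
    cardinality: str = ""

def content_model(node) -> str:
    """Render a composition structure as a DTD content model."""
    if isinstance(node, ElemRef):
        return node.name + node.cardinality
    separator = ", " if isinstance(node, XMLSequence) else " | "
    inner = separator.join(content_model(m) for m in node.members)
    return "(" + inner + ")" + node.cardinality

# A sequence that contains a nested, optional choice.
book = XMLSequence([
    ElemRef("title"),
    ElemRef("author", "+"),
    XMLChoice([ElemRef("isbn"), ElemRef("issn")], "?"),
])
```

For this example, `content_model(book)` produces the content model that would appear in a declaration such as `<!ELEMENT book (title, author+, (isbn | issn)?)>`.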

5.2.3. Mapping knowledge. The mapping knowledge is expressed by various associations between the object classes of the XML DTD component and the database schema component. Figure 20 illustrates these mapping relationships, denoting them with bold lines. For representation convenience, only those object classes are shown which are part of a mapping relationship. In order to meet the requirement that the meta schema is able to store mappings between different DTDs (XMLDTD) and different database schemata (DBSchema), the mapping between the class XMLConcept and the class DBConcept takes part in a ternary relationship with the association class XMLDBSchemaMapping. As discussed in Section 3.2, deciding on the exact kind of element type is a prerequisite for deciding on a reasonable mapping to a database concept. Consequently, the leaf classes of the XMLElemType hierarchy are mapped to DBAttribute with two exceptions. The class XMLCompositeETElemContent is mapped to DBRelation, and the mapping of class XMLEmptyET is not further refined, since it inherits the (ternary) association to DBConcept.

Figure 20. Meta schema describing the mapping knowledge.

Besides the mapping relationships depicted in Figure 20, there are also relationships to class DBJoinPath (cf. Figure 17) which are not illustrated for representation convenience. Due to space restrictions, the attributes of the various object classes are also not shown. An example mapping in terms of the filled-in meta schema can be found in [42].
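The ternary association can be illustrated by a minimal sketch in which each mapping object ties XML concepts to database concepts for one pair of DTD and database schema. The class name XMLDBSchemaMapping and the concept kinds follow the meta schema; the method names, the string-based representation, the example names, and the entry for XMLAttribute are assumptions and not part of the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Kinds of DB concept that each kind of XML concept may be mapped to.
ALLOWED_TARGETS = {
    "XMLAtomicET": {"DBAttribute"},
    "XMLCompositeETMixedContent": {"DBAttribute"},
    "XMLCompositeETElemContent": {"DBRelation"},
    "XMLEmptyET": {"DBRelation", "DBAttribute"},  # not refined: any DBConcept
    "XMLAttribute": {"DBAttribute"},              # assumption for this sketch
}

@dataclass
class XMLDBSchemaMapping:
    """Mapping knowledge for one (DTD, database schema) pair."""
    dtd: str
    db_schema: str
    pairs: Dict[Tuple[str, str], Tuple[str, str]] = field(default_factory=dict)

    def add(self, xml_kind, xml_name, db_kind, db_name):
        if db_kind not in ALLOWED_TARGETS[xml_kind]:
            raise ValueError(f"{xml_kind} cannot be mapped to {db_kind}")
        self.pairs[(xml_kind, xml_name)] = (db_kind, db_name)

    def target(self, xml_kind, xml_name):
        return self.pairs[(xml_kind, xml_name)]

# The same DTD mapped to two different database schemata.
library = XMLDBSchemaMapping("book.dtd", "library_db")
library.add("XMLCompositeETElemContent", "book", "DBRelation", "book")
library.add("XMLAtomicET", "title", "DBAttribute", "book.title")

shop = XMLDBSchemaMapping("book.dtd", "shop_db")
shop.add("XMLCompositeETElemContent", "book", "DBRelation", "product")
```

Because the mapping object itself is part of the association, the element type `book` can have different targets in `library` and `shop` without conflict.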

6. Related work

This section provides an in-depth survey of related work. This survey comprises six different research prototypes as well as three of the most prominent commercial RDBS, namely DB2, Oracle, and SQLServer. The rationale behind choosing these nine was to assemble a representative mix of current approaches supporting different concepts which are closely related to X-Ray. Another intent was to evaluate not only research approaches but also commercially available systems. Commercial systems in particular often support different integration alternatives, each of them pursuing another goal. We have considered those alternatives closely related to X-Ray.

Each of the selected approaches is described in the following within a separate subsection, using the design alternatives discussed in Section 4 as evaluation criteria. Thereby, we intend to provide an overall understanding of each approach before discussing general findings of the evaluation with respect to the X-Ray approach. The results of this survey are summarized in Table 7. It has to be noted that there are already papers available focusing on a comparison of different approaches for integrating XML and RDBS (cf., e.g., [36,45,65]). The following survey differs from these, since it discusses the different approaches specifically from the viewpoint of the design alternatives considered by X-Ray.


Table 7. Overview of evaluation results.

[Table body not recoverable from the source layout; only its structure is listed here.
Columns: Research approaches (Agora, LegoDB, MARS, MXM, SilkRoute, XTABLES); Commercial systems (DB2: XML Extender; Oracle: XMLType, XSU, XML View; SQLServer: FOR XML, Annotated Schemata, OpenXML); X-Ray.
Rows, Schemata: kind of schemata (derived approach, user-defined approach).
Rows, Mapping: representation of mapping knowledge (hard-coded at runtime, hard-coded at initialisation time, reified in file, reified in DB); coupling with schemata (tight, loose); mapping cardinality (multiple at DB-side, multiple at XML-side).
Rows, Access: access capability (publishing, storage); access language (DB-centric, XML-centric, other); access target (DB schema, XML schema, mapping knowledge).]


6.1. Agora

Schemata. Agora [46,47] is a data integration system which provides a global virtual XML schema. It integrates existing relational and DOM-compliant data sources and translates XML queries on the global schema into (SQL) queries of the underlying data sources regardless of the kind of mapping employed. The global XML schema can also already exist in terms of, e.g., a standardized schema available as DTD or XML Schema, thus supporting a user-defined approach.

Mapping. The data sources are defined as SQL views over a global XML schema (the so-called “local-as-view” (LAV) approach), thereby hard-coding the mapping knowledge at initialisation time. For this, a normalised relational schema consisting of 11 tables serves as an intermediate layer to losslessly represent the content of XML documents in a relational way. This intermediate schema, which can be virtual, is also the basis for query translation. It has to be emphasised that this intermediate schema is completely different from the meta schema employed in X-Ray. First, it is independent of the mapping between the global XML schema and the real data sources; second, it is only used to get queries across the language gap; and third, it stores the real content of the XML document instead of information about this content. Coupling of the mapping knowledge with the schemata is loose; mapping cardinality seems to be multiple at the DB-side only.

Access. The main focus of Agora is on publishing existing (relational) data. The processing of XML-centric queries on the global XML schema, which are expressed by means of XQuery, is done in three steps. First, the query is normalized, applying equivalent transformations to enable a direct translation to SQL. Second, the normalized query is translated into an SQL query on the intermediate schema (still working independently of the local data sources). Third, the SQL query is rewritten into an SQL query on the real data sources using their definitions as views over the intermediate schema.

6.2. LegoDB

Schemata. LegoDB [9,51] is a cost-based XML storage mapping engine that explores the space of possible XML-to-relational mappings. In particular, LegoDB generates, with respect to query performance, the “best” mapping and corresponding relational schema for a given XML schema, an XML query workload, and statistics over the XML data, thus supporting a derived approach. The XML schema, expressed with the XML Schema standard or DTDs, is converted (“normalized”) into a schema tree (so-called “physical schema”) consisting of type constructors using an XML query algebra. For this normalization process, only a subset of XML Schema concepts is used, disregarding, e.g., the distinction between groups and complex types as well as local and global declarations. Semantics-preserving schema transformation operations (e.g., inlining/outlining, repetition merge/split) are then repeatedly applied to these physical schemata, and cost estimation is done for each transformed schema until a “good” DB schema and corresponding mapping is found using heuristics.


Mapping. LegoDB uses the physical schemata as a basis for a fixed set of derivation rules which are hard-coded within the system and separated from the schemata. In particular, LegoDB creates a table for each type, maps the contents of the elements into columns of that table, and generates a key column that contains the ID of the corresponding element and a foreign key that keeps track of the parent/child relationship. Based on these derivation rules, an extended XML Schema parser is used to automatically generate the appropriate mapping knowledge in a batch process and to populate the database with the content of the XML documents by generating appropriate SQL insert statements. Different mapping cardinalities are not further considered.
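The derivation rule just described (one table per type, a generated key column, and a foreign key to the parent) can be sketched as follows. This is an illustration only, not LegoDB's implementation; the function name, the column naming scheme, the use of TEXT for all content columns, and the example types `show` and `review` are assumptions.

```python
def derive_table(type_name, columns, parent=None):
    """Derive a CREATE TABLE statement for one type of the physical schema."""
    cols = [f"{type_name}_id INTEGER PRIMARY KEY"]   # generated element ID
    cols += [f"{c} TEXT" for c in columns]           # element contents
    if parent is not None:
        # Foreign key that records the parent/child relationship.
        cols.append(
            f"parent_{parent} INTEGER REFERENCES {parent}({parent}_id)")
    return f"CREATE TABLE {type_name} ({', '.join(cols)})"

# A type "show" with nested, repeating "review" children.
show_ddl = derive_table("show", ["title", "year"])
review_ddl = derive_table("review", ["text"], parent="show")
```

The two statements give `review` a `parent_show` column that references `show(show_id)`, so the nesting of the XML data can be reconstructed with a join.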

Access. Concerning access, the main purpose of LegoDB is the storage of XML documents. A subset of XQuery is used as XML-centric access language at the XML-side, employing a very simple translation algorithm from XQuery to SQL.

6.3. MARS

Schemata. The system MARS (“Mixed And Redundant Storage”) [20,21] focuses, similar to [19], on a mixed XML and relational storage scenario, where redundancy can be exploited to enhance the performance of translating an XML query to the underlying data source. MARS realises a user-defined approach and also handles unstructured parts of data in terms of CLOBs.

Mapping. Mapping knowledge is hard-coded within views defined in both directions, i.e., DB-side and XML-side, thereby realising not only a LAV approach (cf. Agora) but also a “global-as-view” (GAV) approach to data integration¹¹. Similar to Agora, XML documents are represented in a relational way, using an intermediate relational meta schema called “Generic Relational representation of XML” (GReX). This intermediate schema consists of 8 tables and, unlike in Agora, of a set of relational constraints. Mapping from relational data to the published schema is expressed using XQuery. Mapping from the published schema to the storage schema is done using relational constraints. Redundant data used for supporting queries is expressed by materialised SQL views over the intermediate schema.

Access. The major focus of the system is on publishing. Access on the global XML schema is provided by means of XQuery. Due to the combined LAV and GAV approach, one and the same algorithm can be used for performing “rewriting-with-views,” “composition-with-views,” or both.

6.4. MXM

Schemata. MXM (“My XML Mapper”) [3,4] is a declarative XML-to-relational mapping language that allows the specification of several mappings that have been proposed in the literature. MXM takes into account both XML documents without any schema and XML documents conforming either to a DTD or an XML Schema, and generates a target relational schema, thus realizing a derived approach.

Mapping. Mapping knowledge is represented separately from the schemata and reified within XML documents conforming to an XML Schema. In addition, it is stored within a relational database consisting of 11 tables for describing the schema at the XML side, the schema at the DB side, and the mapping in between. The main difference between MXM and X-Ray in this respect is the expressiveness of the meta schema. Since MXM adheres to a derived approach, heterogeneities between the schemata to be mapped can be reduced to a minimum using appropriate derivation rules. Therefore, the meta schema can be kept very simple, which is in contrast to X-Ray, where several possible heterogeneities have to be taken into account because of its user-defined approach. In MXM, there are some very simple default mapping rules (e.g., table names and CLOB names are system-generated, if not explicitly given) which can be configured and extended by users within an XML configuration file. Concerning mapping cardinality, although it seems to be possible to realise multiple mappings at both sides, it is only mentioned that multiple mappings into relational back-ends are possible.

Access. Finally, MXM provides an interface in terms of a set of C functions to query the mapping schema, i.e., all choices made in a mapping, to generate the target relational schema, and to store XML documents in this schema.

6.5. SilkRoute

Schemata. SilkRoute [25,26] is an XML publishing framework that supports a user-defined approach, defining the XML schema using XQuery.

Mapping. The mapping process proceeds as follows. First, the relational schema is transformed automatically into a canonical XML view that represents the DB tables and their attributes in XML format. Then, on the basis of this view, an administrator defines a public XML view, which is virtual, using a subset of XQuery. The public query represents the mechanism to specify the schema at the XML-side together with the mapping knowledge describing how this schema is related to the canonical XML view. Thus, one part of the mapping knowledge is hard-coded within the system to generate the canonical view, and the other part is hard-coded within a query to define the public view, thereby allowing multiple mappings to be defined at the XML-side, but only a single mapping at the DB-side.
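A canonical XML view of this kind can be pictured with a short sketch that renders every table as an element, every tuple as a `row` element, and every attribute as a child element. This is an assumed, simplified layout rather than SilkRoute's actual format, and it materializes the view, whereas the views discussed above are virtual. The function name and the table `clothing` are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

def canonical_view(conn, tables):
    """Render the given tables as one canonical XML document."""
    root = ET.Element("database")
    for table in tables:
        cursor = conn.execute(f"SELECT * FROM {table}")
        names = [d[0] for d in cursor.description]   # attribute names
        table_elem = ET.SubElement(root, table)
        for row in cursor:
            row_elem = ET.SubElement(table_elem, "row")
            for name, value in zip(names, row):
                child = ET.SubElement(row_elem, name)
                child.text = None if value is None else str(value)
    return root

def demo_canonical():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE clothing (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO clothing VALUES (1, 'shirt')")
    return ET.tostring(canonical_view(conn, ["clothing"]), encoding="unicode")
```

A public view would then be defined as a query over this fixed structure, which is where the user-specific part of the mapping knowledge resides.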

Access. Finally, users may access the public XML view by formulating application queries using XQuery to publish XML documents out of the DB. Internally, XQuery expressions are transformed into view forests, a representation that separates the structure of the output XML document from the computations that produce the document’s content. The computations are expressed in SQL, whereby it is attempted to translate XQuery expressions into an optimal set of corresponding SQL queries.


6.6. XTABLES

Schemata/Mapping. The XTABLES approach [29,64], formerly known as “XPERANTO” [13], provides a middleware to publish XML views of relational data, to query XML views, and to store and query XML documents within an RDB. Regarding the publishing component, XTABLES automatically generates a derived default XML view out of the underlying relational data together with a corresponding XML Schema [60]. Based on this simple view, more complex application-specific views can be derived in a hard-coded way using XQuery and materialized on demand [60]. Concerning the storage of XML documents, one of possibly many relational schema generators automatically derives appropriate relational tables for storage purposes. XML documents are shredded according to the schema generation technique used (e.g., [28] or [59]) and stored within the tables. The schema generator as well as the document shredder have to be implemented manually, thus hard-coding mapping knowledge. In addition, according to the schema generation technique used, XTABLES generates a reconstruction XML view over these tables, representing in fact a query over the default XML view of the created tables. Thus, XTABLES eliminates the need to build a new query processor for different relational schema generation techniques.

Access. XML views can again be queried using XQuery. The result is computed by means of a view composition mechanism, ensuring that only those relational data items needed are materialized and not any intermediate results. Furthermore, XQuery can be used to issue queries that span XML documents and XML views of relational data. In any case, most computation is pushed down to the RDB engine [61]. XTABLES can be used on top of any ODBC-compliant RDBS and has already been integrated into IBM’s DB2, named “XML for Tables” for the publishing component and “XML Data Mediator” [33] for the storage component [34].

6.7. DB2 XML Extender

Schemata. IBM’s DB2 XML Extender [6] provides procedures to perform storage and retrieval of XML documents, realizing a user-defined approach.

Mapping. Mapping knowledge is stored within data access definition (DAD) files, which are in fact XML documents. DAD files support the so-called “XML Collection” approach to map XML documents to several database attributes and the so-called “XML Column” approach to store XML documents within a single database attribute. XML Collection provides two ways to define a mapping. “SQL Node mapping” allows an SQL query over the relational schema to be specified, thus hard-coding a part of the mapping knowledge, and the mapping is defined to the relational schema of the query result, whereas “RDB Node mapping” requires a mapping to the relational schema as it is. Since the XML schema is specified within the DAD files using pre-defined elements and attributes, too, the mapping knowledge is intermingled with the XML schema specification. It is possible to define multiple mappings at the XML-side, but only a single mapping at the DB-side.


Access. SQL Node mapping supports retrieval of XML documents, whereas RDB Node mapping supports both retrieval and storage of XML documents. Instead of explicitly accessing the DB-side or the XML-side, access is performed by applying procedures that take a DAD as parameter. Conditions restricting the data may be specified within the DAD file.

6.8. Oracle

First of all, it has to be noted that, in contrast to the other commercial systems, Oracle 9iR2 [32] builds on the object-relational data model for integrating XML. In this way, nested structures of the XML-side are mapped to nested structures at the DB-side, making the process of mapping schemata more intuitive than approaches relying on the flat relational data model, but making it harder to reuse existing relational data. In particular, Oracle supports several mechanisms for integrating XML, comprising the data type XMLType¹², the XML SQL Utility (XSU), and a canonical XML view over the relational DB.

XMLType

Schemata. Using XMLType to decompose XML documents, an object-relational schema at the DB-side is derived from a user-defined XML schema specified by using the XML Schema standard, or vice versa.

Mapping. Mapping knowledge is mainly hard-coded within the application, but may be enhanced by specifying changes to the default mapping rules intermingled within the XML schema specification. Such changes are supported by pre-defined attributes that allow, for instance, elements and attributes to be renamed. Only a single mapping is possible at both sides.

Access. Storage as well as retrieval of XML documents is supported by several functions provided by the data type XMLType, which allow elements and attributes to be referenced by XPath expressions. Although these functions are applied directly to XMLType columns of a relational schema using SQL, conditions are formulated against the XML schema, which therefore constitutes the access target.

XML SQL Utility (XSU)

Schemata. The XML SQL Utility (XSU) is actually a programming interface for Java and PL/SQL for storage and publishing of XML into or out of Oracle, respectively. XSU supports a derived approach by applying default mapping rules to the result of an SQL query. There exists no explicit schema specification at the XML-side, although it is possible to generate a DTD or an XML Schema specification corresponding to the schema derived from the query result.


Mapping. The mapping knowledge is hard-coded within the application and within the query, defined at access time, thus being coupled tightly with the structure at the XML-side. Therefore, it is possible to define multiple mappings at the XML-side, but only a single mapping at the DB-side.

Access. XSU supports storage as well as retrieval of XML documents by several methods, having the DB-side as access target. In case of storage it is not necessary to distinguish whether the DB-side or the XML-side is accessed, since they are required to have the same structure.

Canonical XML view

Schemata. Further, Oracle supports a virtual canonical XML view over all schemata of the DB. Therefore, the schema at the XML-side is derived from the DB schema.

Mapping. The mapping knowledge is hard-coded within the application. Thus, there is a loose coupling to the mapped schemata. Only a single mapping is possible at both sides, resulting from applying a canonical mapping.

Access. The virtual XML view can be used to publish XML using a subset of XPath within URLs.

6.9. SQL Server

Microsoft SQL Server 2000 [53,54] supports several different options for integrating XML with the SQL Server, comprising a FOR XML clause for SQL statements, an annotation mechanism for XML Schema, the OpenXML function to define relational views over XML, and a canonical XML view over the RDB¹³.

FOR XML

Schemata. The FOR XML clause is an extension to SQL and provides four modes to transform query results into XML. Depending on the mode, a derived approach (RAW, AUTO, NESTED modes) as well as a user-defined approach (EXPLICIT mode) are supported. Concerning the latter, the XML schema is defined by specifying a universal table, encoding information about the structure of the XML document using a special syntax for column names of the query result. In this way arbitrary nesting levels are supported, and it can be determined for each attribute of the query result whether it is mapped to an element or to an XML attribute. Further, additional information, like the attribute type ID for an XML attribute, may be specified.

Mapping. In the derived case, mapping knowledge is mainly hard-coded within the application, whereby changes to the default mapping rules may be hard-coded within the query. In the user-defined case, mapping knowledge is hard-coded within the query only. The mapping knowledge is defined at access time and not stored in any way. There is a tight coupling between mapping knowledge and the specification of the structure for the XML document, allowing multiple mappings at the XML-side and only a single mapping at the DB-side.

Access. The FOR XML clause is an extension to the SQL SELECT statement issued against the RDB-side and thus supports publishing of XML documents.

Annotated schemata

Schemata. Another way to map relational data to XML is given by annotated schemata supporting a user-defined approach. XML Schema as well as a variant invented by Microsoft, called XML Data Reduced (XDR), can be used to define a virtual XML view.

Mapping. The mapping is defined by enhancing the XML schema definition with annotations, representing a tight coupling. The annotations specify the relations and attributes to be mapped to, together with foreign-key relationships that have to be considered when retrieving or storing data from or in more than one relation, respectively. In this way, multiple mappings may be specified at the XML-side, but only a single mapping may be specified at the DB-side.

Access. Annotated Schemata support both retrieval and storage of XML documents by accessing the virtual XML view. Retrieval is supported by applying a subset of XPath, whereas storage (insert, update, and delete) is supported by a special mechanism, called “Updategram.” Updategrams are XML documents that describe a “before state,” used for finding the corresponding element or attribute, and an “after state,” used for defining the new values. Since Updategrams demand a special syntax, it is not possible to insert arbitrary XML documents. Updategrams may also be used to update the canonical XML view over the relational schema.
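The before/after idea behind Updategrams can be illustrated with a deliberately simplified model. The sketch below does not use the Updategram XML syntax or any SQL Server API; it represents both states as Python dictionaries and the target relation as a list of rows, and the function name `apply_change` is hypothetical.

```python
def apply_change(rows, before, after):
    """Apply one before/after pair to a list of row dictionaries.

    before is None  -> insert the 'after' row
    after is None   -> delete the row matching 'before'
    both given      -> update the row matching 'before' with 'after'
    """
    if before is None:
        rows.append(dict(after))
        return "insert"
    # The before state identifies exactly one existing row.
    matches = [r for r in rows
               if all(r.get(k) == v for k, v in before.items())]
    if len(matches) != 1:
        raise LookupError("before state must match exactly one row")
    if after is None:
        rows.remove(matches[0])
        return "delete"
    matches[0].update(after)
    return "update"

customers = [{"id": 1, "name": "Ann"}]
first = apply_change(customers, {"id": 1}, {"name": "Anna"})
second = apply_change(customers, None, {"id": 2, "name": "Bob"})
third = apply_change(customers, {"id": 2}, None)
```

The three calls perform an update, an insert, and a delete, in that order. They also show why arbitrary XML documents cannot be stored this way: every change has to be phrased as a pair of states.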

OpenXML

Schemata. Whereas the annotated schemata approach defines XML views over relational data, OpenXML works the other way round, allowing relational views to be defined over XML. For this, an XML document is transformed via DOM into a flat table structure. The OpenXML function may be used within an SQL statement to insert the generated tuples into a relational table, for instance. For this purpose, OpenXML provides the WITH clause, in which the mapping knowledge is defined. OpenXML supports a user-defined approach between the implicit schema of the XML document and the relational schema, actually a single relation.

Mapping. The transformation is either done on the basis of default mapping rules or by explicitly defining the mapping knowledge within the WITH clause of the OpenXML function performing the transformation. For this, XPath expressions are used to reference elements or XML attributes within the XML document and assign them to DB attributes. The mapping knowledge is coupled loosely with the schema specification, and it is possible to define multiple mappings at the XML-side as well as at the DB-side.

Access. The OpenXML function may be used to store XML documents by applying it within an SQL statement issued against the DB.
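The principle of defining a relational view over an XML document by assigning path expressions to columns can be mimicked in a few lines. The sketch below is a Python analogue, not the OpenXML function itself; the function name `shred`, the convention that a path starting with `@` denotes an XML attribute, and the example document are assumptions.

```python
import xml.etree.ElementTree as ET

def shred(xml_text, row_path, columns):
    """Turn an XML document into flat tuples.

    row_path selects the elements that become rows; columns maps each
    column name to "@attr" (XML attribute) or to a child element path.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.findall(row_path):
        row = []
        for path in columns.values():
            if path.startswith("@"):
                row.append(node.get(path[1:]))    # XML attribute value
            else:
                row.append(node.findtext(path))   # text of child element
        rows.append(tuple(row))
    return rows

ORDERS = ("<orders>"
          "<order id='1'><customer>Ann</customer></order>"
          "<order id='2'><customer>Bob</customer></order>"
          "</orders>")

tuples = shred(ORDERS, "order", {"order_id": "@id", "customer": "customer"})
```

The resulting tuples could then be passed to an ordinary SQL INSERT statement, which corresponds to using the generated flat table inside an SQL statement as described above.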

6.10. Summary of results

In the following, the major findings of our evaluation are briefly summarized according to the design alternatives presented in Section 4 (cf. Table 7).

Kind of schemata. Concerning the kind of schemata, about half of the examined approaches support a derived approach, either at the DB-side or at the XML-side; the others allow, similar to X-Ray, a user-defined approach. Regarding the commercial systems, whereas Oracle provides a derived approach only, DB2 and SQLServer also allow a user-defined approach.

Representation of mapping knowledge. Regarding the representation of mapping knowledge, mapping transparency is only violated by a few approaches realized by commercial systems, which require the mapping knowledge to be defined at runtime. Many approaches hard-code mapping knowledge at initialization time within queries, thereby making its maintenance harder. Only five approaches allow the mapping knowledge to be reified within a meta schema, four of them within XML documents; only two of them, namely MXM and X-Ray, allow the storage within a DB.

Coupling with schemata. Most of the investigated approaches support loose coupling of the mapping knowledge with the schemata to be mapped; only SilkRoute and some approaches of the commercial systems adhere to tight coupling.

Mapping cardinality. Considering the mapping cardinality, except for SQLServer’s OpenXML and X-Ray, none of the evaluated approaches supports multiple schemata at the DB-side and multiple schemata at the XML-side. Seven approaches allow multiple mappings at the XML-side, four approaches allow multiple mappings at the DB-side, and five of them allow only a single mapping in both directions.

Access capability. About one third of the approaches investigated aim at both storage and publishing of XML documents within a unified approach.

Access language. Most research prototypes support XML-centric access, whereas commercial products also realize database-centric access.

Access target. Most of the approaches support the XML schema as target of the access; only a few of them use the DB schema. Only two approaches, MXM and X-Ray, allow the mapping knowledge to be queried, which is a result of the fact that both approaches store the mapping knowledge within the DB.

Summing up, the most distinguishing characteristics of X-Ray with respect to most of the compared approaches are the following:

(1) X-Ray ensures mapping transparency and eases the maintenance of mapping knowledge by reifying the mapping knowledge as an instance of a meta schema, stored within a DB.

(2) X-Ray realizes multiple schemata at the DB-side and at the XML-side by allowing multiple relationships between the reified concepts of the schemata to be defined within the meta schema.

(3) X-Ray supports a unified approach, equally allowing XML documents to be stored and published on the basis of an RDBS.

(4) X-Ray provides different access targets, in particular making it possible to reason about the mapping knowledge simply by using queries, since the mapping knowledge is stored within the DB.

7. Summary and future work

This paper discusses several issues relevant when integrating XML and RDBS and proposes X-Ray, an approach that realizes such an integration in a generic way. First, the paper provides an analysis of data model heterogeneity between XML, in terms of DTDs and XML Schema, and the relational data model. In particular, an in-depth investigation of similarities and differences between XML concepts and RDBS concepts is provided, focusing on structuring and typing mechanisms, uniqueness of names, null values and default values, identification, relationships, and order. Second, on the basis of different mapping possibilities between XML and RDBS, a set of mapping patterns is introduced, determining reasonable mappings of XML concepts to RDBS concepts by considering several characteristics of XML element types and XML attributes. Third, design alternatives relevant when integrating XML and RDBS are discussed, comprising the schemata to be mapped, the mapping itself, and the access to the system. Fourth, X-Ray, a generic approach for integrating XML with RDBS, is proposed. X-Ray allows user-defined mappings between independently developed schemata to be realized, supported by mapping patterns, thus preserving the autonomy of the participating DTDs and relational schemata while integrating them in a generic way.

Concerning future work, besides working on a data manipulation language for X-Ray (cf. [12,31,48]), we are currently extending X-Ray towards the XML Schema standard. As discussed in Section 6, existing approaches support XML Schema with respect to the design goals of X-Ray in rather restricted ways. They either support a derived approach or enhance the schema language to incorporate the mapping knowledge intermingled with the XML Schema specification. Consequently, they do not allow a schema to be mapped to multiple schemata at the other side. In addition, it has been shown that they simplify XML Schema concepts or do not support all concepts provided. Also, mapping patterns that describe which XML Schema concepts might be mapped to which relational concepts are not provided. Finally, neither the reification of both schemata nor the storage of the mapping knowledge within a DB is supported, thus preventing a more sophisticated model management as suggested, e.g., by Bernstein et al. [8].

X-Ray will incorporate XML Schema according to the design goals discussed in Section 4. For this, a meta schema has to be designed that incorporates the concepts of the XML Schema standard in an object-oriented way, just as already done for the XML DTD component of our existing meta schema. To support the mapping process, patterns for mapping XML Schema concepts to RDB concepts have to be proposed. These mapping patterns have to be realized within the meta schema of X-Ray so that only reasonable mappings are allowed. Concerning the definition of mapping patterns, we build heavily on existing work on mapping object-oriented models to relational models (cf., e.g., [2,38,44]), since several concepts available in XML Schema are closely related to concepts available in object-oriented models. At the same time, the rich set of concepts provided by XML Schema further increases the heterogeneity with respect to RDBS. One example of these concepts is the inheritance mechanism provided by the XML Schema standard for user-defined simple/complex types, which are used for defining simple/complex elements. We have already defined four different mapping patterns dealing with inheritance of user-defined complex types, representing one of the more complex scenarios when mapping XML Schema to RDBS. These patterns capture, on the one hand, three basic realization alternatives for inheritance in RDBS as proposed in the literature (cf., e.g., [5]) and cover, on the other hand, the notion of dynamic binding, meaning that a composite element is able to contain not only nested elements whose dynamic type equals the static one, but also elements whose dynamic type is derived from the static one.
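The notion of dynamic binding can be made concrete with a small sketch. It assumes one of the realization alternatives commonly described in the object-relational mapping literature, namely a single relation for the whole type hierarchy with a discriminator attribute; whether this corresponds to one of the four X-Ray patterns is not stated here. The type names Address and USAddress, the relation address, and its attributes are hypothetical.

```python
import sqlite3

# Hypothetical complex type hierarchy: USAddress is derived from Address.
DERIVED_FROM = {"Address": None, "USAddress": "Address"}


def is_substitutable(dynamic_type, static_type):
    """True if dynamic_type equals static_type or is derived from it."""
    current = dynamic_type
    while current is not None:
        if current == static_type:
            return True
        current = DERIVED_FROM.get(current)
    return False


def load_addresses():
    """Store instances of both types in one relation with a discriminator."""
    con = sqlite3.connect(":memory:")
    con.execute("""
        CREATE TABLE address (
            id       INTEGER PRIMARY KEY,
            dyn_type TEXT NOT NULL,  -- discriminator: dynamic type of the element
            street   TEXT NOT NULL,  -- declared by the base type
            zip      TEXT            -- added by USAddress, NULL for Address
        )""")
    con.executemany("INSERT INTO address VALUES (?, ?, ?, ?)", [
        (1, "Address", "Favoritenstrasse 9", None),
        (2, "USAddress", "1 Main St", "12345"),
    ])
    return con


def elements_of_static_type(con, static_type):
    """Return (id, dynamic type) of all tuples usable where static_type is declared."""
    rows = con.execute("SELECT id, dyn_type FROM address ORDER BY id").fetchall()
    return [(i, t) for i, t in rows if is_substitutable(t, static_type)]
```

With this layout, a composite element declared to contain Address elements is reconstructed from both tuples, whereas one declared to contain USAddress elements is reconstructed from the second tuple only. The other common alternatives (one relation per type joined by key, or one relation per concrete type with inherited attributes repeated) would change the storage and the query, but not the substitutability check.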

Notes

1. Note that in the following, we use the term relational database system (RDBS) in case the relational data model is used as the basis for integration with XML.

2. Note that in the following, we use a capital letter to denote the XML Schema standard and a small letter to denote any XML schema.

3. It has to be noted that there are also hybrid approaches using, e.g., a mapping approach for the structured parts of a document and the CLOB approach for the unstructured parts (cf., e.g., [20]).

4. Other forms of heterogeneity which would also be relevant in this context, such as “semantic heterogeneity” [63], are not dealt with further in this paper.

5. Note that the XML standard specification [68] does not provide any terminology for such a distinction.

6. Note that although different meanings of null values are discussed in the literature (cf., e.g., [18]) (e.g., the value is inapplicable, exists but is missing, or even its existence is unknown), these differentiations can be expressed neither at the XML-side using DTDs or XML Schema nor at the DB-side using RDBS.

7. Mapping patterns that also support the XML Schema standard are currently under development (cf. Section 7).

8. Note that to reach this goal, we had to carefully design the schemata with respect to each other, although the focus of our approach is on schemata designed relatively independently of each other.

9. An extension of the meta schema to also support the XML Schema standard is currently under development (cf. Section 7).


10. Note that the aim of our approach is neither to store schema-less XML documents nor to provide for a “round-trip engineering” of XML documents by preserving instance-level information such as comments, processing instructions, or document order. Rather, our aim is to deal with XML documents adhering to a certain schema and to allow them to be either generated out of or stored within an RDBS.

11. For a discussion of the benefits and drawbacks of each approach, the reader is referred to [47].

12. Note that XMLType is used both for storing XML documents as a whole within a single attribute and for shredding the XML documents into different attributes.

13. This approach has already been discussed within the context of Oracle.

References

[1] S. Abiteboul, P. Buneman, and D. Suciu, Data on the Web: From Relations to Semistructured Data and XML, Morgan Kaufmann, 2000.

[2] S. W. Ambler, “Mapping objects to relational data,” Ambysoft White Paper, 2003, http://www.ambysoft.com/mappingObjects.html [last access 2003–08–07].

[3] S. Amer-Yahia and D. Srivastava, “A mapping schema and interface for XML stores,” in Fourth ACM CIKM International Workshop on Web Information and Data Management (WIDM’02), Virginia, November 2002.

[4] S. Amer-Yahia, M. Fernandez, R. Greer, and D. Srivastava, “Logical and physical support for heterogeneous data,” in Eleventh Int. ACM Conference on Information and Knowledge Management (CIKM’02), Virginia, November 2002.

[5] P. Atzeni, S. Ceri, S. Paraboschi, and R. Torlone, Database Systems – Concepts, Languages and Architectures, McGraw Hill, 1999.

[6] S. E. Benham, “IBM XML-enabled data management product architecture and technology,” in XML Data Management, Native XML and XML-Enabled Database Systems, A. B. Chaudhri, A. Rashid, and R. Zicari (eds.), Addison-Wesley, 2003.

[7] T. Berners-Lee, R. Fielding, and L. Masinter, “Uniform Resource Identifiers (URI): generic syntax,” Network Working Group, August 1998, http://www.ietf.org/rfc/rfc2396.txt [last access 2003–08–07].

[8] P. A. Bernstein, A. Y. Halevy, and R. A. Pottinger, “A vision for management of complex models,” ACM SIGMOD Record 29(4), 2000.

[9] P. Bohannon, J. Freire, J. Haritsa, M. Ramanath, R. Prasan, and J. Simeon, “Bridging the XML-relational divide with LegoDB: a demonstration,” in Proceedings of ICDE, 2003.

[10] R. Bourret, “XML and databases,” http://www.rpbourret.com/xml/XMLAndDatabases.htm, 2003 [last access 2003–08–07].

[11] R. Bourret, C. Bornhövd, and A. P. Buchmann, “A generic load/extract utility for data transfer between XML documents and relational databases,” in 2nd Int. Workshop on Advanced Issues of EC and Web-Based Information Systems (WECWIS), San Jose, CA, June 2000.

[12] V. Braganholo, S. Davidson, and C. Heuser, “On the updatability of XML views over relational databases,”in Proc. of the 6th Int. Workshop on the Web and Databases (WebDB), San Diego, CA, June 2003.

[13] M. Carey, D. Florescu, Z. Ives, Y. Lu, J. Shanmugasundaram, E. Shekita, and S. Subramanian, “XPERANTO: publishing object-relational data as XML,” in Proc. of the Third International Workshop on the Web and Databases (WebDB), in conjunction with ACM SIGMOD, Dallas, TX, May 2000.

[14] R. G. G. Cattell and D. K. Barry (eds.), The Object Data Standard: ODMG 3.0, Morgan Kaufmann, January 2000.

[15] S. Ceri, P. Fraternali, and S. Paraboschi, “Design principles for data-intensive web sites,” ACM SIGMOD Record 24(1), 1999.

[16] S. Ceri, P. Fraternali, and S. Paraboschi, “XML: current developments and future challenges for the database community,” in Proc. of the 7th Int. Conf. on Extending Database Technology (EDBT), Konstanz, Lecture Notes in Computer Science, Vol. 1777, Springer, 2000.

[17] D. Chamberlin, J. Robie, and D. Florescu, Quilt: An XML Query Language for Heterogeneous Data Sources, Lecture Notes in Computer Science, Springer, December 2000.


[18] E. F. Codd, “Missing information (applicable and inapplicable) in relational databases,” SIGMOD Record 15(4), 1986.

[19] A. Deutsch, M. F. Fernandez, and D. Suciu, “Storing semistructured data in relations,” in Workshop on Query Processing for Semistructured Data and Non-Standard Data Formats, Jerusalem, January 1999.

[20] A. Deutsch, M. F. Fernandez, and D. Suciu, “Storing semistructured data with STORED,” in Proc. of the Int. ACM SIGMOD Conference on Management of Data, Philadelphia, PA, June 1999.

[21] A. Deutsch and V. Tannen, “Reformulation of XML queries and constraints,” in Proc. of the 9th International Conference on Database Theory (ICDT), Siena, Italy, January 2003.

[22] A. Deutsch and V. Tannen, “MARS: a system for publishing XML from mixed and redundant storage,” in Proc. of the 29th Int. Conference on Very Large Databases (VLDB), Berlin, Germany, 2003.

[23] G. Ehmayer, G. Kappel, and S. Reich, “Connecting databases to the web – a taxonomy of gateways,” in Proc. of the 8th Int. Conf. on Database and Expert Systems Applications (DEXA), Toulouse, Lecture Notes in Computer Science, Vol. 1308, Springer, September 1997.

[24] A. Eisenberg and J. Melton, “SQL/XML is making good progress,” SIGMOD Record 31(2), 2002.

[25] M. F. Fernandez, W.-C. Tan, and D. Suciu, “SilkRoute: trading between relations and XML,” in Proc. of the 9th Int. World Wide Web Conf. (WWW), Amsterdam, May 2000.

[26] M. F. Fernandez, Y. Kadiyska, A. Morishima, D. Suciu, and W.-C. Tan, “SilkRoute: a framework for publishing relational data in XML,” ACM Transactions on Database Systems 27(4), 2002.

[27] D. Florescu, A. Levy, and A. Mendelzon, “Database techniques for the world wide web: a survey,” ACM SIGMOD Record 27(3), 1998.

[28] D. Florescu and D. Kossmann, “Storing and querying XML data using an RDBMS,” IEEE Data Engineering Bulletin 22(3), Special Issue on XML, 1999.

[29] J. Funderburk, G. Kiernan, J. Shanmugasundaram, E. Shekita, and C. Wei, “XTABLES: bridging relational technology and XML,” IBM Systems Journal 41(4), 2002.

[30] R. Goldman, J. McHugh, and J. Widom, “From semistructured data to XML: migrating the Lore data model and query language,” in Proc. of the 2nd Int. Workshop on the Web and Databases (WebDB), Philadelphia, PA, June 1999.

[31] Ch. Hiebl, “Implementation of a declarative query and data manipulation language for X-Ray,” Master thesis, Department of Information Systems, Johannes Kepler University of Linz, Austria, 2002.

[32] U. Hohenstein, “Supporting XML in Oracle9i,” in XML Data Management, Native XML and XML-Enabled Database Systems, A. B. Chaudhri, A. Rashid, and R. Zicari (eds.), Addison-Wesley, 2003.

[33] IBM, alphaWorks, “XML Data Mediator,” www.alphaworks.ibm.com/tech/XI [last access 2003–08–07].

[34] IBM, alphaWorks, “XML for tables,” www.alphaworks.ibm.com/tech/xtable [last access 2003–08–07].

[35] Infonyte XML database, http://www.infonyte.com [last access 2003–08–07].

[36] L. Khan, Q. Chen, and Y. Rao, “A comparative study of storing XML data in relational database management systems,” in Proc. of International Conference on Internet Computing, Las Vegas, NV, June 2002, pp. 277–282.

[37] C.-C. Kanne and G. Moerkotte, “Efficient storage of XML data,” in Proc. of the 16th Int. Conf. on Data Engineering (ICDE), San Diego, March 2000.

[38] G. Kappel, S. Preishuber, E. Pröll, S. Rausch-Schott, W. Retschitzegger, R. R. Wagner, and Ch. Gierlinger, “COMan – coexistence of object-oriented and relational technology,” in Proc. of the 13th Int. Conf. on the Entity–Relationship Approach (ER), Manchester, December 1994.

[39] G. Kappel, E. Kapsammer, S. Rausch-Schott, and W. Retschitzegger, “X-Ray – towards integrating XML and relational database systems,” in Proc. of the 19th Int. Conf. on Conceptual Modeling (ER), Salt Lake City, USA, Lecture Notes in Computer Science, Vol. 1920, Springer, 2000.

[40] G. Kappel, E. Kapsammer, and W. Retschitzegger, “Architectural issues for integrating XML and relational database systems – the X-Ray approach,” in Proc. of the Workshop on XML Technologies and Software Engineering (XSE), 23rd Int. Conf. on Software Engineering (ICSE), Toronto, Canada, May 2001.


[41] G. Kappel, E. Kapsammer, and W. Retschitzegger, “XML and relational database systems – a comparison of concepts,” in Proc. of the 2nd Int. Conf. on Internet Computing (IC), CSREA Press, Las Vegas, USA, June 2001.

[42] G. Kappel, E. Kapsammer, and W. Retschitzegger, “X-Ray – towards integrating XML and relational database systems,” Technical Report, Department of Information Systems (IFS), Johannes Kepler University of Linz, Austria, July 2000, http://www.ifs.uni-linz.ac.at/ifs/research/publications/papers00.html [last access 2003–08–07].

[43] G. Kappel, B. Pröll, W. Retschitzegger, and W. Schwinger, “Customisation for ubiquitous web applications – a comparison of approaches,” Int. Journal of Web Engineering and Technology (IJWET), Inaugural Volume, 2003.

[44] W. Keller, “Mapping objects to tables – a pattern language,” in Second European Conference on Pattern Languages of Programming (EuroPlop), Irsee, Germany, July 1997.

[45] R. Krishnamurthy, R. Kaushik, and J. Naughton, “XML–SQL query translation literature: the state of the art and open problems,” in The First XML Database Symposium (XSym03), held in conjunction with VLDB 2003, Berlin, September 2003.

[46] I. Manolescu, D. Florescu, D. Kossmann, F. Xhumari, and D. Olteanu, “Agora: living with XML and relational,” in Proc. of the 26th Int. Conf. on Very Large Data Bases (VLDB), Cairo, Egypt, 2000.

[47] I. Manolescu, D. Florescu, and D. Kossmann, “Answering XML queries over heterogeneous data sources,”in Proc. of the Int. Conf. on Very Large Databases (VLDB), Roma, Italy, 2001.

[48] D. Obasanjo and S. B. Navathe, “A proposal for an XML data definition and manipulation language,” in Proc. of the Workshop on Efficiency and Effectiveness of XML Tools and Techniques and Data Integration over the Web (EEXTT), in conjunction with VLDB 2002, Hong Kong, Lecture Notes in Computer Science, Vol. 2590, Springer, 2002.

[49] Poet Software Corporation, www.poet.com [last access 2003–08–07].

[50] B. Pröll, H. Sighart, W. Retschitzegger, and H. Starck, “Ready for prime time – pre-generation of web pages in TIScover,” in Proc. of the 8th Int. ACM Conference on Information and Knowledge Management (CIKM), Kansas City, MO, November 1999.

[51] M. Ramanath, J. Freire, J. Haritsa, and P. Roy, “Searching for efficient XML-to-relational mappings,” in XML Database Symposium (XSym), in conjunction with VLDB 2003, Berlin, Germany, 2003.

[52] J. Rumbaugh, I. Jacobson, and G. Booch, The Unified Modeling Language Reference Manual, Addison-Wesley, 1999.

[53] M. Rys, “State-of-the-art support in RDBMS: Microsoft SQL server’s XML features,” IEEE Data Engineering Bulletin 24(2), 2001.

[54] M. Rys, “XML support in Microsoft SQL server 2000,” in XML Data Management, Native XML and XML-Enabled Database Systems, A. B. Chaudhri, A. Rashid, and R. Zicari (eds.), Addison-Wesley, 2003.

[55] A. Sahuguet, “Kweelt, the making-of: mistakes made and lessons learned,” Technical Report, Department of Computer and Information Science, University of Pennsylvania, http://db.cis.upenn.edu/DL/kweelt-TR.pdf, November 2000 [last access 2003–08–07].

[56] A. R. Schmidt, M. L. Kersten, M. A. Windhouwer, and F. Waas, “Efficient relational storage and retrieval of XML documents,” in Workshop on the Web and Databases (WebDB), Dallas, May 2000.

[57] H. Schöning and J. Wäsch, “Tamino – an Internet database system,” in Proc. of the 7th Int. Conf. on Extending Database Technology (EDBT), Konstanz, Lecture Notes in Computer Science, Vol. 1777, Springer, 2000.

[58] M. Schrefl, M. Bernauer, E. Kapsammer, B. Pröll, W. Retschitzegger, and T. Thalhammer, “Self-maintaining web pages,” International Journal of Information Systems (IS) 28(8), 2003, 1005–1036.

[59] J. Shanmugasundaram, K. Tufte, G. He, C. Zhang, D. DeWitt, and J. Naughton, “Relational databases for querying XML documents: limitations and opportunities,” in VLDB Conference, September 1999.

[60] J. Shanmugasundaram, E. Shekita, R. Barr, M. Carey, B. Lindsay, H. Pirahesh, and B. Reinwald, “Efficiently publishing relational data as XML documents,” VLDB Journal 10(2–3), 2001.

[61] J. Shanmugasundaram, J. Kiernan, E. Shekita, C. Fan, and J. Funderburk, “Querying XML views of relational data,” in VLDB Conference, September 2001.


[62] K. Shoens, A. Luniewski, P. Schwarz, J. Stamos, and J. Thomas, “The Rufus system: information organization for semi-structured data,” in Proc. of the Int. Conf. on Very Large Data Bases (VLDB), Dublin, Ireland, 1993.

[63] S. Spaccapietra, C. Parent, and Y. Dupont, “Model independent assertions for integration of heterogeneous schemas,” VLDB Journal 1(1), 1992, 81–126.

[64] I. Tatarinov, S. D. Viglas, K. Beyer, J. Shanmugasundaram, and E. Shekita, “Storing and querying ordered XML using a relational database system,” in SIGMOD Conference, June 2002.

[65] F. Tian, D. J. DeWitt, J. Chen, and C. Zhang, “The design and performance evaluation of alternative XML storage strategies,” SIGMOD Record 31(1), 2002.

[66] J. Widom, “Data management for XML – research directions,” IEEE Data Engineering Bulletin 22(3), Special Issue on XML, 1999.

[67] World Wide Web Consortium (W3C), “Namespaces in XML,” W3C Recommendation, January 1999, http://www.w3.org/TR/1999/REC-xml-names-19990114/ [last access 2003–08–07].

[68] World Wide Web Consortium (W3C), “Extensible Markup Language (XML) 1.0 (2nd edition),” W3C Recommendation, October 2000, http://www.w3.org/TR/2000/REC-xml-20001006 [last access 2003–08–07].

[69] World Wide Web Consortium (W3C), “XML Schema,” W3C Recommendation, May 2001, http://www.w3.org/XML/Schema [last access 2003–08–07].

[70] World Wide Web Consortium (W3C), “XML Path Language (XPath) 1.0,” W3C Recommendation, November 1999, http://www.w3.org/TR/xpath [last access 2003–08–07].

[71] World Wide Web Consortium (W3C), “XQuery 1.0: An XML Query Language,” W3C Working Draft, May 2003, http://www.w3.org/TR/xquery [last access 2003–08–07].

[72] World Wide Web Consortium (W3C), “XQuery 1.0 and XPath 2.0 Data Model,” http://www.w3.org/TR/xpath-datamodel [last access 2003–08–07].