Naming in XML Documents - ODBASE’02 Ramon Lawrence, IDEA Lab
Naming in XML Documents
Dr. Ramon Lawrence
IDEA Lab
University of Iowa
[email protected]
Outline
Motivation
Overall Goals
Background
Naming and Ontologies
Semantic Naming of XML Elements
Semantic Querying of Named XML Documents
Support for Document Evolution and Linking
Future Work and Conclusions
Motivation
Motivation #1 - Naming is important despite limited research focus.
Names are a gateway to structure, but can also be used to avoid structure.
Users understand names better than structure, but naming is not considered in many models.
Motivation #2 - XML querying can be improved by minimizing use of path expressions.
XML query languages are complex and heavily structure-based (even more so than SQL).
Path expressions resemble navigation in hierarchical models, which proved undesirable.
Queries cannot adapt to document changes.
Are Names Really That Important?
Poorly Named DTD:
<!ELEMENT LM (M+)>
<!ELEMENT M (MN, MO+)>
<!ELEMENT MN (#PCDATA)>
<!ELEMENT MO (N, Y, F, S, R, V+)>
<!ELEMENT N (#PCDATA)>
<!ELEMENT Y (#PCDATA)>
<!ELEMENT F (#PCDATA)>
<!ELEMENT S (#PCDATA)>
<!ELEMENT R (#PCDATA)>
<!ELEMENT V (C, P, VN, O+)>
<!ELEMENT C (#PCDATA)>
<!ELEMENT P (#PCDATA)>
<!ELEMENT VN (#PCDATA)>
<!ELEMENT O (#PCDATA)>
DTD with Decent Naming:
<!ELEMENT list-manufacturer (manufacturer+)>
<!ELEMENT manufacturer (mn-name, model+)>
<!ELEMENT mn-name (#PCDATA)>
<!ELEMENT model (mo-name, year, front-rating, side-rating, rank, vehicle+)>
<!ELEMENT mo-name (#PCDATA)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT front-rating (#PCDATA)>
<!ELEMENT side-rating (#PCDATA)>
<!ELEMENT rank (#PCDATA)>
<!ELEMENT vehicle (color, price, vendorName, option+)>
<!ELEMENT color (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT vendorName (#PCDATA)>
<!ELEMENT option (#PCDATA)>
Overall Goal
The overall goal is to develop a naming methodology for XML tags that has two desirable properties:
1) Provides more semantics and context information to users.
2) Allows semantic querying of XML documents to simplify query formulation and handle document evolution.
The naming methodology must NOT enforce a strict standard on naming, but encourage better naming by providing a useful technique.
Background: XML Tag Names and Standards
The development of standard tag sets for given problem domains has been the focus of many organizations.
ebXML, RosettaNet, CML, XFRML, MathML
Our goal is not to define THE tag set for all XML, but rather to suggest a methodology for constructing tag sets.
Applicable to the Semantic Web effort.
Background: XML Querying
Many XML query languages have been proposed:
LOREL, XML-QL, XML-GL, XSL, XQL
Even the graphical XML query language, XML-GL, only supports querying with path expressions.
Why would we go back in time and make querying harder for the user?
The relational model replaced the hierarchical model because of its declarative query syntax.
Running Example
Converting the ER Model to XML
Modeling in XML requires a decision on how to hierarchically organize the information in the XML document.
Once selected, the hierarchical organization becomes the only view of the data and requires the user to formulate queries based on the hierarchy chosen.
Nesting of elements in XML has ambiguous semantics as the nesting may represent:
specialization/generalization (IS-A), Part-Of/HAS-A, ordering, grouping, general relationship (join)
Without tag names, it is impossible to determine the relationship between nested elements.
Two XML DTDs for ER Diagram (1)
DTD1:
<!ELEMENT list-manufacturer (manufacturer+)>
<!ELEMENT manufacturer (mn-name, model+)>
<!ELEMENT mn-name (#PCDATA)>
<!ELEMENT model (mo-name, year, front-rating, side-rating, rank, vehicle+)>
<!ELEMENT mo-name (#PCDATA)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT front-rating (#PCDATA)>
<!ELEMENT side-rating (#PCDATA)>
<!ELEMENT rank (#PCDATA)>
<!ELEMENT vehicle (color, price, vendorName, option+)>
<!ELEMENT color (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT vendorName (#PCDATA)>
<!ELEMENT option (#PCDATA)>
Two XML DTDs for ER Diagram (2)
DTD2:
<!ELEMENT list-vendor (vendor+)>
<!ELEMENT vendor (vendorName, vehicle+)>
<!ELEMENT vendorName (#PCDATA)>
<!ELEMENT vehicle (color, price, op-name+, mn-name, model)>
<!ELEMENT color (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT op-name (#PCDATA)>
<!ELEMENT model (mo-name, year, front-rating, side-rating, rank)>
<!ELEMENT mo-name (#PCDATA)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT front-rating (#PCDATA)>
<!ELEMENT side-rating (#PCDATA)>
<!ELEMENT rank (#PCDATA)>
Differences:
1) Different hierarchical organization
2) Different modeling of manufacturer (mn-name)
3) Naming differences (op-name)
A Simple Query on Both DTDs
Query:
Return the manufacturer name and vehicle price for all vehicles with price < $30,000 and the vehicle model is in the top 10 for safety tests.
DTD1:
select M.mn-name, M.model.vehicle.price
from list-manufacturer.manufacturer M
where M.model.rank <= 10 and M.model.vehicle.price < 30000
DTD2:
select V.mn-name, V.price
from list-vendor.vendor.vehicle V
where V.model.rank <= 10 and V.price < 30000
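The structure dependence of the two queries above can be reproduced in a few lines of Python. This is only a sketch: the sample data (Acme, City Cars, the prices and ranks) is made up, only the element names come from DTD1 and DTD2, and the standard-library xml.etree.ElementTree stands in for LOREL. Each layout needs its own navigation code to answer the same question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; only the element names come from DTD1 and DTD2.
DOC1 = """
<list-manufacturer>
  <manufacturer>
    <mn-name>Acme</mn-name>
    <model>
      <mo-name>Roadster</mo-name><year>2002</year>
      <front-rating>5</front-rating><side-rating>4</side-rating><rank>3</rank>
      <vehicle><color>red</color><price>25000</price>
        <vendorName>City Cars</vendorName><option>sunroof</option></vehicle>
      <vehicle><color>blue</color><price>35000</price>
        <vendorName>City Cars</vendorName><option>leather</option></vehicle>
    </model>
  </manufacturer>
</list-manufacturer>
"""

DOC2 = """
<list-vendor>
  <vendor>
    <vendorName>City Cars</vendorName>
    <vehicle><color>red</color><price>25000</price><op-name>sunroof</op-name>
      <mn-name>Acme</mn-name>
      <model><mo-name>Roadster</mo-name><year>2002</year>
        <front-rating>5</front-rating><side-rating>4</side-rating><rank>3</rank></model>
    </vehicle>
    <vehicle><color>blue</color><price>35000</price><op-name>leather</op-name>
      <mn-name>Acme</mn-name>
      <model><mo-name>Roadster</mo-name><year>2002</year>
        <front-rating>5</front-rating><side-rating>4</side-rating><rank>3</rank></model>
    </vehicle>
  </vendor>
</list-vendor>
"""


def query_dtd1(root):
    """Navigate manufacturer -> model -> vehicle, as the DTD1 query does."""
    rows = []
    for manufacturer in root.findall("manufacturer"):
        for model in manufacturer.findall("model"):
            if int(model.findtext("rank")) > 10:
                continue
            for vehicle in model.findall("vehicle"):
                price = int(vehicle.findtext("price"))
                if price < 30000:
                    rows.append((manufacturer.findtext("mn-name"), price))
    return rows


def query_dtd2(root):
    """Navigate vendor -> vehicle -> model, as the DTD2 query does."""
    rows = []
    for vehicle in root.findall("vendor/vehicle"):
        price = int(vehicle.findtext("price"))
        if int(vehicle.findtext("model/rank")) <= 10 and price < 30000:
            rows.append((vehicle.findtext("mn-name"), price))
    return rows


result1 = query_dtd1(ET.fromstring(DOC1))
result2 = query_dtd2(ET.fromstring(DOC2))
```

Both functions return the same rows, but neither can be reused on the other document: the paths are tied to the hierarchy each DTD chose.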
Ontologies and Naming
Assume the existence of some ontology to extract terms with definitions.
May use WordNet or problem-specific ontology.
Assumption: Human users have a “built-in” ontology, or view of the world, based on their experience and knowledge of the language.
By selecting common terms from a shared dictionary (language), both the producer (XML document source) and the consumer (XML document user) will understand the semantics of a data element from the terms used to define its name.
Caveat: Understanding is to some degree of accuracy. (Hopefully >= 90%).
Ontologies and Naming (2)
Assumption #2: As more context information is provided by the producer (in the form of additional terms), the consumer is more confident that their world view is consistent with that of the producer.
The consumer understands the producer's view even if they do not originally share the same view.
Important: At no time is there intelligence demonstrated by software. The intelligence is embedded into the names assigned by the producer, and extracted by the consumer.
The system never needs to build its own world view to aid the users in reconciling theirs.
Structure of a Semantic Name
A semantic name is a tag name for an XML element of the following form:
semantic_name ::= [CT_Term] | [CT_Term].PN
CT_Term ::= CT | CT ; CT_Term | CT , CT_Term
CT ::= <dictionary term>
PN ::= <dictionary term>
A semantic name is intended to capture structure-independent semantics by combining multiple dictionary terms.
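The grammar for semantic names can be checked mechanically. The sketch below is one possible recognizer, not the author's implementation. Two assumptions are made: a dictionary term is a run of letters and digits, and the two separators (';' and ',') are simply recorded, since the slides do not spell out how they differ. The names SemanticName and parse_semantic_name are hypothetical.

```python
import re
from typing import NamedTuple, Optional, Tuple


class SemanticName(NamedTuple):
    contexts: Tuple[str, ...]    # CT terms, in the order written
    separators: Tuple[str, ...]  # ';' or ',' between consecutive CT terms
    prop: Optional[str]          # PN term, or None for a name of the form [CT_Term]


# Assumption: a dictionary term is letters and digits, starting with a letter.
_TERM = r"[A-Za-z][A-Za-z0-9]*"
_NAME_RE = re.compile(rf"\[({_TERM}(?:[;,]{_TERM})*)\](?:\.({_TERM}))?")


def parse_semantic_name(text: str) -> SemanticName:
    """Parse '[CT;CT;...]' or '[CT;CT;...].PN' into its parts."""
    match = _NAME_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a semantic name: {text!r}")
    body, prop = match.group(1), match.group(2)
    # Splitting on a capturing group keeps the separators in the result:
    # terms land at even indexes, separators at odd indexes.
    parts = re.split(r"([;,])", body)
    return SemanticName(tuple(parts[0::2]), tuple(parts[1::2]), prop)


rank = parse_semantic_name("[Manufacturer;Model;NHSCTest].Rank")
vehicle = parse_semantic_name("[Vehicle]")
```

Here rank.contexts holds the three context terms and rank.prop is "Rank", while vehicle.prop is None because the name has no property part.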
DTD1 with Semantic Naming
<!ELEMENT V (Manufacturer+)>
<!ELEMENT Manufacturer (Manufacturer--Name, Manufacturer-Model+)>
<!ELEMENT Manufacturer--Name (#PCDATA)>
<!ELEMENT Manufacturer-Model (Manufacturer-Model--Name, Manufacturer-Model--Year, Manufacturer-Model-NHSCTest--FrontRating, Manufacturer-Model-NHSCTest--SideRating, Manufacturer-Model-NHSCTest--Rank, Vehicle+)>
<!ELEMENT Manufacturer-Model--Name (#PCDATA)>
<!ELEMENT Manufacturer-Model--Year (#PCDATA)>
<!ELEMENT Manufacturer-Model-NHSCTest--FrontRating (#PCDATA)>
<!ELEMENT Manufacturer-Model-NHSCTest--SideRating (#PCDATA)>
<!ELEMENT Manufacturer-Model-NHSCTest--Rank (#PCDATA)>
<!ELEMENT Vehicle (Vehicle--Color, Vehicle--Price, Vendor--Name, Vehicle-Option--Name+)>
<!ELEMENT Vehicle--Color (#PCDATA)>
<!ELEMENT Vehicle--Price (#PCDATA)>
<!ELEMENT Vehicle-Option--Name (#PCDATA)>
<!ELEMENT Vendor--Name (#PCDATA)>
Callouts on the slide:
<!ELEMENT Manufacturer-Model-NHSCTest--Rank (#PCDATA)>
<!ELEMENT Vehicle--Price (#PCDATA)>
Name is context-independent.
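The DTD above writes semantic names in a form that is legal as an XML tag. Comparing the tags with the bracket notation used elsewhere in the talk suggests a convention in which a single hyphen separates context terms and a double hyphen introduces the property name. That reading is an assumption inferred from the slides, and the helper below (hypothetical name tag_to_semantic_name) is only a sketch of it.

```python
def tag_to_semantic_name(tag: str) -> str:
    """Rewrite 'A-B--P' as '[A;B].P' and 'A-B' as '[A;B]'.

    Assumed convention: '-' separates context terms, '--' precedes the property.
    """
    context_part, sep, prop = tag.partition("--")
    name = "[" + ";".join(context_part.split("-")) + "]"
    return f"{name}.{prop}" if sep else name


rank_name = tag_to_semantic_name("Manufacturer-Model-NHSCTest--Rank")
price_name = tag_to_semantic_name("Vehicle--Price")
model_name = tag_to_semantic_name("Manufacturer-Model")
```

Under this assumption the three calls give "[Manufacturer;Model;NHSCTest].Rank", "[Vehicle].Price", and "[Manufacturer;Model]".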
Semantic Querying
Using semantic tag names introduces a tradeoff between increased semantic description and longer tag names.
Path expressions are difficult to formulate and complicate XML querying.
Since semantic names are structure independent, queries can be posed without using path expressions.
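A small illustration of a lookup without path expressions follows. It is a sketch under stated assumptions: the two documents are made-up samples that nest the same semantically named elements in different ways, the tag spelling follows the semantically named DTD1, and the helper values_of is hypothetical. It only shows single-element lookup by name; relating several elements to each other still needs the mapping step described later in the talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical samples: same semantic tag names, different nesting.
DOC_A = (
    "<V><Manufacturer><Manufacturer--Name>Acme</Manufacturer--Name>"
    "<Manufacturer-Model><Vehicle><Vehicle--Price>25000</Vehicle--Price>"
    "</Vehicle></Manufacturer-Model></Manufacturer></V>"
)
DOC_B = (
    "<V><Vendor><Vendor--Name>City Cars</Vendor--Name>"
    "<Vehicle><Vehicle--Price>25000</Vehicle--Price>"
    "<Manufacturer--Name>Acme</Manufacturer--Name></Vehicle></Vendor></V>"
)


def values_of(document: str, semantic_tag: str):
    """Return the text of every element with this tag, wherever it is nested."""
    root = ET.fromstring(document)
    return [element.text for element in root.iter(semantic_tag)]


prices_a = values_of(DOC_A, "Vehicle--Price")
prices_b = values_of(DOC_B, "Vehicle--Price")
```

The same call works on both documents because the name, not the position, identifies the element.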
A Context View
A context view is a structure-independent hierarchy of concepts in the XML document.
The hierarchy is constructed automatically from the tag names in the XML document/DTD.
Users query the context view, and their queries are mapped to LOREL queries on the XML documents.
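One way to picture the automatic construction is to treat each semantic name as a path of terms and merge the paths into a tree. The sketch below does that with nested dictionaries. It is an illustration rather than the author's algorithm: the function name build_context_view is hypothetical, only the ';' separator is handled, and the input list is taken from the semantic names used for the running example.

```python
from typing import Dict, Iterable


def build_context_view(names: Iterable[str]) -> Dict[str, dict]:
    """Merge semantic names such as '[A;B].P' into a nested dict A -> B -> P."""
    view: Dict[str, dict] = {}
    for name in names:
        context_part, _, prop = name.partition("].")
        terms = context_part.strip("[]").split(";")
        if prop:
            terms.append(prop)
        node = view
        for term in terms:
            # Reuse the node if an earlier name already created it.
            node = node.setdefault(term, {})
    return view


view = build_context_view([
    "[Manufacturer]",
    "[Manufacturer].Name",
    "[Manufacturer;Model].Name",
    "[Manufacturer;Model].Year",
    "[Manufacturer;Model;NHSCTest].FrontRating",
    "[Manufacturer;Model;NHSCTest].SideRating",
    "[Manufacturer;Model;NHSCTest].Rank",
    "[Vehicle].Color",
    "[Vehicle].Price",
    "[Vehicle;Option].Name",
    "[Vendor].Name",
])
```

The resulting tree has Manufacturer, Vehicle, and Vendor at the top level regardless of how the source DTD nests them.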
Building the Context View
[Figure: context view tree. Nodes: Vendor (Name); Vehicle (Option (Name), Color, Price); Manufacturer (Name, Model (Name, Year, NHSC Test (Rank, Front Rating, Side Rating))). Callouts: [Manufacturer], [Manufacturer].Name, [Manufacturer;Model], [Vehicle].]
Querying the Context View
Return the manufacturer name and vehicle price for vehicles with price < $30,000 and the vehicle model is in the top 10 for safety tests.
[Figure: the context view tree with the query marked on its nodes: Manufacturer Name (return); Vehicle Price < 30000 (return); Model NHSC Test Rank <= 10.]
Mapping to DTD1
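[Figure: DTD1 element tree. list-manufacturer contains manufacturer; manufacturer contains mn-name and model; model contains mo-name, year, front-rating, side-rating, rank, and vehicle; vehicle contains color, price, vendorName, and option.]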
Semantic Naming in DTD1
[Figure: DTD1 tree with each node labeled by its semantic name.]
V
[Manufacturer]
[Manufacturer].Name
[Manufacturer;Model]
[Manufacturer;Model].Name
[Manufacturer;Model].Year
*(FR)
[Manufacturer;Model;NHSCTest].SideRating
[Manufacturer;Model;NHSCTest].Rank
[Vehicle]
[Vehicle].Color
[Vehicle].Price
[Vehicle;Option].Name
[Vendor].Name
Query Mapping to DTD1
[Figure: the DTD1 tree labeled with semantic names, with the query terms marked: [Manufacturer].Name (return); [Manufacturer;Model;NHSCTest].Rank; [Vehicle].Price < 30000 (return).]
Mapping to DTD2
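[Figure: DTD2 element tree. list-vendor contains vendor; vendor contains vendorName and vehicle; vehicle contains color, price, op-name, mn-name, and model; model contains mo-name, year, front-rating, side-rating, and rank.]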
Mapping to DTD2
[Figure: DTD2 tree with each node labeled by its semantic name and the query terms marked.]
V
[Vendor]
[Vendor].Name
[Vehicle]
[Vehicle].Color
[Vehicle].Price
[Vehicle;Option].Name
[Manufacturer].Name
[Manufacturer;Model]
[Manufacturer;Model].Name
[Manufacturer;Model].Year
[Manufacturer;Model;NHSCTest].FrontRating
[Manufacturer;Model;NHSCTest].SideRating
[Manufacturer;Model;NHSCTest].Rank
Query terms:
[Manufacturer].Name (return)
[Vehicle].Price < 30000 (return)
[Manufacturer;Model;NHSCTest].Rank <= 10
Mapping Algorithm
Perform a breadth-first traversal of DTD x to build a mapping table T.
Each entry in T contains a tag name tn, and a set of path expressions P. Each p in P provides a path in DTD x to element named tn.
If DTD x is a tree, each tn has a unique path. If DTD x is a graph, there may be multiple possible paths.
Can return path union or get user input.
After all path mappings have been determined, build a spanning tree connecting paths.
Unique spanning tree for tree DTDs, may have multiple spanning trees for graph DTDs.
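The steps above can be sketched as follows. This is a minimal illustration under assumptions, not the author's code: the DTD is hand-transcribed into a dictionary from element name to child element names (a real implementation would parse the DTD), the function names are hypothetical, and the "spanning tree" step is read as the union of the edges on the chosen paths, which is only guaranteed to be a tree when the DTD itself is a tree.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple

Path = Tuple[str, ...]

# DTD1 content models, transcribed by hand: element -> child elements.
DTD1: Dict[str, List[str]] = {
    "list-manufacturer": ["manufacturer"],
    "manufacturer": ["mn-name", "model"],
    "model": ["mo-name", "year", "front-rating", "side-rating", "rank", "vehicle"],
    "vehicle": ["color", "price", "vendorName", "option"],
}


def build_mapping_table(dtd: Dict[str, List[str]], root: str) -> Dict[str, List[Path]]:
    """Breadth-first traversal recording every path from the root to each tag."""
    table: Dict[str, List[Path]] = {}
    queue = deque([(root,)])
    while queue:
        path = queue.popleft()
        tag = path[-1]
        table.setdefault(tag, []).append(path)
        for child in dtd.get(tag, []):
            if child not in path:  # do not loop forever on recursive DTDs
                queue.append(path + (child,))
    return table


def connecting_edges(paths: Iterable[Path]) -> Set[Tuple[str, str]]:
    """Union of parent-child edges along the given paths."""
    edges: Set[Tuple[str, str]] = set()
    for path in paths:
        edges.update(zip(path, path[1:]))
    return edges


table = build_mapping_table(DTD1, "list-manufacturer")
query_paths = [table[tag][0] for tag in ("mn-name", "price", "rank")]
edges = connecting_edges(query_paths)
```

For the running query, table maps price to the single path list-manufacturer, manufacturer, model, vehicle, price, and edges holds the six parent-child links that connect mn-name, price, and rank. In a graph DTD an entry may hold several paths, which is where the path union or user input mentioned above comes in.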
Conclusions
Naming is important because names for structures are a user’s first contact with structural data representations.
Naming can be exploited to hide the structure by embedding more information into names.
The names assigned to XML elements have been standardized within organizations, but no work has been done on examining what constitutes good names.
By using names that are structure-independent, semantic querying is possible.
Semantic querying does not use path expressions. Semantic queries support document evolution.
Future Work
Test performance and cost with renaming on real-world XML document sets.
Does the increased XML document size affect query performance?
Develop formal query algebra for semantic queries.
References
Publications:
Using Unity to Semi-Automatically Integrate Relational Schema, Demonstration at ICDE’2002.
Querying Relational Databases without Explicit Joins, R. Lawrence and K. Barker, DASWIS 2001.
Integrating Relational Database Schemas using a Standardized Dictionary, SAC’2001 - ACM Symposium on Applied Computing, March, 2001.
Multidatabase Querying by Context, R. Lawrence and K. Barker, DataSem2000, pg 127-136, Oct. 2000.
Further Information:
http://www.cs.uiowa.edu/~rlawrenc/
http://idealab.cs.uiowa.edu