Post on 21-Dec-2015
1
Indexing and Querying XML Data for Regular Path Expressions
A Paper by Quanzhong Li and Bongki Moon
Presented by Amnon Shochot
2
Our Objective
• Developing a system that will enable us to perform XML data queries efficiently.
3
XML Query Languages
• Used for retrieving data from XML files.
• Use a regular path expression syntax.
• e.g. XPath, XQuery.
4
Queries Today - Inefficient
• Usually XML tree traversals, which are inefficient:
– Top-Down Approach
– Bottom-Up Approach
– An example: the query
/chapter/_*/figure
(finding all figures in all chapters).
5
Our Objective - Refined
• Developing a system that will enable us to perform XML data queries efficiently
• Developing such a system consists of:
– Developing a way to efficiently store XML data.
– Developing efficient algorithms for processing regular path expressions (e.g. XQuery expressions).
6
Storing XML Documents
• Question: What would we need from a data structure to be able to perform an efficient query?
• Answer: A mechanism for:
– Efficiently finding all elements/attributes with a given name.
– Efficiently finding all values with a given name.
– Efficiently resolving the ancestor-descendant relationship.
7
Storing XML Documents - XISS
• XISS - XML Indexing and Storage System.
• Provides us with ways to:
– efficiently find all elements or attributes with the same name string, grouped by the document to which they belong.
– quickly determine the ancestor-descendant relationship between elements and/or attributes in the hierarchy of XML data.
8
Determining Ancestor-Descendent Relationship
• According to Dietz’s numbering scheme: for two given nodes x and y of a tree T, x is an ancestor of y iff x occurs before y in the preorder traversal and after y in the postorder traversal.
• Example:
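A minimal sketch of this test in Python. The `Node` class, the function names, and the tiny chapter/section/figure tree are illustrative assumptions, not taken from the paper.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list[Node] = field(default_factory=list)
    pre: int = 0   # rank in the preorder traversal
    post: int = 0  # rank in the postorder traversal


def number_tree(root: Node) -> None:
    """Assign preorder and postorder ranks in one depth-first walk."""
    pre_counter = 0
    post_counter = 0

    def visit(node: Node) -> None:
        nonlocal pre_counter, post_counter
        pre_counter += 1
        node.pre = pre_counter          # numbered before its children
        for child in node.children:
            visit(child)
        post_counter += 1
        node.post = post_counter        # numbered after its children

    visit(root)


def is_ancestor(x: Node, y: Node) -> bool:
    """x is an ancestor of y iff x comes first in preorder and last in postorder."""
    return x.pre < y.pre and x.post > y.post


# Hypothetical document: chapter -> (title, section -> figure)
figure = Node("figure")
section = Node("section", [figure])
title = Node("title")
chapter = Node("chapter", [title, section])
number_tree(chapter)

print(is_ancestor(chapter, figure))  # True
print(is_ancestor(title, figure))    # False
```

The check itself is two integer comparisons, which is why it runs in constant time once the ranks are stored.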
9
Determining Ancestor-Descendent Relationship – cont.
• Advantage: the ancestor-descendent relationship can be determined in constant time.
• Disadvantage: a lack of flexibility.
– e.g. inserting a new node requires recomputing the numbers of many tree nodes.
10
Determining Ancestor-Descendent Relationship – cont.
• A new numbering scheme:
– Each node is associated with an <order, size> pair.
– For a tree node y and its parent x, the interval of y lies inside the interval of x (the left end of x's interval is exclusive):
$[\mathrm{order}(y),\ \mathrm{order}(y) + \mathrm{size}(y)] \subset (\mathrm{order}(x),\ \mathrm{order}(x) + \mathrm{size}(x)]$
– For two sibling nodes x and y, if x is the predecessor of y in preorder traversal, then:
$\mathrm{order}(x) + \mathrm{size}(x) < \mathrm{order}(y)$
11
Determining Ancestor-Descendent Relationship – cont.
• Fact: for two given nodes x and y of a tree T, x is an ancestor of y iff:
$\mathrm{order}(x) < \mathrm{order}(y) \le \mathrm{order}(x) + \mathrm{size}(x)$
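The sketch below shows one way to assign <order, size> pairs that satisfy these conditions, followed by the constant-time ancestor test. The `slack` parameter (unused numbers reserved for later insertions) and the assignment strategy are assumptions made for illustration; the scheme only requires that the interval conditions hold.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)
    order: int = 0
    size: int = 0


def assign_numbers(node, order=1, slack=0):
    """Number the subtree rooted at `node`; return the right end of its interval."""
    node.order = order
    end = order  # right end of the last interval used inside this subtree
    for child in node.children:
        # Leave `slack` unused values before each child.
        end = assign_numbers(child, end + 1 + slack, slack)
    # Leave `slack` unused values after the last child as well.
    node.size = end - node.order + slack
    return node.order + node.size


def is_ancestor(x, y):
    """order(x) < order(y) <= order(x) + size(x)"""
    return x.order < y.order <= x.order + x.size


# Hypothetical tree: Ele(Att, Ele(Att)), attributes numbered before child elements.
inner_att = Node("Att")
inner = Node("Ele", [inner_att])
outer_att = Node("Att")
outer = Node("Ele", [outer_att, inner])

assign_numbers(outer)  # tight numbering, no room for insertions
tight = [(n.order, n.size) for n in (outer, outer_att, inner, inner_att)]
print(tight)  # [(1, 3), (2, 0), (3, 1), (4, 0)]

assign_numbers(outer, slack=10)  # sparse numbering with room to grow
print(is_ancestor(outer, inner_att))  # True
print(is_ancestor(outer_att, inner))  # False
```

With a non-zero `slack`, a new node can usually be given an unused order value between its neighbours without touching any existing pair.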
12
Determining Ancestor-Descendent Relationship – cont.
• Properties:
– the ancestor-descendent relationship can be determined in constant time.
– flexibility – node insertion usually doesn’t require recomputation of tree nodes.
– an element can be uniquely identified in a document by its order value.
13
XISS System Overview
14
XISS System Overview
• How the system works:
– XML documents are loaded into the XISS system.
– These documents are added to the XISS data structures.
  • Each document is assigned a document id (did).
  • Index structures are organized as paged files for efficient disk IO.
– When a query is performed, the query processor interacts with XISS in order to obtain the information required for the query.
15
XISS - cont.
• XISS consists of 5 components:
– Name Index
– Value Table
– Element Index
– Attribute Index
– Structure Index
16
Name Index and Value Table
• Objective: minimizing the storage and computation overhead by eliminating replicated strings and string comparisons.
• Name Index - mapping distinct name strings into unique name identifiers (nid).
• Value Table - mapping distinct value strings (i.e. attribute value and text value) into unique value identifiers (vid).
• Both are implemented as B+-trees.
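A minimal sketch of the string-to-identifier mapping. It uses an in-memory Python dict as a stand-in for the B+-tree, and the class and method names are hypothetical.

```python
class StringTable:
    """Maps each distinct string to a small integer id (a nid or a vid)."""

    def __init__(self):
        self._ids = {}       # string -> id (stand-in for the B+-tree)
        self._strings = []   # id -> string

    def intern(self, text):
        """Return the id of `text`, assigning a new one on first sight."""
        if text not in self._ids:
            self._ids[text] = len(self._strings)
            self._strings.append(text)
        return self._ids[text]

    def lookup(self, ident):
        """Return the string stored under `ident`."""
        return self._strings[ident]


name_index = StringTable()
nid_chapter = name_index.intern("chapter")
nid_figure = name_index.intern("figure")

# A repeated name gets the same nid, so later stages compare integers, not strings.
print(name_index.intern("chapter") == nid_chapter)  # True
```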
17
The Element Index
• Objective: quickly finding all elements with the same name string.
• Structure:
18
The Element Index – cont.
• Structure:
– B+-tree using nid as a key.
– Leaf nodes: pointers to a set of records for elements (or attributes) having an identical name string, grouped by the document they belong to.
– Element Record = {<order,size>, Depth, Parent ID}
  • where Depth is the depth of the element in the XML tree.
– Element Records are ordered by <order,size>.
19
The Attribute Index
• Objective: quickly finding all attributes with the same name string.
• Structure:
– Same structure as the Element Index, except that the record in the attribute index has a value identifier (vid), which is a key used to obtain the attribute value from the value table.
20
The Structure Index
• Objectives:
– Finding the parent element and child elements (or attributes) for a given element.
– Finding the parent element for a given attribute.
• Structure:
21
The Structure Index – cont.
• Structure:
– B+-tree using document identifier (did) as a key.
– Leaf nodes: linear arrays with records for all elements and attributes from an XML document.
– Each record: {nid, <order,size>, Parent order, Child order, Sibling order, Attribute order}.
– Records are ordered by order value.
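A rough sketch of one document's record array and the two lookups it supports. It assumes that Child order is the order of the first child element, Sibling order the order of the next sibling, and Attribute order the order of the first attribute; the slide does not spell these out, and all names here are hypothetical.

```python
from bisect import bisect_left
from typing import NamedTuple, Optional


class StructRecord(NamedTuple):
    nid: int
    order: int
    size: int
    parent: Optional[int]     # order of the parent element
    child: Optional[int]      # assumption: order of the first child element
    sibling: Optional[int]    # assumption: order of the next sibling
    attribute: Optional[int]  # assumption: order of the first attribute


class DocumentStructure:
    """Linear array of one document's records, sorted by order value."""

    def __init__(self, records):
        self._records = sorted(records, key=lambda r: r.order)
        self._orders = [r.order for r in self._records]

    def get(self, order):
        i = bisect_left(self._orders, order)  # binary search on the sorted array
        if i == len(self._orders) or self._orders[i] != order:
            raise KeyError(order)
        return self._records[i]

    def parent(self, order):
        p = self.get(order).parent
        return None if p is None else self.get(p)

    def children(self, order):
        result = []
        nxt = self.get(order).child
        while nxt is not None:  # follow the sibling chain
            rec = self.get(nxt)
            result.append(rec)
            nxt = rec.sibling
        return result


ELE, ATT = 0, 1  # hypothetical nids
doc = DocumentStructure([
    StructRecord(ELE, 1, 3, None, 3, None, 2),
    StructRecord(ATT, 2, 0, 1, None, None, None),
    StructRecord(ELE, 3, 1, 1, None, None, 4),
    StructRecord(ATT, 4, 0, 3, None, None, None),
])
print([r.order for r in doc.children(1)])  # [3]
print(doc.parent(4).order)                 # 3
```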
22
Querying Method
• Decomposing path expressions into simple path expressions.
• Applying algorithms on simple path expressions and their intermediate results.
23
Decomposition of Path Expressions
• The main idea:
– A complex path expression is decomposed into several simple path expressions.
– Each simple path expression produces an intermediate result that can be used in the subsequent stage of processing.
– The results of the simple path expressions are then combined or joined together to obtain the final result of the given query.
24
Basic Subexpressions - Example
Decomposition of
(E1/E2)*/ E3 / ((E4[@a=V]) | (E5/_*/E6)):
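(1) Single Element/Attribute
(2) Element-Attribute
(3) Element-Element
(4) Kleene Closure
(5) Union
[Figure: decomposition tree of the expression. Each node is labeled with its subexpression type: seven leaves of type (1), and internal nodes of type (2), (3) (four times), (4) and (5).]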
25
Basic Subexpressions
5 basic subexpressions:
(1) A subexpression with a single element or a single attribute.
(2) A subexpression with an element and an attribute.
• e.g. figure[@caption = "Tree Frogs"]
(3) A subexpression with two elements.
• e.g. chapter/_*/figure where ‘_’ denotes any kind of node.
26
Basic Subexpressions - cont.
5 basic subexpressions - cont.:
(4) A subexpression that is a Kleene closure (+,*) of another subexpression.
(5) A subexpression that is a union of two other subexpressions.
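To make the five kinds concrete, the sketch below models them as small Python classes and builds the decomposition of (E1/E2)*/E3/((E4[@a=V]) | (E5/_*/E6)). The class names and the left-to-right grouping of the `/` steps are assumptions for illustration only.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Single:      # (1) a single element or attribute
    name: str


@dataclass(frozen=True)
class ElemAttr:    # (2) an element and an attribute, e.g. E4[@a=V]
    element: object
    attribute: object
    value: str


@dataclass(frozen=True)
class ElemElem:    # (3) two elements, e.g. E1/E2 or E5/_*/E6
    left: object
    right: object
    any_depth: bool = False  # True for the '/_*/' form


@dataclass(frozen=True)
class Kleene:      # (4) closure of another subexpression
    inner: object


@dataclass(frozen=True)
class Union:       # (5) union of two other subexpressions
    left: object
    right: object


KINDS = (Single, ElemAttr, ElemElem, Kleene, Union)

expr = ElemElem(
    ElemElem(Kleene(ElemElem(Single("E1"), Single("E2"))), Single("E3")),
    Union(
        ElemAttr(Single("E4"), Single("@a"), "V"),
        ElemElem(Single("E5"), Single("E6"), any_depth=True),
    ),
)


def count_kinds(node, counts=None):
    """Count how many subexpressions of each kind the decomposition contains."""
    counts = Counter() if counts is None else counts
    counts[type(node).__name__] += 1
    for child in vars(node).values():
        if isinstance(child, KINDS):
            count_kinds(child, counts)
    return counts


print(count_kinds(expr))
```

Each inner node of this tree corresponds to one intermediate result that a later stage joins or combines.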
27
3 Algorithms
• 3 Algorithms:
– EA-Join: Element and Attribute Join.
– EE-Join: Element and Element Join.
– Kleene Closure.
28
EA-Join: Element and Attribute Join
Input:
{E1,…,Em}: Ei is a set of elements having a common document identifier (did);
{A1,…,An}: Aj is a set of attributes having a common document identifier (did);
Output:
A set of (e,a) pairs such that the element e is the parent of the attribute a.
29
EA-Join: Element and Attribute Join
The Algorithm:
// Sort-merge {Ei} and {Aj} by did.
(1) foreach Ei and Aj with the same did do:
      // Sort-merge Ei and Aj by
      // PARENT-CHILD relationship
(2)   foreach e ∈ Ei and a ∈ Aj do
(3)     if (e is a parent of a) then output (e,a)
      end
    end
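A sketch of the two-stage merge in Python. The record types, the `parent_order` field, and the dictionary-per-did input format are assumptions; matching dids through a key intersection is a simplification of the first merge stage. It relies on attributes being numbered directly after their parent element, so that the parent orders of order-sorted attributes never decrease.

```python
from typing import NamedTuple


class Element(NamedTuple):
    order: int
    size: int


class Attribute(NamedTuple):
    order: int
    parent_order: int  # assumption: order value of the owning element
    value: str


def ea_join(elements_by_doc, attributes_by_doc):
    """Return (did, element, attribute) triples where element is the parent.

    Both inputs map did -> list sorted by order value.
    """
    pairs = []
    # Stage 1: pair up groups with the same did.
    for did in sorted(elements_by_doc.keys() & attributes_by_doc.keys()):
        elements = elements_by_doc[did]
        i = 0
        # Stage 2: single forward scan; the element cursor never moves back.
        for a in attributes_by_doc[did]:
            while i < len(elements) and elements[i].order < a.parent_order:
                i += 1
            if i < len(elements) and elements[i].order == a.parent_order:
                pairs.append((did, elements[i], a))
    return pairs


# Hypothetical input for <Ele Att="A1"><Ele Att="A2"> </Ele></Ele>, stored as did 1.
eles = {1: [Element(1, 3), Element(3, 1)]}
atts = {1: [Attribute(2, 1, "A1"), Attribute(4, 3, "A2")]}

joined = ea_join(eles, atts)
print([(e.order, a.order) for _, e, a in joined])             # [(1, 2), (3, 4)]
print([e.order for _, e, a in joined if a.value == "A1"])     # [1]
```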
30
EA-Join – Example
• Consider the XML document:
<Ele Att="A1">
<Ele Att="A2"> </Ele>
</Ele>
• And the query: /Ele[@Att="A1"]
• <order, size> labels of the nodes:
Ele <1,3>
Att <2,0>
Ele <3,1>
Att <4,0>
31
EA-Join – Querying /Ele[@Att="A1"]
<Ele Att="A1">
<Ele Att="A2"> </Ele>
</Ele>
• Sort-merging the “Ele”s and “Att”s by parent-child relationship gives us the list: <1,3>, <2,0>, <3,1>, <4,0>
• Finding the “Ele” elements with a child attribute “Att” whose value is “A1” from the resulting list is easy using the information in the Element Record.
32
EA-Join – Comments
• Only a two-stage sort-merge operation, without the additional cost of sorting:
– First merge: by did.
– Second merge: by examining parent-child relationship.
• This merge is based on the order values of the element and attribute as defined by the numbering scheme.
• Attributes should be placed before their sibling elements in the order of the numbering scheme.
– This guarantees that elements and attributes with the same did can be merged in a single scan.
33
EE-Join: Element and Element Join
Input:
{E1,…,Em} and {F1,…,Fn}: Ei or Fj is a set of elements having a common document identifier (did).
Output:
A set of (e,f) pairs such that element e is an ancestor of element f.
34
EE-Join: Element and Element Join
The Algorithm:
// Sort-merge {Ei} and {Fj} by did.
(1) foreach Ei and Fj with the same did do:
      // Sort-merge Ei and Fj by the
      // ANCESTOR-DESCENDANT relationship.
(2)   foreach e ∈ Ei and f ∈ Fj do
(3)     if (e is an ancestor of f) then output (e,f);
      end
    end
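A sketch of the ancestor-descendant merge, under the same assumed input format (did -> list sorted by order). The start cursor into the descendant list only moves forward, but the inner scan can revisit descendants when ancestors are nested, so this stage is not a single scan. The section and figure numbers are made up for the example.

```python
from typing import NamedTuple


class Element(NamedTuple):
    order: int
    size: int


def ee_join(e_by_doc, f_by_doc):
    """Return (did, e, f) triples where e is an ancestor of f."""
    pairs = []
    # Stage 1: pair up groups with the same did (simplified to a key intersection).
    for did in sorted(e_by_doc.keys() & f_by_doc.keys()):
        f_list = f_by_doc[did]
        start = 0
        for e in e_by_doc[did]:
            # Skip descendants candidates that precede e; later e's start no earlier.
            while start < len(f_list) and f_list[start].order <= e.order:
                start += 1
            # Stage 2: emit every f with order(e) < order(f) <= order(e) + size(e).
            j = start
            while j < len(f_list) and f_list[j].order <= e.order + e.size:
                pairs.append((did, e, f_list[j]))
                j += 1
    return pairs


# Hypothetical document 1: section <2,6> contains section <4,3>;
# figures at orders 5 (inside both), 8 (outer section only), 9 (outside both).
sections = {1: [Element(2, 6), Element(4, 3)]}
figures = {1: [Element(5, 0), Element(8, 0), Element(9, 0)]}

result = ee_join(sections, figures)
print([(e.order, f.order) for _, e, f in result])  # [(2, 5), (2, 8), (4, 5)]
```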
35
EE-Join – Comments
• Only a two-stage sort-merge operation, without the additional cost of sorting:
– First merge: by did.
– Second merge: by examining ancestor-descendant relationship.
• The sets of elements with a matching did cannot be merged in a single scan.
36
Kleene Closure
Input:
{E1,…,Em}, where Ei is a group of elements from an XML document.
Output:
A Kleene closure of {E1,…,Em}.
37
Kleene Closure
The Algorithm:
(1) set i ← 1;
(2) set K_i^C ← {E1,…,Em};
(3) repeat
(4)   set i ← i + 1;
(5)   set K_i^C ← EE-Join(K_{i-1}^C, K_1^C);
    until (K_i^C is empty);
(6) output the union of K_1^C, K_2^C, …, K_i^C;
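The loop can be sketched as follows. Two things here are assumptions rather than the paper's definitions: each intermediate result is represented as a list of paths (tuples of elements), and `extend` is a simple nested-loop stand-in for EE-Join that appends an element when it is a descendant of the path's last element.

```python
from typing import NamedTuple


class Element(NamedTuple):
    order: int
    size: int


def is_ancestor(x, y):
    return x.order < y.order <= x.order + x.size


def extend(paths, base):
    """Stand-in for EE-Join: chain each path with each base path that starts below it."""
    return [p + b for p in paths for b in base if is_ancestor(p[-1], b[0])]


def kleene_closure(elements):
    k1 = [(e,) for e in elements]   # steps (1)-(2): K_1
    levels = [k1]
    current = k1
    while True:                     # step (3): repeat
        current = extend(current, k1)   # steps (4)-(5): K_i from K_{i-1} and K_1
        if not current:             # until K_i is empty
            break
        levels.append(current)
    # step (6): union of all levels
    return [path for level in levels for path in level]


# Hypothetical sections: <2,6> contains <4,3>; <9,1> is separate.
a, b, c = Element(2, 6), Element(4, 3), Element(9, 1)
closure = kleene_closure([a, b, c])
print(len(closure))        # 4
print((a, b) in closure)   # True
```

The loop terminates because every round adds one more nesting level to each path, and nesting depth is bounded by the depth of the tree.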
38
Performance Experiments
• EE-Join:
• Results:
– Real World: an order of magnitude faster.
– Synthetic Data: 6 to 10 times faster.
39
Performance Experiments
• EA-Join:
• Results:
– Compared to Top-Down: a better performance.
– Compared to Bottom-Up: no winner, close results.
40
Performance Results - Conclusions
• The proposed algorithms can achieve performance improvement over the conventional methods (top-down and bottom-up tree traversals) by up to an order of magnitude.