1
Introduction To XML Algebra
Wan Liu, Bintou Kane
Advanced Database, Instructor: Elka
2/11/2002
2
Outline
Reasons for XML algebra
Niagara algebra
AT&T algebra
3
Data Model and Design
We need a clear framework to design a database.
A data model is like creating different data structures for appropriate programming usage. It is a type system, and it is abstract.
A relational database is implemented with tables; the XML format is a new method for information integration.
4
Why XML Algebra?
It is common to translate a query language into an algebra.
First, the algebra is used to give a semantics for the query language.
Second, the algebra is used to support query optimization.
5
XML Algebra History
Lore Algebra (August 1999) -- Stanford University
IBM Algebra (September 1999) -- Oracle; IBM; Microsoft Corp
YAT Algebra (May 2000)
AT&T Algebra (June 2000) -- AT&T; Bell Labs
Niagara Algebra (2001) -- University of Wisconsin-Madison
6
NIAGARA
Title: Following the paths of XML Data: An algebraic framework for XML query evaluation
By: Leonidas Galanis, Efstratios Viglas, David J. DeWitt, Jeffrey F. Naughton, and David Maier.
7
Outline
Concepts of Niagara Algebra
Operations
Optimization
8
Goals of Niagara Algebra
Be independent of schema information
Query on both structure and content
Generate simple, flexible, yet powerful algebraic expressions
Allow re-use of traditional optimization techniques
9
Example: XML Source Documents
Invoice.xml
<Invoice_Document>
  <invoice No="1">
    <account_number>2</account_number>
    <carrier>AT&amp;T</carrier>
    <total>$0.25</total>
  </invoice>
  <invoice>
    <account_number>1</account_number>
    <carrier>Sprint</carrier>
    <total>$1.20</total>
  </invoice>
  <invoice>
    <account_number>1</account_number>
    <carrier>AT&amp;T</carrier>
    <total>$0.75</total>
  </invoice>
</Invoice_Document>
Customer.xml
<Customer_Document>
  <customer>
    <account>1</account>
    <name>Tom</name>
  </customer>
  <customer>
    <account>2</account>
    <name>George</name>
  </customer>
</Customer_Document>
10
XML Data Model and Tree Graph
An XML document is modeled as an ordered tree graph of semi-structured data.
Example:
<Invoice_Document>
  <invoice>
    <number>2</number>
    <carrier>AT&amp;T</carrier>
    <total>$0.25</total>
  </invoice>
  <invoice>
    <number>1</number>
    <carrier>Sprint</carrier>
    <total>$1.20</total>
  </invoice>
</Invoice_Document>
[Figure: tree with root Invoice_Document and two Invoice children; each Invoice has number, carrier, and total children, with leaf values (2, AT&T, $0.25) and (1, Sprint, $1.20).]
11
XML Data Model [GVDNM01]
Collection of bags of vertices. Vertices in a bag have no order.
Example: a bag with three vertices
[Root "invoice.xml", invoice, invoice.account_number]
where invoice is the vertex <invoice>invoice-element-content</invoice>
and invoice.account_number is the vertex <account_number>element-content</account_number>.
12
Data Model
Bag elements are reachable by path expressions.
A path expression consists of two parts:
  an entry point
  a relative forward part
Example: account_number:invoice
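The two-part structure of a path expression can be sketched in Haskell, the notation this deck uses later for the AT&T algebra. This is only an illustration: the type PathExpr, its field names, and the helpers splitOn and parsePath are hypothetical, and the sketch assumes that the part after the colon is the entry point and the part before it is the relative forward part, as the Follow example (carrier:invoice) suggests.

```haskell
-- A path expression in entry point notation, e.g. "account_number:invoice".
-- The type and field names are illustrative, not taken from the Niagara paper.
data PathExpr = PathExpr
  { forwardPart :: String  -- relative forward part (assumed: text before ':')
  , entryPoint  :: String  -- entry point (assumed: text after ':')
  } deriving (Eq, Show)

-- Split a string on a separator character.
splitOn :: Char -> String -> [String]
splitOn sep s = case break (== sep) s of
  (chunk, [])       -> [chunk]
  (chunk, _ : rest) -> chunk : splitOn sep rest

-- Parse "forward:entry"; any other shape is rejected.
parsePath :: String -> Maybe PathExpr
parsePath s = case splitOn ':' s of
  [fwd, entry]
    | not (null fwd), not (null entry) -> Just (PathExpr fwd entry)
  _ -> Nothing
```

Loaded into GHCi, parsePath "account_number:invoice" yields a PathExpr whose entry point is "invoice", while a string without a colon yields Nothing.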
13
Operators
Source S, Follow Φ, Select, Join, Rename, Expose, Vertex, Group, Union, Intersection, Difference -, Cartesian Product.
14
Source Operator S
Input: a list of documents
Output: a collection of singleton bags
Examples:
S (*): all known XML documents
S (invoice*.xml): all XML documents whose filename matches "invoice*.xml"
S (*, schema.dtd): all known XML documents that conform to schema.dtd
15
Follow operator
Input: a path expression in entry point notation
Functionality: extracts the vertices reachable by the path expression
Output: a new bag that consists of the extracted vertex plus all the contents of the original bag (in the case of unnesting follow)
16
Follow operator (Example*)
Input: {[Root invoice.xml, invoice]}
Operation: Φμ(carrier:invoice)
Output: {[Root invoice.xml, invoice, invoice.carrier]}
where invoice is <invoice>invoice-element-content</invoice>
and invoice.carrier is <carrier>carrier-element-content</carrier>.
*Unnesting Follow
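A minimal Haskell sketch of an unnesting follow, restricted to one-step paths such as carrier:invoice. The names Vertex, Bag and followUnnest are hypothetical, a bag is modeled as an association list from labels to vertices (the Root entry is left out for brevity), and the sketch assumes that a bag with no matching child yields no output bag.

```haskell
-- A tiny XML tree: tag, text content, child elements.
data Vertex = Vertex
  { vTag      :: String
  , vText     :: String
  , vChildren :: [Vertex]
  } deriving (Eq, Show)

-- A bag maps labels such as "invoice" to vertices.
-- An association list is used for brevity; the model treats bags as unordered.
type Bag = [(String, Vertex)]

-- Unnesting follow for a one-step path "child:entry": for every child of the
-- entry vertex with the given tag, emit the original bag extended with it.
followUnnest :: String -> String -> [Bag] -> [Bag]
followUnnest child entry bags =
  [ bag ++ [(entry ++ "." ++ child, c)]
  | bag <- bags
  , Just v <- [lookup entry bag]   -- bags without the entry point are skipped
  , c <- vChildren v
  , vTag c == child
  ]

invoice1 :: Vertex
invoice1 = Vertex "invoice" ""
  [ Vertex "account_number" "1" []
  , Vertex "carrier" "Sprint" []
  ]

-- Corresponds to applying (carrier:invoice) to {[invoice]}.
followed :: [Bag]
followed = followUnnest "carrier" "invoice" [[("invoice", invoice1)]]
```

Here followed contains one bag with the labels invoice and invoice.carrier, mirroring the input and output bags on the slide.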
17
Select operator
Input: a set of bags
Functionality: filters the bags of a collection using a predicate
Output: a set of bags that conform to the predicate
Predicate: logical operators (∧, ∨, ¬) or simple qualifications (>, <, ≥, ≤, =, ≠)
18
Select operator (Example)
Predicate: invoice.carrier = Sprint
Input: {[Root invoice.xml, invoice], [Root invoice.xml, invoice], ...}
Output: {[Root invoice.xml, invoice], ...}
Each invoice is a vertex <invoice>invoice-element-content</invoice>; only the bags that satisfy the predicate remain in the output.
19
Join operator
Input: two collections of bags
Functionality: joins the two collections based on a predicate
Output: the concatenation of the pairs of bags that satisfy the predicate
20
Join operator (Example)
Predicate: account_number:invoice = number:customer
Left input: {[Root invoice.xml, invoice]}
Right input: {[Root customer.xml, customer]}
Output: {[Root invoice.xml, invoice, Root customer.xml, customer]}
where invoice is <invoice>invoice-element-content</invoice>
and customer is <customer>customer-element-content</customer>.
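The join can be sketched in Haskell as a filtered pairing of bags. This is a simplification, not the Niagara implementation: bags are reduced to association lists from labels to text values, and the names Bag, joinBags, sameValue and the labels invoice.account_number and customer.account are chosen for illustration.

```haskell
-- Simplified bag: label -> text value (full vertices are left out).
type Bag = [(String, String)]

-- Join two collections of bags: concatenate every pair that satisfies p.
joinBags :: (Bag -> Bag -> Bool) -> [Bag] -> [Bag] -> [Bag]
joinBags p ls rs = [ l ++ r | l <- ls, r <- rs, p l r ]

-- Predicate: the value under the left label equals the value under the right label.
sameValue :: String -> String -> Bag -> Bag -> Bool
sameValue leftLabel rightLabel l r =
  case (lookup leftLabel l, lookup rightLabel r) of
    (Just a, Just b) -> a == b
    _                -> False

invoiceBags :: [Bag]
invoiceBags =
  [ [("invoice.account_number", "2")]
  , [("invoice.account_number", "1")]
  ]

customerBags :: [Bag]
customerBags =
  [ [("customer.account", "1"), ("customer.name", "Tom")]
  , [("customer.account", "2"), ("customer.name", "George")]
  ]

joined :: [Bag]
joined = joinBags (sameValue "invoice.account_number" "customer.account")
                  invoiceBags customerBags
```

Each bag in joined is the concatenation of one invoice bag and the customer bag with the same account number.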
21
Expose operator
Input: a list of path expressions of vertices to be exposed
Output: a set of bags that contain the vertices in the parameter list, in the same order
22
Expose operator (Example)
Input: {[Root invoice.xml, invoice, invoice.carrier, invoice.bill_period]}
Operation: expose (bill_period, carrier)
Output: {[Root invoice.xml, invoice.bill_period, invoice.carrier]}
23
Vertex operator
Creates the actual XML vertex that will encompass everything created by an expose operator
Example :
(Customer_invoice)[((account)[invoice.account_number], (inv_total)[invoice.total])]
24
Other operators
Group: used for arbitrary grouping of elements based on their values. Aggregate functions (e.g. average) can be used with the group operator.
Rename: changes the entry point annotation of the elements of a bag. Example: (invoice.bill_period, date)
25
Example: XML Source Documents
Invoice.xml
<Invoice_Document>
  <invoice>
    <account_number>2</account_number>
    <carrier>AT&amp;T</carrier>
    <total>$0.25</total>
  </invoice>
  <invoice>
    <account_number>1</account_number>
    <carrier>Sprint</carrier>
    <total>$1.20</total>
  </invoice>
  <invoice>
    <account_number>1</account_number>
    <total>$0.75</total>
  </invoice>
  <auditor>maria</auditor>
</Invoice_Document>
Customer.xml
<Customer_Document>
  <customer>
    <account>1</account>
    <name>Tom</name>
  </customer>
  <customer>
    <account>2</account>
    <name>George</name>
  </customer>
</Customer_Document>
26
XQuery Example
List account number, customer name, and invoice total for all invoices that have carrier = "Sprint".
FOR $i in (invoices.xml)//invoice,
    $c in (customers.xml)//customer
WHERE $i/carrier = "Sprint" and
      $i/account_number = $c/account
RETURN
  <Sprint_invoices>
    $i/account_number,
    $c/name,
    $i/total
  </Sprint_invoices>
27
Example: XQuery output
<Sprint_invoices>
  <account_number>1</account_number>
  <name>Tom</name>
  <total>$1.20</total>
</Sprint_invoices>
28
Algebra Tree Execution
Invoice branch:
  Source (Invoices.xml)
  Follow (*.invoice): invoice (1), invoice (2), invoice (3)
  Select (carrier = "Sprint"): invoice (2)
Customer branch:
  Source (customers.xml)
  Follow (*.customer): customer (1), customer (2)
Join (*.invoice.account_number = *.customer.account): invoice (2), customer (1)
Expose (*.account_number, *.name, *.total): account_number, name, total
Optimization with Niagara
Optimizer based on the Niagara algebra
Use the operations more efficiently
Produce simpler expressions by combining operations
30
Language Convention
A and B are path expressions
A < B --- path expression A is a prefix of B
AnB --- common prefix of paths A and B
AńB --- greatest common prefix of paths A and B
┴ --- null path expression
31
Use of Rule 8.5
Take advantage of Rule 8.5
Allows optimization based on path selectivity
Applies to the unnesting follow operation Φμ
32
Interchangeability of Follow operations
Φμ(A)[Φμ(B)] = Φμ(B)[Φμ(A)]
True when either:
  there exists C such that C < A && C < B, with C = AńB
  or AnB = ┴
33
Application of 8.5 With Invoice
Φμ(acc_Num:invoice)[Φμ(carrier:invoice)] *
?=Φμ(carrier:invoice)[Φμ(acc_Num:invoice)] **
Both Share the common prefix invoice
Case AńB = invoice
34
Benefit of Rule Application
Note: if acc_Num is required for each invoice element, and carrier is not required for an invoice element,
then using *
Φμ(acc_Num:invoice)[Φμ(carrier:invoice)]
makes more sense than **. Why?
35
Reduction of input size by the first sub-operation
Φμ(carrier:invoice)
Should we, or can we, apply Rule 8.5 below?
Φμ(acc_Num:invoice)[Φμ(acc_Num:Customer)]
Why?
36
acc_Num:invoice and acc_Num:Customer are totally different paths.
The case is AnB = ┴, so yes.
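One possible reading of the side condition of Rule 8.5 can be written as a small Haskell check. Everything here is an assumption made for illustration: a path is encoded as a list of labels starting at the entry point (so acc_Num:invoice becomes ["invoice", "acc_Num"]), A < B is read as "A is a proper prefix of B", the empty list stands for the null path, and the names Path, commonPrefix and canInterchange are hypothetical.

```haskell
-- Labels from the entry point outward (an assumed encoding).
type Path = [String]

-- Greatest common prefix of two paths; [] plays the role of the null path.
commonPrefix :: Path -> Path -> Path
commonPrefix (x : xs) (y : ys) | x == y = x : commonPrefix xs ys
commonPrefix _ _ = []

-- Two unnesting follows may be interchanged when the paths have no common
-- prefix, or when their greatest common prefix is a proper prefix of both.
-- Equivalently: neither path is a prefix of the other.
canInterchange :: Path -> Path -> Bool
canInterchange a b = null c || (c /= a && c /= b)
  where c = commonPrefix a b
```

Under this reading, canInterchange ["invoice", "acc_Num"] ["invoice", "carrier"] and canInterchange ["invoice", "acc_Num"] ["Customer", "acc_Num"] both hold, matching the two cases worked through on the slides, while a path and one of its own extensions cannot be swapped.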
37
Rules 8.7, 8.9, 8.11
Interesting: they help identify when and where to use selection to decrease the size of the input to subsequent operations.
Example: in the algebra tree on slide 28, the selection is applied before the join.
38
A possible addition
Give a procedure for automatically detecting when a rule can be applied in a given case, and then applying it.
39
AT&T Algebra
40
41
AT&T Algebra Introduction
The algebra is derived from the nested relational algebra.
AT&T algebra makes heavy use of list comprehensions, a standard notation in the functional programming community.
AT&T algebra uses the functional programming language Haskell as a notation for presenting the algebra.
42
AT&T data model
The data model merges attribute and element nodes, and eliminates comments.
Declare basic type: Node.
text :: String -> Node
elem :: Tag -> [Node] -> Node
ref :: Node -> Node
Example document:
<bib>
  <book year="1999">
    <title>Data on the Web</title>
    <year>1999</year>
  </book>
</bib>
Representation in the data model:
elem "bib" [
  elem "book" [
    elem "@year" [ text "1999" ],
    elem "title" [ text "Data on the Web" ] ] ]
43
Basic Type Declarations
To find the type of a node:
isText :: Node -> Bool
isElem :: Node -> Bool
isRef :: Node -> Bool
For a text node:
string :: Node -> String
For an element node:
tag :: Node -> Tag
children :: Node -> [Node]
For a reference node:
dereference :: Node -> Node
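The signatures above can be given one possible concrete definition in Haskell. The slides only state the types, so the representation of Node as an algebraic data type with constructors Text, Elem and Ref is an assumption, as is the choice to make the accessors fail with error on the wrong kind of node. Prelude's own elem is hidden so that the slide's name can be reused.

```haskell
import Prelude hiding (elem)

type Tag = String

-- Assumed representation: the slides give only the function signatures.
data Node
  = Text String
  | Elem Tag [Node]
  | Ref Node
  deriving (Eq, Show)

-- Constructors under the names used on the slide.
text :: String -> Node
text = Text

elem :: Tag -> [Node] -> Node
elem = Elem

ref :: Node -> Node
ref = Ref

-- Tests for the kind of a node.
isText, isElem, isRef :: Node -> Bool
isText (Text _) = True
isText _        = False
isElem (Elem _ _) = True
isElem _          = False
isRef (Ref _) = True
isRef _       = False

-- Partial accessors: each is defined for one kind of node only.
string :: Node -> String
string (Text s) = s
string _        = error "string: not a text node"

tag :: Node -> Tag
tag (Elem t _) = t
tag _          = error "tag: not an element node"

children :: Node -> [Node]
children (Elem _ ns) = ns
children _           = error "children: not an element node"

dereference :: Node -> Node
dereference (Ref n) = n
dereference _       = error "dereference: not a reference node"

-- The bib document from the data model slide.
bib0 :: Node
bib0 = elem "bib"
  [ elem "book"
      [ elem "@year" [ text "1999" ]
      , elem "title" [ text "Data on the Web" ]
      ]
  ]
```

With these definitions, tag bib0 is "bib" and the only child of bib0 is the book element.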
44
Nested relational algebra…
In the nested relational approach, data is composed of tuples and lists.
Tuple values and tuple types are written in round brackets.
(1999,"Data on the Web",["Abiteboul"]) :: (Int,String,[String])
Decompose values:
year :: (Int,String,[String]) -> Int
year (x,y,l) = x
45
Nested relational algebra…
Comprehensions: list comprehensions can be used to express fundamental query operations such as navigation, cartesian product, nesting, and joins.
Example:
[ value x | x <- children book0, is "author" x ]
==> [ "Abiteboul" ]
General form: [ exp | qual1,...,qualn ], where each qualifier is either a filter (bool-exp) or a generator (pat <- list-exp).
46
Nested relational algebra…
Using comprehensions to write queries.
Navigate:
follow :: Tag -> Node -> [Node]
follow t x = [ y | y <- children x, is t y ]
Cartesian product:
[ (value y, value z) | x <- follow "book" bib0, y <- follow "title" x, z <- follow "author" x ]
==> [ ("Data on the Web", "Abiteboul") ]
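The navigation and cartesian product queries can be made runnable with a few supporting definitions. The slides do not define is and value, so the versions below are assumptions: is tests the tag of an element, and value returns the text of an element that holds a single text node. The document bib0 is given one author here so that the result has the same shape as on the slide.

```haskell
import Prelude hiding (elem)

type Tag = String

-- Assumed node representation (text and element nodes only).
data Node = Text String | Elem Tag [Node]
  deriving (Eq, Show)

text :: String -> Node
text = Text

elem :: Tag -> [Node] -> Node
elem = Elem

-- Children of an element; a text node has none in this sketch.
children :: Node -> [Node]
children (Elem _ ns) = ns
children _           = []

-- is t x: x is an element whose tag is t (assumed meaning).
is :: Tag -> Node -> Bool
is t (Elem t' _) = t == t'
is _ _           = False

-- value x: the text of an element holding exactly one text node (assumed meaning).
value :: Node -> String
value (Elem _ [Text s]) = s
value _                 = error "value: expected an element with a single text child"

-- Navigation, as on the slide.
follow :: Tag -> Node -> [Node]
follow t x = [ y | y <- children x, is t y ]

bib0 :: Node
bib0 = elem "bib"
  [ elem "book"
      [ elem "@year"  [ text "1999" ]
      , elem "title"  [ text "Data on the Web" ]
      , elem "author" [ text "Abiteboul" ]
      ]
  ]

-- Cartesian product of titles and authors within each book.
titleAuthorPairs :: [(String, String)]
titleAuthorPairs =
  [ (value y, value z)
  | x <- follow "book" bib0
  , y <- follow "title" x
  , z <- follow "author" x
  ]
```

Evaluating titleAuthorPairs gives the single pair of "Data on the Web" and "Abiteboul"; a book with several authors would contribute one pair per author.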
47
Nested relational algebra…
Joins.
reviews0:
elem "reviews" [
  elem "book" [
    elem "title" [ text "Data on the Web" ],
    elem "review" [ text "This is great!" ] ] ]
bib0:
elem "bib" [
  elem "book" [
    elem "@year" [ text "1999" ],
    elem "title" [ text "Data on the Web" ] ] ]
Query:
[ (value y, int (value z), value w) | x <- follow "book" bib0,
  y <- follow "title" x,
  z <- follow "@year" x,
  u <- follow "book" reviews0,
  v <- follow "title" u,
  w <- follow "review" u,
  y == v ]
==> [("Data on the Web", 1999, "This is great!")]
48
Nested relational algebra…
Regular expression matching
match :: Reg a -> Node -> [a]
( [ (x,y,u) | x <- item "@year", y <- item "title", u <- rep (item "author") ] ) :: Reg (Node,Node,[Node])
match reg0 book0
Result:
==> [(elem "@year" [text "1999"],
      elem "title" [text "Data on the Web"],
      [elem "author" [text "Abiteboul"],
       elem "author" [text "Buneman"],
       elem "author" [text "Suciu"] ] ) ]
49
Nested relational algebra…
Sorting.
sortBy :: (a -> a -> Bool) -> [a] -> [a]
sortBy (<=) [3,1,2,1] ==> [1,1,2,3]
Grouping.
groupBy :: (a -> a -> Bool) -> [a] -> [[a]]
groupBy (==) [3,1,2,1] ==> [[2],[1,1],[3]]
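The sortBy and groupBy signatures on this slide take a Boolean predicate, which differs from the functions of the same name in Haskell's Data.List (there, sortBy takes an Ordering-valued comparison and groupBy groups only adjacent elements). A minimal sketch of definitions that fit the slide's signatures follows; it is not the algebra's actual definition, and it returns groups in order of first occurrence, so the groups for [3,1,2,1] are the same as on the slide but listed in a different order.

```haskell
-- Insertion sort driven by a "less than or equal" predicate.
sortBy :: (a -> a -> Bool) -> [a] -> [a]
sortBy le = foldr insert []
  where
    insert x [] = [x]
    insert x (y : ys)
      | le x y    = x : y : ys
      | otherwise = y : insert x ys

-- Group all elements related by an equivalence predicate,
-- whether or not they are adjacent in the input.
groupBy :: (a -> a -> Bool) -> [a] -> [[a]]
groupBy _  []       = []
groupBy eq (x : xs) =
  (x : filter (eq x) xs) : groupBy eq (filter (not . eq x) xs)

sorted :: [Int]
sorted = sortBy (<=) [3, 1, 2, 1]

grouped :: [[Int]]
grouped = groupBy (==) [3, 1, 2, 1]
```

Here sorted evaluates to [1,1,2,3] and grouped to [[3],[1,1],[2]].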
50
Cross Comparisons of Algebra
Niagara and AT&T are standalone XML algebras.
Niagara was proposed after the W3C had selected its proposed standard, and has operators which operate on sets of bags.
The AT&T algebra was chosen as the proposed standard by the W3C.
-- expressions resemble a high-level query language
-- the latest version of the document is referred to as "Semantics of XML Query Language XQuery"
51
Future Work
Need more evaluation strategies that allow for flexible query plans
Develop physical operators that take advantage of physical storage structures, and generate a mapping from the query tree to a physical query plan