XStreamCast OGI 11/12/03
XStreamCast: Broadcasting and Query Processing of Streamed XML
Leonidas Fegaras
University of Texas at Arlington
The XStreamCast Group
Faculty:
Leonidas Fegaras
David Levine
PhD Students:
Sujoe Bose
Weimin He
Hao Zhou
Tejas Shah
Masters Students:
Vamsi K. Chaluvadi
Darsan Tatineni
Sravani Reddy
Funded by NSF (will start on 1/1/04).
Web page: http://lambda.uta.edu/XStreamCast/
The XStreamCast Architecture
Most web servers are pull-based: a client submits a request, and the server returns the requested data. This scales poorly when a very large number of clients request similar query results.
Push-based dissemination: a server multicasts a stream of data to registered clients.
In our framework:
• A client registers with a server using a pull-based web service
• A server multicasts data to registered clients in a continuous stream
– Data are often derived by merging multiple input streams (e.g., sensor data)
– The server does not have any knowledge about the client queries
– The only task performed by the server is slicing, scheduling, and multicasting data:
  • Critical data may be repeated more often than non-critical data
  • Invalid data may be revoked
  • New updates may be broadcast as soon as they become available
• A client connects to multiple streams and evaluates continuous queries locally
– It doesn’t register queries with the servers
– All processing is done at the client side
– No handshaking, no error-correction
The XStreamCast Data Model
• Based on XML rather than on flat relational data
• The server slices an XML data source into XML fragments. Each fragment:
– is a filler that fills a hole
– may contain holes, which can be filled by other fragments
– is wrapped with control information, such as its unique hole ID, the path that reaches this fragment, etc.
• Hole IDs
– are similar to surrogates but are hidden from clients
– are less restrictive than hierarchical key structures
• A continuous stream consists of a fragmented XML data source followed by continuous updates
– The unit of update is a fragment
– Snapshot view: a hole ID is associated with the latest update
– Temporal view: a hole ID is associated with the sequence of all updates
The Fragmented Hole-Filler Model
<commodities>
  <vendor>
    <name> Wal-Mart </name>
    <items>
      <stream:hole id="10" tsid="5"/>
      <stream:hole id="20" tsid="5"/>
      ...
    </items>
  </vendor>
  ...
</commodities>
<stream:filler id="10" tsid="5">
<item>
<name> PDA </name>
<make> HP </make>
<model> PalmPilot </model>
<price currency="USD">315.25</price>
</item>
</stream:filler>
<stream:filler id="20" tsid="5">
<item>
<name> Calculator </name>
<make> Casio </make>
<model> FX-100 </model>
<price currency="USD">50.25</price>
</item>
</stream:filler>
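The way fillers plug into holes can be sketched in Python. This is a minimal illustration under stated assumptions, not the XStreamCast implementation: the function name fill_holes is hypothetical, the tags are written as plain hole and filler (without the stream: prefix, so that no namespace declaration is needed), the tsid attribute is omitted, and the snapshot view is modeled by letting a later filler with the same id overwrite an earlier one.

```python
import xml.etree.ElementTree as ET

def fill_holes(element, fillers):
    """Return a copy of `element` in which every <hole id=.../> that has a
    matching filler is replaced by that filler's children.
    Holes without a filler are copied unchanged."""
    result = ET.Element(element.tag, dict(element.attrib))
    result.text = element.text
    for child in element:
        if child.tag == "hole" and child.get("id") in fillers:
            # Splice in the filler payload, filling its own holes as well.
            for payload in fillers[child.get("id")]:
                result.append(fill_holes(payload, fillers))
        else:
            result.append(fill_holes(child, fillers))
    return result

# A root fragment with three holes; only two fillers have arrived so far.
root = ET.fromstring(
    '<items><hole id="10"/><hole id="20"/><hole id="30"/></items>')
stream = [
    '<filler id="10"><item><name>PDA</name></item></filler>',
    '<filler id="20"><item><name>Calculator</name></item></filler>',
]

# Snapshot view (assumption): keep only the latest filler per hole id.
fillers = {}
for text in stream:
    filler = ET.fromstring(text)
    fillers[filler.get("id")] = filler

doc = fill_holes(root, fillers)
names = [n.text for n in doc.iter("name")]
unfilled = [h.get("id") for h in doc.iter("hole")]
```

Here names collects the item names that could be reconstructed, while unfilled lists the hole ids still waiting for a filler.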
Query Processing
A client opens connections to streams and evaluates XQueries against these streams
• The data view at the client side is the unfragmented data source
• For large streams, it’s a bad idea to reconstruct the streamed data in the client’s memory
– need to process fragments as soon as they become available from the server
• Some operators block or require unbounded memory:
– Sorting
– Joins between two streams or self-joins
– Group-by with aggregation
Rest of the Talk
• An algebra for stored XML data
• An algebra for streamed XML data (snapshot view)
• The XCQL query language for querying time-varying streamed XML data (temporal view)
• Schema-based translation of XCQL
An Algebra for Stored XML Data
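Based on the nested-relational algebra:
\[
\begin{array}{ll}
\wp_{v}(T) & \text{access the XML data source } T \text{ using } v \\
\sigma_{pred}(X) & \text{select fragments from } X \text{ that satisfy } pred \\
\pi_{v_1,\ldots,v_n}(X) & \text{project} \\
X \cup Y & \text{merge} \\
X \bowtie_{pred} Y & \text{join} \\
\mu^{v,path}_{pred}(X) & \text{unnest (retrieve descendants of elements)} \\
\Delta^{\oplus,h}_{pred}(X) & \text{apply } h \text{ and reduce by } \oplus \\
\Gamma^{v,\oplus,h}_{gs,pred}(X) & \text{group by } gs \text{, apply } h \text{ to each group, and reduce each group by } \oplus
\end{array}
\]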
Semantics
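\[
\begin{array}{lcl}
\wp_{v}(T) & = & \{\, \langle v = T \rangle \,\} \\
\sigma_{pred}(X) & = & \{\, t \mid t \in X,\ pred(t) \,\} \\
\pi_{v_1,\ldots,v_n}(X) & = & \{\, \langle v_1 = t.v_1, \ldots, v_n = t.v_n \rangle \mid t \in X \,\} \\
X \cup Y & = & X \mathbin{+\!\!+} Y \\
X \bowtie_{pred} Y & = & \{\, t_x \circ t_y \mid t_x \in X,\ t_y \in Y,\ pred(t_x, t_y) \,\} \\
\mu^{v,path}_{pred}(X) & = & \{\, t \circ \langle v = w \rangle \mid t \in X,\ w \in \mathrm{PATH}(t, path),\ pred(t, w) \,\} \\
\Delta^{\oplus,h}_{pred}(X) & = & \oplus / \{\, h(t) \mid t \in X,\ pred(t) \,\} \\
\Gamma^{v,\oplus,h}_{gs,pred}(X) & = & \ldots
\end{array}
\]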
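As an illustration of these set-comprehension semantics, the select, project, merge, join, and reduce operators can be sketched in Python over lists of dictionaries. All names here (select, project, merge, join, reduce_by) are hypothetical, tuples are modeled as dicts, tuple concatenation as dict merging, and the unit argument (the identity of the reduction operator) is an assumption the slides do not spell out.

```python
from functools import reduce as fold

def select(pred, xs):
    # { t | t in X, pred(t) }
    return [t for t in xs if pred(t)]

def project(names, xs):
    # { <v1 = t.v1, ..., vn = t.vn> | t in X }
    return [{v: t[v] for v in names} for t in xs]

def merge(xs, ys):
    # X ++ Y
    return xs + ys

def join(pred, xs, ys):
    # { tx concatenated with ty | tx in X, ty in Y, pred(tx, ty) }
    return [{**tx, **ty} for tx in xs for ty in ys if pred(tx, ty)]

def reduce_by(op, unit, h, pred, xs):
    # Fold `op` over { h(t) | t in X, pred(t) }, starting from `unit`.
    return fold(op, (h(t) for t in xs if pred(t)), unit)

users = [{"u": "ann", "uid": 1}, {"u": "bob", "uid": 2}]
bids = [{"bidder": 1, "amount": 30},
        {"bidder": 1, "amount": 45},
        {"bidder": 2, "amount": 10}]

joined = join(lambda x, y: x["uid"] == y["bidder"], users, bids)
big = select(lambda t: t["amount"] > 20, joined)
names = project(["u"], big)
total = reduce_by(lambda a, b: a + b, 0,
                  lambda t: t["amount"], lambda t: True, joined)
```

Lists rather than sets are used so that duplicates and order are preserved, which matches the ++ reading of merge.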
XPath Expressions
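• Path evaluation is central to the algebra:
\[
\mathrm{PATH} : (\text{XML-data},\ \text{simple-XPath}) \rightarrow \mathrm{set}(\text{XML-data})
\]
• Some rules for stored XML data:
\[
\begin{array}{lcl}
\mathrm{PATH}(\texttt{<A>}x\texttt{</A>},\ A/path) & = & \mathrm{PATH}(x,\ path) \\
\mathrm{PATH}(\texttt{<A>}x\texttt{</A>},\ A) & = & \{\, \texttt{<A>}x\texttt{</A>} \,\} \\
\mathrm{PATH}(x_1\, x_2,\ path) & = & \mathrm{PATH}(x_1,\ path) \cup \mathrm{PATH}(x_2,\ path) \\
\mathrm{PATH}(x,\ path) & = & \emptyset \quad \text{otherwise}
\end{array}
\]
• Predicates have existential semantics:
\[
\texttt{\$v/A/B = "text"} \;\equiv\; \exists x \in \mathrm{PATH}(v,\ A/B) : x = \text{"text"}
\]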
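The PATH rules and the existential reading of predicates can be sketched in Python. This is only an illustration: the function names path and exists_equal are hypothetical, XML data is modeled with xml.etree.ElementTree, a forest is a list of elements, and a simple path is a list of tag names.

```python
import xml.etree.ElementTree as ET

def path(nodes, steps):
    """PATH over a forest `nodes` for a simple path given as a list of tags."""
    results = []
    for node in nodes:  # PATH(x1 x2, p) is the union of PATH(x1, p) and PATH(x2, p)
        if node.tag != steps[0]:
            continue    # no rule applies: this node contributes nothing
        if len(steps) == 1:
            results.append(node)  # PATH(<A>x</A>, A) = { <A>x</A> }
        else:
            # PATH(<A>x</A>, A/p) = PATH(x, p), where x is the children of <A>
            results.extend(path(list(node), steps[1:]))
    return results

def exists_equal(nodes, steps, text):
    """Existential semantics: true if SOME node reached by the path has this text."""
    return any((n.text or "").strip() == text for n in path(nodes, steps))

doc = ET.fromstring(
    "<bib>"
    "<book><title>TCP/IP</title><publisher>Addison-Wesley</publisher></book>"
    "<book><title>Data</title><publisher>Morgan Kaufmann</publisher></book>"
    "</bib>")

titles = [t.text for t in path([doc], ["bib", "book", "title"])]
has_aw = exists_equal([doc], ["bib", "book", "publisher"], "Addison-Wesley")
no_match = path([doc], ["bib", "article"])
```

The predicate holds for the document as a whole because one of the two publishers matches, even though the other does not.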
Transforming XQueries to the Algebra
Transformation steps:
1. XQueries to list comprehensions
– XPath terms to simple paths without predicates
2. Normalization of nested comprehensions
– Generator domains are normalized into simple path expressions
3. List comprehensions to the XML algebra
Example #1
[Figure: the query and its algebraic plan. Plan labels: source document("http://www.bn.com") bound to $v; path $v/bib/book bound to $b; predicate $b/publisher="Addison-Wesley" and $b/@year > 1991; result constructor element("book",$b/title).]
Example #2
for $u in document("users.xml")//user_tuple
return <user> { $u/name }
              { for $b in document("bids.xml")//bid_tuple[userid=$u/userid]/itemno,
                    $i in document("items.xml")//item_tuple[itemno=$b]
                return <bid> { $i/description/text() } </bid>
                sortby(.) }
       </user>
sortby(name)
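[Figure: algebraic plan for Example #2. Plan labels: sources document("users.xml"), document("bids.xml"), and document("items.xml"), bound to $us, $bs, and $is; paths $us/users/user_tuple, $bs/bids/bid_tuple, $is/items/item_tuple, and $c/itemno, with variables $u, $c, $i, and $b; predicates $c/userid=$u/userid and $i/itemno=$b; result constructors sort, elem("bid",$i/description/text()) and sort($u/name), elem("user",$u/name ++ …).]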
Algebraic Optimization
• Optimizing query expressions as in relational algebra
• Query unnesting
– Nested queries are normally executed in nested-loop fashion
– This is not possible in stream-based processing
– Blocking operators are replaced with non-blocking outer versions
The Streamed XML Algebra
Much like the stored XML algebra, but works on streams.
The streams between operators are streams of tuples with fragments as tuple components.
An input fragment is stored in a central state (which can be garbage-collected) but can also be attached to tuples streamed through operators.
A stream between operators takes one of two forms:
• \( t ; \tau' \): a tuple of fragments \( t \) followed by the rest of the stream \( \tau' \)
• eos: end-of-stream
Each stored XML algebraic operator has a streamed counterpart, e.g.:
\[
\begin{array}{lcll}
\sigma_{pred}(t ; \tau) & = & t ; \sigma_{pred}(\tau) & \text{if } pred \text{ is true for } t \\
\sigma_{pred}(t ; \tau) & = & \sigma_{pred}(\tau) & \text{otherwise} \\
\sigma_{pred}(eos) & = & eos &
\end{array}
\]
but we may not be able to validate pred due to holes in t.
Streamed Algebra Semantics
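• To keep the suspended fragments, each streamed algebraic operator has
– one state \( \Sigma_0 \) for the output and
– optional state(s) \( \Sigma_1 / \Sigma_2 \) for the input(s)
• The result of PATH may now be unspecified, where \( \Sigma \) is the central state:
\[
\mathrm{PATH}(\texttt{<hole id="m" …>},\ path) =
\begin{cases}
\mathrm{PATH}(\Sigma(m),\ path) & \text{if } m \in \Sigma \\
\{ \bot \} & \text{otherwise}
\end{cases}
\]
• When used in predicates, this requires 3-valued logic
• Tuples with incomplete fragments are suspended when necessary, e.g.:
\[
\begin{array}{lcll}
\sigma_{pred}(t ; \tau) & = & t ; \sigma_{pred}(\tau) & \text{if } true \in \mathrm{PATH}(t, pred) \\
\sigma_{pred}(t ; \tau) & = & \sigma_{pred}(\tau) & \text{otherwise} \\
\Sigma_0 & \leftarrow & \Sigma_0 \cup \{t\} & \text{if } \bot \in \mathrm{PATH}(t, pred)
\end{array}
\]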
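The suspension rule for a streamed selection can be sketched in Python with a three-valued predicate. This is a minimal sketch under assumptions: the names streamed_select, HOLE, and price_over_100 are hypothetical, the unknown truth value is modeled as None, and the operator's output state is a plain list. Re-evaluating suspended tuples once their fillers arrive is not shown.

```python
HOLE = object()  # marks a tuple component whose filler has not arrived yet

def streamed_select(pred, stream, suspended):
    """Yield tuples for which `pred` is True, park tuples whose outcome is
    unknown (None) in `suspended`, and drop tuples for which it is False."""
    for t in stream:
        verdict = pred(t)
        if verdict is True:
            yield t
        elif verdict is None:
            suspended.append(t)

def price_over_100(t):
    # Three-valued predicate: unknown while the price is still a hole.
    if t["price"] is HOLE:
        return None
    return t["price"] > 100

stream = [
    {"name": "PDA", "price": 315.25},
    {"name": "Calculator", "price": 50.25},
    {"name": "Phone", "price": HOLE},
]
state = []
passed = [t["name"] for t in streamed_select(price_over_100, stream, state)]
parked = [t["name"] for t in state]
```

Because the operator is a generator, each tuple is emitted as soon as its predicate can be decided, without waiting for the end of the stream.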
Join
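Much like a main-memory symmetric join.
• States:
– \( \Sigma_0 \): all suspended output tuples due to unfilled holes
– \( \Sigma_1 \): all tuples from the left stream
– \( \Sigma_2 \): all tuples from the right stream
• A tuple from the left stream:
\[
\begin{array}{l}
(t_1 ; \tau_1) \bowtie_{pred} \tau_2 = \{\, t_1 \circ t_2 \mid t_2 \in \Sigma_2,\ true \in \mathrm{PATH}(t_1 \circ t_2, pred) \,\} ; (\tau_1 \bowtie_{pred} \tau_2) \\
\Sigma_1 \leftarrow \Sigma_1 \cup \{ t_1 \} \\
\Sigma_0 \leftarrow \Sigma_0 \cup \{\, t_1 \circ t_2 \mid t_2 \in \Sigma_2,\ \bot \in \mathrm{PATH}(t_1 \circ t_2, pred) \,\}
\end{array}
\]
• A tuple from the right stream:
\[
\begin{array}{l}
\tau_1 \bowtie_{pred} (t_2 ; \tau_2) = \{\, t_1 \circ t_2 \mid t_1 \in \Sigma_1,\ true \in \mathrm{PATH}(t_1 \circ t_2, pred) \,\} ; (\tau_1 \bowtie_{pred} \tau_2) \\
\Sigma_2 \leftarrow \Sigma_2 \cup \{ t_2 \} \\
\Sigma_0 \leftarrow \Sigma_0 \cup \{\, t_1 \circ t_2 \mid t_1 \in \Sigma_1,\ \bot \in \mathrm{PATH}(t_1 \circ t_2, pred) \,\}
\end{array}
\]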
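A symmetric join with the three states can be sketched in Python as follows. This is an illustration only: the name symmetric_join is hypothetical, arrivals from the two streams are modeled as a single interleaved list tagged "L" or "R", the predicate is three-valued with None as unknown, and the sketch uses nested loops over the stored tuples rather than hash tables.

```python
def symmetric_join(pred, arrivals):
    """Join two interleaved streams.
    `arrivals` is a sequence of ("L", tuple) or ("R", tuple) in arrival order.
    Returns (output pairs, suspended pairs)."""
    left, right = [], []      # all tuples seen so far from each input
    suspended, output = [], []
    for side, t in arrivals:
        partners = right if side == "L" else left
        for p in partners:
            pair = (t, p) if side == "L" else (p, t)
            verdict = pred(*pair)
            if verdict is True:
                output.append(pair)      # emitted immediately
            elif verdict is None:
                suspended.append(pair)   # undecidable until holes are filled
        # Remember the new tuple so that later arrivals can join with it.
        (left if side == "L" else right).append(t)
    return output, suspended

def same_id(l, r):
    # Unknown while either id is missing (None stands in for an unfilled hole).
    if l["id"] is None or r["id"] is None:
        return None
    return l["id"] == r["id"]

arrivals = [("L", {"id": 1}), ("R", {"id": 2}), ("R", {"id": 1}),
            ("L", {"id": 2}), ("L", {"id": None})]
output, suspended = symmetric_join(same_id, arrivals)
matched_ids = [l["id"] for l, _ in output]
```

Each new tuple is probed only against the tuples already stored for the opposite side, so every matching pair is produced exactly once, regardless of arrival order.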
Reconstructing the XML Data
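\( \rho : \mathrm{set}(\mathrm{int} \times \text{XML-data}) \) is an environment that binds filler ids to XML.
\( x \triangleleft \rho \) replaces holes with fillers in \( x \) using the environment \( \rho \):
\[
\begin{array}{lcl}
\texttt{<A>}\, x\, \texttt{</A>} \triangleleft \rho & = & \texttt{<A>}\, x \triangleleft \rho\, \texttt{</A>} \\
(x_1\, x_2) \triangleleft \rho & = & (x_1 \triangleleft \rho)\, (x_2 \triangleleft \rho) \\
\texttt{<hole id="m" …>} \triangleleft \rho & = & \rho[m] \quad \text{if } m \in \rho \\
x \triangleleft \rho & = & x \quad \text{otherwise}
\end{array}
\]
\( R(\tau) \) returns a pair \( (a, \rho) \), where \( \rho \) is the environment and \( a \) is \( \rho[0] \) (the reconstructed data):
\[
\begin{array}{l}
R(eos) = (\emptyset, \emptyset) \\
\text{if } R(\tau) = (a, \rho) \text{ then } \\
R(\texttt{<filler id="m">}\, x\, \texttt{</filler>} ; \tau) =
\begin{cases}
(x \triangleleft \rho,\ \rho) & \text{if } m = 0 \\
(a \triangleleft \rho',\ \rho') & \text{if } m \neq 0, \text{ where } \rho' = \{ (m,\ x \triangleleft \rho) \} \cup \rho[m/x]
\end{cases}
\end{array}
\]
Basically, \( R(t ; \tau) = f(R(\tau)) \).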
Equivalence Between Stored & Streamed Algebras
If we reconstruct the XML document from the streamed fragments and evaluate a query using the stored algebra, we get the same result as when we use the equivalent streamed algebra over the streamed XML fragments and reconstruct the result.
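[Figure: commuting diagram. Top path: XML fragments, reconstruction, XML document, stored XML algebra, result. Bottom path: XML fragments, streamed XML algebra, XML fragments, reconstruction, result.]
Proof sketch: We prove \( R(\hat{p}(\tau)) = p(R(\tau)) \) inductively, where \( \hat{p} \) is the stream version of \( p \). If \( true \in \mathrm{PATH}(t, pred) \), then
\[
R(\hat{p}(t ; \tau)) = R(t ; \hat{p}(\tau)) = f(R(\hat{p}(\tau))) = f(p(R(\tau))) = p(f(R(\tau))) = p(R(t ; \tau)) \ \ldots
\]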
A Data Model for Temporal XML
Based on the Hole-Filler model, but:
• A fragment is now associated with a timestamp
• A hole may be associated with a sequence of fragments, say \( (\langle f_1, t_1 \rangle, \ldots, \langle f_n, t_n \rangle) \), sorted by timestamp \( t_i \)
– The \( i \)th version of this hole is \( f_i \)
– The “last” version is \( f_n \)
– The lifespan of the fragment \( f_i \) is \( [t_i, t_{i+1}] \), where \( t_{n+1} \) is “now”
– The snapshot XML data are derived by ignoring all but the last version
• Holes, fragments, and timestamps are hidden from clients
• The client sees a temporal view, which can be queried by XCQL
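The relationship between versions, lifespans, and the snapshot can be sketched in Python. This is a minimal sketch under assumptions: the function names lifespans, snapshot, and version are hypothetical, timestamps are plain integers, fragments are stand-in strings, and "now" is passed in explicitly.

```python
def lifespans(versions, now):
    """Given [(fragment, timestamp), ...] sorted by timestamp, return
    [(fragment, start, end), ...] where each version ends when the next
    one begins and the last version ends at `now`."""
    starts = [ts for _, ts in versions]
    ends = starts[1:] + [now]
    return [(frag, start, end)
            for (frag, start), end in zip(versions, ends)]

def snapshot(versions):
    # Snapshot view: ignore all but the last version.
    return versions[-1][0]

def version(versions, i):
    # The i-th version, counting from 1 as on the slide.
    return versions[i - 1][0]

# Three successive updates to the same hole.
versions = [("price=315.25", 10), ("price=299.00", 25), ("price=279.00", 40)]

spans = lifespans(versions, now=60)
latest = snapshot(versions)
first = version(versions, 1)
```

The temporal view corresponds to spans, while a client that never uses the temporal extensions effectively sees only latest.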
XCQL: Continuous Query Language for XML
• It is basically XQuery extended with interval and version projections
• Inspired by Stanford’s CQL (which is based on SQL)
• Without using the extensions, XCQL is equivalent to XQuery over the snapshot data
• Extensions:
1. Interval projection: e?[t1,t2] (shortcut: e?[t] = e?[t,t]), where t can be any XQuery time expression, including “now” and “start”
2. Version projection: e@[v1,v2] (shortcut: e@[v] = e@[v,v]), where v is any integer expression, including “last”
3. Valid time begin: vtFrom(e)
4. Valid time end: vtTo(e)
Example
• A network management system receives two streams from a backbone router for TCP connections: one for SYN packets and another for ACK packets that acknowledge their receipt. We want to identify the misbehaving packets that do not receive an acknowledgment within a minute:
for $s in stream("syn")//packet,
$a in stream("ack")//packet?[vtFrom($s)+1min,now]
where $s/id = $a/id
and $s/srcIP = $a/destIP
and $s/srcPort = $a/destPort
return <warning> { $s/id } </warning>
The Temporal View
Deriving the temporal view from the fragmented stream:
define function temporalize ( $tag as element()* )
{ for $e in $tag
  return if (not(empty($e/*)))
         then element {name($e)} {$e/@*, temporalize($e/*)}
         else if (name($e)="hole")
         then temporalize(get_fillers($e/@id))
         else $e
}
define function get_fillers ( $fid as xs:integer )
{ let $fillers := doc("fragments.xml")/fragments/filler[@id=$fid]
  for $f at $p in $fillers
  let $e := $f/*
  order by $f/@validTime
  return element {name($e)}
                 { $e/@*,
                   attribute vtFrom {$f/@validTime},
                   attribute vtTo { if ($p = count($fillers))
                                    then "now"
                                    else $fillers[$p+1]/@validTime },
                   $e/node() }
}
Translation of XCQL into XQuery
e?[tb,te] is translated into interval_projection(e,tb,te)
e@[vb,ve] is translated into version_projection(e,vb,ve)
define function interval_projection ( $e as element(),
                                      $tb as xs:time,
                                      $te as xs:time )
{ if (empty($e/@vtFrom))
  then element {name($e)}
               { for $c in $e/* return interval_projection($c,$tb,$te) }
  else if (not(interval_intersection($e/@vtFrom,$e/@vtTo,$tb,$te)))
  then ()
  else element {name($e)}
               { attribute vtFrom {max(($e/@vtFrom,$tb))},
                 attribute vtTo {min(($e/@vtTo,$te))},
                 for $c in $e/* return interval_projection($c,$tb,$te) }
}
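The interval test and the clipping step of interval projection can be sketched in Python on a flat list of versions. This is an illustration under assumptions: the slides do not define interval_intersection, so the closed-interval overlap test below is a guess at its intent, timestamps are plain integers, and the recursion over child elements is left out.

```python
def interval_intersection(vt_from, vt_to, tb, te):
    """True if the closed intervals [vt_from, vt_to] and [tb, te] overlap.
    (Closed intervals are an assumption of this sketch.)"""
    return vt_from <= te and tb <= vt_to

def interval_projection(versions, tb, te):
    """Keep the versions whose lifespan overlaps [tb, te] and clip each
    surviving lifespan to that window.
    `versions` is a list of (fragment, vt_from, vt_to)."""
    return [(frag, max(vt_from, tb), min(vt_to, te))
            for frag, vt_from, vt_to in versions
            if interval_intersection(vt_from, vt_to, tb, te)]

versions = [("v1", 10, 25), ("v2", 25, 40), ("v3", 40, 60)]
window = interval_projection(versions, 30, 45)
nothing = interval_projection(versions, 70, 80)
```

With the window [30, 45], the first version is dropped because its lifespan ends before the window starts, and the other two are clipped to the window boundaries.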
Recursion is Hard to Optimize
The recursion in temporalize, interval_projection, etc., can be eliminated if we know
– the complete schema, or
– the structural summary
<tag name="creditAccounts">
  <temporal name="account">
    <tag name="customer"/>
    <tag name="creditLimit"/>
    <event name="transaction">
      <tag name="vendor"/>
      <tag name="amount"/>
      <tag name="status"/>
    </event>
  </temporal>
</tag>
Fragmentation can only be done on temporal or event nodes.
Temporal: has lifespan [vtFrom,vtTo]
Event: occurs at one point in time (vtFrom=vtTo)
Schema-Based Mapping
define function temporalizeCreditAccounts ( $e1 as element() ) as element()
{ <creditAccounts>
  { for $e2 in $e1/hole,
        $e3 in get_fillers($e2/@id)
    return <account>
           { $e3/customer, $e3/type, $e3/creditLimit,
             for $e4 in $e3/hole,
                 $e5 in get_fillers($e4/@id)
             return <transaction>
                    { $e5/vendor, $e5/amount, $e5/status }
                    </transaction> }
           </account> }
  </creditAccounts>
}
Example
Query:
doc("creditSystem.xml")/account/transaction[amount > 1000]
Default translation:
get_fillers_list(get_fillers_list(get_fillers_list(0)/account/hole/@id)/transaction/hole/@id)[amount > 1000]
Using schema-based translation:
temporalizeCreditAccounts(get_fillers(0))/account/transaction[amount > 1000]
Optimized (optimistic) translation:
doc("fragments.xml")/fragments/filler/transaction[amount > 1000]
Future Work
• Optimal fragmentation and scheduling of fragments based on client profiles
• Query optimization of XCQL
• Design main memory evaluation techniques for XML fragments
• Implement the framework!
• Application domain: network management