
XStreamCast OGI 11/12/03

XStreamCast: Broadcasting and Query Processing of Streamed XML

Leonidas Fegaras

University of Texas at Arlington


The XStreamCast Group

Faculty:

Leonidas Fegaras

David Levine

PhD Students:

Sujoe Bose

Weimin He

Hao Zhou

Tejas Shah

Masters Students:

Vamsi K. Chaluvadi

Darsan Tatineni

Sravani Reddy

Funded by NSF (will start on 1/1/04).

Web page: http://lambda.uta.edu/XStreamCast/


The XStreamCast Architecture

Most web servers are pull-based: a client submits a request and the server returns the requested data. This does not scale well when a very large number of clients request similar query results.

Push-based dissemination: a server multicasts a stream of data to registered clients.

In our framework:
• A client registers with a server using a pull-based web service
• A server multicasts data to registered clients in a continuous stream
  – Data are often derived by merging multiple input streams (e.g., sensor data)
  – The server does not have any knowledge about the client queries
  – The only task performed by the server is slicing, scheduling, and multicasting data:
    • Critical data may be repeated more often than non-critical data
    • Invalid data may be revoked
    • New updates may be broadcast as soon as they become available
• A client connects to multiple streams and evaluates continuous queries locally
  – It doesn’t register queries with the servers
  – All processing is done at the client side
  – No handshaking, no error-correction


The XStreamCast Data Model

• Based on XML rather than on flat relational data
• The server slices an XML data source into XML fragments. Each fragment:
  – is a filler that fills a hole
  – may contain holes, which can be filled by other fragments
  – is wrapped with control information, such as its unique hole ID, the path that reaches this fragment, etc.
• Hole IDs
  – are similar to surrogates but are hidden from clients
  – are less restrictive than hierarchical key structures
• A continuous stream consists of a fragmented XML data source followed by continuous updates
  – The unit of update is a fragment
  – Snapshot view: a hole ID is associated with the latest update
  – Temporal view: a hole ID is associated with the sequence of all updates


The Fragmented Hole-Filler Model

<commodities>
  <vendor>
    <name> Wal-Mart </name>
    <items>
      <stream:hole id="10" tsid="5"/>
      <stream:hole id="20" tsid="5"/>
      ...
    </items>
  </vendor>
  ...
</commodities>

<stream:filler id="10" tsid="5">
  <item>
    <name> PDA </name>
    <make> HP </make>
    <model> PalmPilot </model>
    <price currency="USD">315.25</price>
  </item>
</stream:filler>

<stream:filler id="20" tsid="5">
  <item>
    <name> Calculator </name>
    <make> Casio </make>
    <model> FX-100 </model>
    <price currency="USD">50.25</price>
  </item>
</stream:filler>
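To make the hole-filler idea concrete, here is a minimal Python sketch of how a client might splice received fillers into the holes of a fragment to obtain a snapshot. It is an illustration only, not the XStreamCast implementation: the namespace URI `urn:example:stream` and the helper names `index_fillers` and `fill_holes` are assumptions, since the slides do not define them.

```python
import xml.etree.ElementTree as ET

NS = "urn:example:stream"            # assumed namespace URI; the slides do not give one
HOLE = f"{{{NS}}}hole"

def index_fillers(filler_elements):
    """Map each filler id to the list of elements wrapped by that filler."""
    return {f.get("id"): list(f) for f in filler_elements}

def fill_holes(element, fillers):
    """Copy `element`, replacing each hole whose id is in `fillers` by the filler content."""
    copy = ET.Element(element.tag, dict(element.attrib))
    copy.text, copy.tail = element.text, element.tail
    for child in element:
        if child.tag == HOLE and child.get("id") in fillers:
            for part in fillers[child.get("id")]:
                copy.append(fill_holes(part, fillers))   # fillers may contain holes too
        else:
            copy.append(fill_holes(child, fillers))      # unfilled holes are kept as-is
    return copy

document = ET.fromstring(
    f'<commodities xmlns:stream="{NS}"><vendor><name>Wal-Mart</name><items>'
    '<stream:hole id="10" tsid="5"/><stream:hole id="20" tsid="5"/>'
    '</items></vendor></commodities>')
filler_10 = ET.fromstring(
    f'<stream:filler xmlns:stream="{NS}" id="10" tsid="5">'
    '<item><name>PDA</name><make>HP</make></item></stream:filler>')

# Only filler 10 has arrived, so hole 20 stays in the snapshot.
snapshot = fill_holes(document, index_fillers([filler_10]))
```

In this sketch a hole with no known filler is simply left in place, which mirrors the situation where part of the document has not been broadcast yet.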


Query Processing

A client opens connections to streams and evaluates XQueries against these streams

• The data view at the client side is the unfragmented data source
• For large streams, it is a bad idea to reconstruct the streamed data in the client’s memory
  – Fragments need to be processed as soon as they become available from the server
• Some operators block or require unbounded memory:
  – Sorting
  – Joins between two streams or self-joins
  – Group-by with aggregation


Rest of the Talk

• An algebra for stored XML data

• An algebra for streamed XML data (snapshot view)

• The XCQL query language for querying time-varying streamed XML data (temporal view)

• Schema-based translation of XCQL


An Algebra for Stored XML Data

Based on the nested-relational algebra:

v(T) access the XML data source T using v

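\( \sigma_{\mathit{pred}}(X) \): select fragments from X that satisfy pred

\( \pi_{v_1,\ldots,v_n}(X) \): project

\( X \cup Y \): merge

\( X \bowtie_{\mathit{pred}} Y \): join

\( \mu_{\mathit{pred}}^{v,\mathit{path}}(X) \): unnest (retrieve descendents of elements)

\( \Delta_{\mathit{pred}}^{\oplus,h}(X) \): apply h and reduce by \( \oplus \)

\( \Gamma_{\mathit{gs},\mathit{pred}}^{v,\oplus,h}(X) \): group-by gs, apply h to each group, and reduce each group by \( \oplus \)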


Semantics

v(T) = { < v = T > }

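\( \sigma_{\mathit{pred}}(X) = \{\, t \mid t \in X,\ \mathit{pred}(t) \,\} \)

\( \pi_{v_1,\ldots,v_n}(X) = \{\, \langle v_1 = t.v_1, \ldots, v_n = t.v_n \rangle \mid t \in X \,\} \)

\( X \cup Y = X \mathbin{+\!\!+} Y \)

\( X \bowtie_{\mathit{pred}} Y = \{\, t_x \circ t_y \mid t_x \in X,\ t_y \in Y,\ \mathit{pred}(t_x, t_y) \,\} \)

\( \mu_{\mathit{pred}}^{v,\mathit{path}}(X) = \{\, t \circ \langle v = w \rangle \mid t \in X,\ w \in \mathrm{PATH}(t, \mathit{path}),\ \mathit{pred}(t, w) \,\} \)

\( \Delta_{\mathit{pred}}^{\oplus,h}(X) = \oplus/\{\, h(t) \mid t \in X,\ \mathit{pred}(t) \,\} \)

\( \Gamma_{\mathit{gs},\mathit{pred}}^{v,\oplus,h}(X) = \ldots \)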
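The set-comprehension semantics above map almost directly onto list comprehensions. The following Python sketch is one way to read them, assuming that tuples are dictionaries from variable names to values and that tuple concatenation is dictionary merging. The function names and the sample data are illustrative, not part of the talk.

```python
from functools import reduce as fold

def select(pred, X):
    """Selection: keep the tuples that satisfy pred."""
    return [t for t in X if pred(t)]

def project(names, X):
    """Projection: keep only the listed variables of each tuple."""
    return [{v: t[v] for v in names} for t in X]

def merge(X, Y):
    """Merge: list concatenation (X ++ Y)."""
    return X + Y

def join(pred, X, Y):
    """Join: concatenate every pair of tuples that satisfies pred."""
    return [{**tx, **ty} for tx in X for ty in Y if pred(tx, ty)]

def unnest(v, path_fn, pred, X):
    """Unnest: extend t with v=w for each w reached from t by path_fn."""
    return [{**t, v: w} for t in X for w in path_fn(t) if pred(t, w)]

def reduce_by(op, unit, h, pred, X):
    """Reduce: apply h to each qualifying tuple and fold the results with op."""
    return fold(op, (h(t) for t in X if pred(t)), unit)

books = [{"b": {"title": "A", "year": 1994, "authors": ["x", "y"]}},
         {"b": {"title": "B", "year": 1990, "authors": ["z"]}}]

recent = select(lambda t: t["b"]["year"] > 1991, books)
by_author = unnest("a", lambda t: t["b"]["authors"], lambda t, w: True, books)
count = reduce_by(lambda x, y: x + y, 0, lambda t: 1, lambda t: True, books)
```

Lists are used instead of sets so that document order is preserved, which is a simplification of this sketch.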


XPath Expressions

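• Path evaluation is central to the algebra:

\( \mathrm{PATH} : (\text{XML-data},\ \text{simple-XPath}) \rightarrow \mathrm{set}(\text{XML-data}) \)

• Some rules for stored XML data:

\( \mathrm{PATH}(\texttt{<A>}\,x\,\texttt{</A>},\ A/\mathit{path}) = \mathrm{PATH}(x,\ \mathit{path}) \)

\( \mathrm{PATH}(\texttt{<A>}\,x\,\texttt{</A>},\ A) = \{\, \texttt{<A>}\,x\,\texttt{</A>} \,\} \)

\( \mathrm{PATH}(x_1\ x_2,\ \mathit{path}) = \mathrm{PATH}(x_1,\ \mathit{path}) \cup \mathrm{PATH}(x_2,\ \mathit{path}) \)

\( \mathrm{PATH}(x,\ \mathit{path}) = \emptyset \) otherwise

• Predicates have existential semantics

\( \texttt{\$v/A/B = "text"} \;\equiv\; \exists x \in \mathrm{PATH}(v,\ A/B) : x = \texttt{"text"} \)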
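The PATH rules can be read as a small recursive function. The sketch below assumes a deliberately simple data representation that is not from the talk: XML data is a list of nodes, an element is a `(tag, children)` pair, and text is a plain string. The names `path` and `exists_equal` are likewise hypothetical.

```python
def path(data, steps):
    """PATH(data, steps): `data` is a list of nodes, `steps` a non-empty list of tag names."""
    results = []
    for node in data:                      # PATH(x1 x2, p) is the union over x1 and x2
        if isinstance(node, str):
            continue                       # text matches no step: empty result
        tag, children = node
        if tag != steps[0]:
            continue                       # the "otherwise" rule: empty result
        if len(steps) == 1:
            results.append(node)           # PATH(<A>x</A>, A) = { <A>x</A> }
        else:
            results.extend(path(children, steps[1:]))   # PATH(<A>x</A>, A/p) = PATH(x, p)
    return results

def exists_equal(data, steps, text):
    """Existential predicate: some element reached by `steps` has exactly `text` as content."""
    return any(children == [text] for _, children in path(data, steps))

bib = [("bib", [
    ("book", [("title", ["TCP/IP"]), ("publisher", ["Addison-Wesley"])]),
    ("book", [("title", ["Data on the Web"]), ("publisher", ["Morgan Kaufmann"])]),
])]

titles = path(bib, ["bib", "book", "title"])
has_aw = exists_equal(bib, ["bib", "book", "publisher"], "Addison-Wesley")
```

The result is a list in document order rather than a set, and only child steps are handled, matching the "simple XPath" restriction above.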


Transforming XQueries to the Algebra

Transformation steps:

1. XQueries to list comprehensions
   – XPath terms to simple paths without predicates
2. Normalization of nested comprehensions
   – Generator domains are normalized into simple path expressions
3. List comprehension to XML Algebra


Example #1

where


Example #1 (cont.)

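[Figure: algebraic plan for Example #1. Recoverable labels, from the source upward:]
• document("http://www.bn.com") bound to $v
• $v/bib/book bound to $b
• predicate: $b/publisher="Addison-Wesley" and $b/@year > 1991
• result: element("book",$b/title)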


Example #2

for $u in document("users.xml")//user_tuple
return <user> { $u/name }
         { for $b in document("bids.xml")//bid_tuple[userid=$u/userid]/itemno,
               $i in document("items.xml")//item_tuple[itemno=$b]
           return <bid> { $i/description/text() } </bid>
           sortby(.) }
       </user>
sortby(name)

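[Figure: algebraic plan for Example #2. Recoverable labels:]
• document("users.xml") bound to $us; $us/users/user_tuple bound to $u
• document("bids.xml") bound to $bs; $bs/bids/bid_tuple bound to $c; $c/itemno bound to $b; predicate $c/userid=$u/userid
• document("items.xml") bound to $is; $is/items/item_tuple bound to $i; predicate $i/itemno=$b
• inner result: sort, elem("bid",$i/description/text())
• outer result: sort($u/name), elem("user",$u/name ++ …)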


Algebraic Optimization

• Optimizing query expressions as in relational algebra

• Query unnesting
  – Nested queries executed in nested loop fashion
  – Not possible in stream based processing
  – Blocking operators replaced with non-blocking outer versions


The Streamed XML Algebra

Much like the stored XML algebra, but works on streams.

The streams between operators are streams of tuples with fragments as tuple components.

An input fragment is stored on a central state (which can be garbage-collected) but can also be attached to tuples streamed through operators.

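A stream \( \omega \) between operators takes one of the forms:

• \( t ; \omega' \): a tuple of fragments t followed by the rest of the stream \( \omega' \)
• eos: end-of-stream

Each stored XML algebraic operator has a streamed counterpart, e.g.:

\( \sigma_{\mathit{pred}}(t ; \omega) = t ; \sigma_{\mathit{pred}}(\omega) \) if pred is true for t

\( \sigma_{\mathit{pred}}(t ; \omega) = \sigma_{\mathit{pred}}(\omega) \) otherwise

\( \sigma_{\mathit{pred}}(\mathrm{eos}) = \mathrm{eos} \)

but we may not be able to validate pred due to holes in t.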


Streamed Algebra Semantics

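• To keep the suspended fragments, each streamed algebraic operator has
  – one state \( \theta_0 \) for the output and
  – optional state(s) \( \theta_1 / \theta_2 \) for the input(s)
• The result of PATH may now be unspecified (written \( \bot \)), where \( \theta \) is the central state holding the fillers received so far:

\( \mathrm{PATH}(\texttt{<hole id="m" …>},\ \mathit{path}) = \mathrm{PATH}(\theta(m),\ \mathit{path}) \) if \( m \in \theta \)

\( \mathrm{PATH}(\texttt{<hole id="m" …>},\ \mathit{path}) = \{ \bot \} \) otherwise

• When used in predicates, PATH requires 3-valued logic
• Tuples with incomplete fragments are suspended when necessary, e.g., for a tuple t followed by the rest of the stream \( \omega \):

\( \sigma_{\mathit{pred}}(t ; \omega) = t ; \sigma_{\mathit{pred}}(\omega) \) if \( \mathit{true} \in \mathrm{PATH}(t, \mathit{pred}) \)

\( \sigma_{\mathit{pred}}(t ; \omega) = \sigma_{\mathit{pred}}(\omega) \) otherwise

\( \theta_0 \leftarrow \theta_0 \cup \{t\} \) if \( \bot \in \mathrm{PATH}(t, \mathit{pred}) \)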
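The suspension rule can be sketched as a Python generator over a stream of tuples. This is an illustration under stated assumptions, not the system's code: a predicate returns `True`, `False`, or `None` (standing for "unknown because of an unfilled hole"), the `HOLE` sentinel marks a missing value, and the `suspended` list plays the role of the operator's output state.

```python
HOLE = object()   # sentinel standing for a hole that has not been filled yet

def price_above(limit):
    """Three-valued predicate: True, False, or None when the price is still a hole."""
    def pred3(t):
        if t["price"] is HOLE:
            return None
        return t["price"] > limit
    return pred3

def streamed_select(pred3, stream, suspended):
    """Yield tuples for which pred3 is True; park undecidable tuples in `suspended`."""
    for t in stream:
        verdict = pred3(t)
        if verdict is True:
            yield t
        elif verdict is None:
            suspended.append(t)      # to be re-examined once the missing filler arrives
        # verdict is False: the tuple is dropped

stream = [{"name": "PDA", "price": 315.25},
          {"name": "Phone", "price": HOLE},
          {"name": "Calculator", "price": 50.25}]
suspended = []
passed = list(streamed_select(price_above(100), stream, suspended))
```

Because the operator is a generator, each tuple is handled as soon as it arrives and nothing blocks on the suspended ones. How and when suspended tuples are resumed is left out of this sketch.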


Join

Much like main-memory symmetric join

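• states:
  – \( \theta_0 \): all suspended output tuples due to unfilled holes
  – \( \theta_1 \): all tuples from left stream
  – \( \theta_2 \): all tuples from right stream

• a tuple \( t_1 \) from the left stream \( \omega_1 \):

\( (t_1 ; \omega_1) \bowtie_{\mathit{pred}} \omega_2 = \{\, t_1 \circ t_2 \mid t_2 \in \theta_2,\ \mathit{true} \in \mathrm{PATH}(t_1 \circ t_2, \mathit{pred}) \,\} ; (\omega_1 \bowtie_{\mathit{pred}} \omega_2) \)

\( \theta_1 \leftarrow \theta_1 \cup \{ t_1 \} \)

\( \theta_0 \leftarrow \theta_0 \cup \{\, t_1 \circ t_2 \mid t_2 \in \theta_2,\ \bot \in \mathrm{PATH}(t_1 \circ t_2, \mathit{pred}) \,\} \)

• a tuple \( t_2 \) from the right stream \( \omega_2 \):

\( \omega_1 \bowtie_{\mathit{pred}} (t_2 ; \omega_2) = \{\, t_1 \circ t_2 \mid t_1 \in \theta_1,\ \mathit{true} \in \mathrm{PATH}(t_1 \circ t_2, \mathit{pred}) \,\} ; (\omega_1 \bowtie_{\mathit{pred}} \omega_2) \)

\( \theta_2 \leftarrow \theta_2 \cup \{ t_2 \} \)

\( \theta_0 \leftarrow \theta_0 \cup \{\, t_1 \circ t_2 \mid t_1 \in \theta_1,\ \bot \in \mathrm{PATH}(t_1 \circ t_2, \mathit{pred}) \,\} \)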
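A main-memory symmetric join with a suspension state might look like the following Python sketch. The class and field names are hypothetical, tuples are dictionaries, and the predicate is three-valued as before (`None` meaning it cannot be decided yet because of an unfilled hole).

```python
class SymmetricJoin:
    """Sketch of a symmetric join that keeps both inputs and a list of suspended pairs."""

    def __init__(self, pred3):
        self.pred3 = pred3
        self.left = []        # all tuples seen on the left stream
        self.right = []       # all tuples seen on the right stream
        self.suspended = []   # joined tuples whose predicate is still unknown

    def _probe(self, pairs):
        matches = []
        for t1, t2 in pairs:
            verdict = self.pred3(t1, t2)
            if verdict is True:
                matches.append({**t1, **t2})
            elif verdict is None:
                self.suspended.append({**t1, **t2})
        return matches

    def push_left(self, t1):
        """Store a left tuple, then probe it against every stored right tuple."""
        self.left.append(t1)
        return self._probe([(t1, t2) for t2 in self.right])

    def push_right(self, t2):
        """Store a right tuple, then probe it against every stored left tuple."""
        self.right.append(t2)
        return self._probe([(t1, t2) for t1 in self.left])

def same_id(t1, t2):
    if t2["ack_id"] is None:          # None models a value hidden behind a hole
        return None
    return t1["syn_id"] == t2["ack_id"]

join = SymmetricJoin(same_id)
results = []
results += join.push_left({"syn_id": 1})
results += join.push_right({"ack_id": 1})
results += join.push_right({"ack_id": None})
results += join.push_left({"syn_id": 2})
```

Both input lists grow without bound in this sketch, which is exactly the unbounded-memory concern raised earlier for joins between streams.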


Reconstructing the XML Data

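\( \rho : \mathrm{set}(\mathrm{int} \times \text{XML-data}) \) is an environment that binds filler ids to XML.

\( x \triangleleft \rho \) replaces holes with fillers in x using the environment \( \rho \):

\( \texttt{<A>}\,x\,\texttt{</A>} \triangleleft \rho = \texttt{<A>}\,(x \triangleleft \rho)\,\texttt{</A>} \)

\( (x_1\ x_2) \triangleleft \rho = (x_1 \triangleleft \rho)\ (x_2 \triangleleft \rho) \)

\( \texttt{<hole id="m" …>} \triangleleft \rho = \rho[m] \) if \( m \in \rho \)

\( x \triangleleft \rho = x \) otherwise

\( R(\omega) \) returns a pair \( (a, \rho) \), where a is \( \rho[0] \) (the reconstructed data):

\( R(\mathrm{eos}) = (\emptyset, \emptyset) \)

if \( R(\omega) = (a, \rho) \) then

\[ R(\texttt{<filler id="m">}\,x\,\texttt{</filler>} ; \omega) = \begin{cases} (x \triangleleft \rho,\ \rho) & \text{if } m = 0 \\ (a \triangleleft \rho',\ \rho') & \text{if } m \neq 0 \end{cases} \qquad \text{where } \rho' = \{ (m,\ x \triangleleft \rho) \} \cup \rho[m/x] \]

Basically, \( R(t ; \omega) = f(R(\omega)) \)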


Equivalence Between Stored & Streamed Algebras

If we reconstruct the XML document from the streamed fragments and evaluate a query using the stored algebra, we get the same result as when we use the equivalent streamed algebra over the streamed XML fragments and reconstruct the result.

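[Figure: commutative diagram. Top path: XML fragments → (reconstruction) → XML document → (stored XML algebra) → result. Bottom path: XML fragments → (streamed XML algebra) → XML fragments → (reconstruction) → result.]

Proof sketch: We prove \( R(\hat{p}(\omega)) = p(R(\omega)) \) inductively over a stream \( \omega \), where \( \hat{p} \) is the stream version of p. If \( \mathit{true} \in \mathrm{PATH}(t, \mathit{pred}) \), then

\( R(\hat{p}(t ; \omega)) = R(t ; \hat{p}(\omega)) = f(R(\hat{p}(\omega))) = f(p(R(\omega))) = p(f(R(\omega))) = p(R(t ; \omega)) \) …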


A Data Model for Temporal XML

Based on Hole-Filler model but:

• A fragment is now associated with a timestamp

• A hole may be associated with a sequence of fragments, say \( (\langle f_1, t_1 \rangle, \ldots, \langle f_n, t_n \rangle) \), sorted by timestamp \( t_i \)
  – The i-th version of this hole is \( f_i \)
  – The “last” version is \( f_n \)
  – The lifespan of the fragment \( f_i \) is \( [t_i, t_{i+1}] \), where \( t_{n+1} \) is “now”
  – The snapshot XML data are derived by ignoring all but the last version

• Holes, fragments, and timestamps are hidden from clients

• The client sees a temporal view, which can be queried by XCQL
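The version and lifespan bookkeeping for a single hole can be sketched in a few lines of Python. The representation is an assumption made for illustration: each version is a `(fragment, timestamp)` pair with numeric timestamps, and the string `"now"` stands for the open end of the last lifespan.

```python
def lifespans(versions, now="now"):
    """Return (fragment, vt_from, vt_to) triples for the versions of one hole.

    The versions are sorted by timestamp; version i is valid until the timestamp
    of version i+1, and the last version is valid until `now`.
    """
    ordered = sorted(versions, key=lambda v: v[1])
    ends = [t for _, t in ordered[1:]] + [now]
    return [(f, start, end) for (f, start), end in zip(ordered, ends)]

def snapshot(versions):
    """Snapshot view: ignore all but the last version (None if no version exists)."""
    spans = lifespans(versions)
    return spans[-1][0] if spans else None

history = [("<price>320.00</price>", 20), ("<price>315.25</price>", 10)]
spans = lifespans(history)
latest = snapshot(history)
```

Here `spans` is the temporal view of the hole and `latest` is its snapshot view.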


XCQL: Continuous Query Language for XML

• It is basically XQuery extended with interval and version projections

• Inspired by Stanford’s CQL (which is based on SQL)

• Without using the extensions, XCQL is equivalent to XQuery over the snapshot data

• Extensions:
  1. Interval projection: e?[t1,t2] (shortcut: e?[t] = e?[t,t]), where t can be any XQuery time expression, including “now” and “start”
  2. Version projection: e@[v1,v2] (shortcut: e@[v] = e@[v,v]), where v is any integer expression, including “last”
  3. Valid time begin: vtFrom(e)
  4. Valid time end: vtTo(e)


Example

• A network management system receives two streams from a backbone router for TCP connections: one for SYN packets and another for the ACK packets that acknowledge their receipt. We want to identify the misbehaving packets that do not receive an acknowledgment within a minute:

for $s in stream("syn")//packet,

$a in stream("ack")//packet?[vtFrom($s)+1min,now]

where $s/id = $a/id

and $s/srcIP = $a/destIP

and $s/srcPort = $a/destPort

return <warning> { $s/id } </warning>


The Temporal View

Deriving the temporal view from the fragmented stream:

define function temporalize ( $tag as element()* )
{ for $e in $tag
  return if (not(empty($e/*)))
         then element {name($e)} {$e/@*, temporalize($e/*)}
         else if (name($e)="hole")
         then temporalize(get_fillers($e/@id))
         else $e
}

define function get_fillers ( $fid as xs:integer )
{ let $fillers := doc("fragments.xml")/fragments/filler[@id=$fid]
  for $f at $p in $fillers
  let $e := $f/*
  order by $f/@validTime
  return element {name($e)}
                 { $e/@*,
                   attribute vtFrom {$f/@validTime},
                   attribute vtTo { if ($p = count($fillers))
                                    then "now"
                                    else $fillers[$p+1]/@validTime },
                   $e/node() }
}


Translation of XCQL into XQuery

e?[tb,te] is translated into interval_projection(e,tb,te)

e@[vb,ve] is translated into version_projection(e,vb,ve)

define function interval_projection ($e as element(), $tb as xs:time, $te as xs:time)
{ if (empty($e/@vtFrom))
  then element {name($e)}
               { for $c in $e/* return interval_projection($c,$tb,$te) }
  else if (not(interval_intersection($e/@vtFrom,$e/@vtTo,$tb,$te)))
  then ()
  else element {name($e)}
               { attribute vtFrom {max(($e/@vtFrom,$tb))},
                 attribute vtTo {min(($e/@vtTo,$te))},
                 for $c in $e/* return interval_projection($c,$tb,$te) }
}
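The clipping logic of interval projection can also be sketched in Python. This is a simplified illustration, not a translation of the system: elements are dictionaries with a `name`, an optional `children` list, and optional numeric `vtFrom`/`vtTo` keys, and the helper `intersects` is an assumed stand-in for `interval_intersection`.

```python
def intersects(e, tb, te):
    """True if the lifespan [vtFrom, vtTo] of e overlaps the interval [tb, te]."""
    return e["vtFrom"] <= te and tb <= e["vtTo"]

def interval_projection(e, tb, te):
    """Clip e and its descendants to [tb, te]; return None if e lies outside it."""
    temporal = "vtFrom" in e
    if temporal and not intersects(e, tb, te):
        return None
    projected = (interval_projection(c, tb, te) for c in e.get("children", []))
    result = {"name": e["name"], "children": [p for p in projected if p is not None]}
    if temporal:
        result["vtFrom"] = max(e["vtFrom"], tb)   # clip the start of the lifespan
        result["vtTo"] = min(e["vtTo"], te)       # clip the end of the lifespan
    return result

account = {"name": "account", "vtFrom": 0, "vtTo": 50, "children": [
    {"name": "transaction", "vtFrom": 5, "vtTo": 5},
    {"name": "transaction", "vtFrom": 30, "vtTo": 30},
]}
clipped = interval_projection(account, 20, 40)
```

Elements without a lifespan are kept and only their children are filtered, which matches the first branch of the function above.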


Recursion is Hard to Optimize

The recursion in temporalize, interval_projection, etc., can be eliminated if we know
– the complete schema, or
– the structural summary

<tag name="creditAccounts">
  <temporal name="account">
    <tag name="customer"/>
    <tag name="creditLimit"/>
    <event name="transaction">
      <tag name="vendor"/>
      <tag name="amount"/>
      <tag name="status"/>
    </event>
  </temporal>
</tag>

Fragmentation can only be done on temporal or event nodes.

Temporal: has lifespan [vtFrom,vtTo]

Event: occurs at one point of time (vtFrom=vtTo)


Schema-Based Mapping

define function temporalizeCreditAccounts ( $e1 as element() ) as element()
{ <creditAccounts>
  { for $e2 in $e1/hole,
        $e3 in get_fillers($e2/@id)
    return <account>
           { $e3/customer, $e3/type, $e3/creditLimit,
             for $e4 in $e3/hole,
                 $e5 in get_fillers($e4/@id)
             return <transaction>
                    { $e5/vendor, $e5/amount, $e5/status }
                    </transaction> }
           </account> }
  </creditAccounts>
}


Example

Query:
doc("creditSystem.xml")/account/transaction[amount > 1000]

Default translation:
get_fillers_list(get_fillers_list(get_fillers_list(0)/account/hole/@id)/transaction/hole/@id)[amount > 1000]

Using schema-based translation:
temporalizeCreditAccounts(get_fillers(0))/account/transaction[amount > 1000]

Optimized (optimistic) translation:
doc("fragments.xml")/fragments/filler/transaction[amount > 1000]


Future Work

• Optimal fragmentation and scheduling of fragments based on client profiles

• Query optimization of XCQL

• Design main memory evaluation techniques for XML fragments

• Implement the framework!

• Application domain: network management