niagaracq : a scalable continuous query system for internet databases (modified slides available on...
TRANSCRIPT
NiagaraCQ : A Scalable Continuous Query System for Internet Databases (modified slides available on course webpage)
Jianjun Chen et al Computer Sciences Dept. University of Wisconsin-Madison
SIGMOD 2000 Talk by Naresh Kumar
Outline Motivation
What is NiagaraCQ ?
General strategy of incremental group optimization
Query split scheme with materialized intermediate files
Incremental grouping of selection and join operators
Experimental Details
Data Stream Management System(DSMS)
The two fundamental difference btn DBMS and DSMS
In addition to managing traditional stored data, DSMS must handle multiple continuous, unbounded, possibly rapid data streams
A DSMS supports long running continuous queries.
Continuous Queries(CQ)
Continuous queries are persistent queries that allow user to receive results when they become available.Example
Notify me when ever price of Dell stock drops by more than 5%A broad classification
Change based Timer based
Motivation Continuous queries are growingly popular.
Notifies user with out querying repeatedly.
Much useful in internet where information changes frequently
Challenges: Need to be able to support million of
queries due to scale of internet.
Solution - NiagaraCQ
Previous Attempts Previous group optimization efforts focused on finding an optimal plan for a small number of similar queries Not applicable to a continuous query system for the following reasons: Computationally too expensive to
handle a large number of queries. Not designed for an environment like
the web where CQ s are added or removed dynamically.
What is NiagaraCQ ? NiagaraCQ–A CQ system for the internet
It is built on the assumption that Many queries tend to be similar to one
another. Similar queries can be grouped
together
It supports scalable continuous query processing over multiple, distributed XML files
NiagaraCQ - Approaches Advantages of grouping : Grouped queries can share
computation. They can reside in memory saving
IO-cost Avoid unnecessary invocations by
testing many CQ together Handles both change based and timer based queries in a uniform way To ensure scalability: Incremental evaluation of CQ's Memory caching
NiagaraCQ command language
Creating a CQCreate CQ_name XML-QL queryDo action { START start_time} { EVERY
time_interval} { EXPIRE expiration_time}
Delete CQ_name
Expression SignatureQuery examples Where <Quotes><Quote><Symbol>INTC</></></> element_as $g in “http://www.stock.com/quotes.xml”
construct $g
Where <Quotes><Quote><Symbol>MSFT</></></> element_as $g in “http://www.stock.com/quotes.xml”
construct $g
Expression signatures
Quotes.Quote.Symbol in quotes.xml
constant
=
Query plans
Trigger Action I Trigger Action J
File Scan
Select Symbol = “MSFT”
Select Symbol = “INTC”
File Scan
quotes.xml quotes.xml
Group
Group – signature, constant table, planGroup Signature
Common signature of all queries in the group
Group constant table
Dest_buffer Constant_value
Dest. J MSFT
Dest. I INTC
The group plan
Incremental Grouping Algo
When a new query is submittedIf the expression signature of the new query matches that of existing groups
Break the query plan into two partsRemove the lower partAdd the upper part onto the group
planelse create a new group
New Query**AOL added to Constant Table
**new destination buffer allocated**Matching process continues until top
Query split scheme
Matching Process will continue on the remainder of query plan until top of plan is reached
Thus, each continuous query is split into several smaller queries such that inputs of each of these queries are monitored using the same techniques that are used for the inputs of user defined continuous queries.
Incremental group optimization is very efficient because it only requires one traversal of query
Query split with materialized intermediate files
Destination buffer for the split operator can be implemented in Pipelined scheme Intermediate file scheme
Disadvantages of pipelined scheme It does not work for grouping timer based queries Gives a single complicated execution plan Combined plan may be very large and require
resources beyond limit of system A large portion of query plan may not need to be
executed at each invocation Split operator may block simple queries
Query split with materialized intermediate files(cont...)
Using intermediate files Split operator writes each output stream to a file Cut query plan into 2 parts at split operator Add a file scan operator to upper part to read
intermediate file Intermediate files are monitored just like other
data sources Intermediate file names are stored in the constant
table Grouped CQ with same constant share same
intermediate file
The query split scheme
Trade-offsOther advantages of materialized intermediate files
Only the necessary queries are executed. Thus computation time reduced
Tree structured query format – can easily scheduled and executed by general query engine
Uniform handling of intermediate files and original data source files
Bottle neck problem is avoided
Disadvantages Split operator becomes a blocking operator Extra disk I/Os
Range Predicates
E.g. R.a < val or val1 < R.a < val2Multiple such rangesProblem
Intermediate files may contain duplicate tuples
Idea: Virtual intermediate files Use an index to implement this
Incremental grouping of selection predicates
Multiple selection predicates in a query CNF for predicates on same data source Incremental grouping
Choose the most selective conjunct and implement virtual file on this conjunct
Example queryWhere <Quotes><Quote><Symbol>”INTC”</> <Current_Price>$p</></> element_as $g </> in “quotes.xml”, $p < 100Construct $g
Incremental grouping of join operators
Join operators are usually expensive, sharing common join operations can significantly reduce the amount of computation.A join queryQuotes.Quote.Change_Ratio constant in “quotes.xml”
Where <Quotes><Quote><Symbol>$s</></>element_as $g </> in “quotes.xml”,
<Companies><Company><Symbol>$s</></>element_as $t</> in “companies.xml”
construct $g, $t
Queries that contain both join and selection
Example query :Where <Quotes><Quote><Symbol>$s</>
<Industry>”Computer Service”</></>element_as $g </> in “quotes.xml”,
<Companies><Company><Symbol>$s</></>element_as $t</> in “companies.xml”
construct $g, $t
Where to place the selection operator ? Below the join
Eliminates irrelevant tuples Above the join
Allows sharing Pick based on cost model
Grouping timer-based queries
Challenge Hard to monitor the timer events of
queries Sharing common computation becomes
difficult
Event Detection Stores time events sorted in time order Each query has an ID
Incremental evaluation
Invoke queries only on changed data
For each source file, NiagaraCQ keeps a delta file Also for the intermediate files Time stamp store the each tuple – for timer
based queries
Incremental evaluation of join operators requires complete data files
Memory Caching
Thousands of continuous queries can’t fit in memoryWhat should we cache ?
Grouped query plans What about non-grouped queries ?
Favor small delta files Time window of the event list
System Architecture
CQ processing
Experimental Results
Example query :Where <Quotes><Quote><Symbol>”INTC”</></>element_as $g </> in “quotes.xml”, construct $g
N = number of installed queriesF= number of fired queriesC = number of tuples modified
Performance Results
Case 1: F=N, C=1000Case 2: F=100, C=1000
Performance ResultsF=N=2000, vary data size
Thank You