niagaracq: a scalable continuous query system for internet databases

42
1 NiagaraCQ: A Scalable Continuous Query System for Internet Databases CS561 Presentation Xiaoning Wang

Upload: uma-sanford

Post on 01-Jan-2016

26 views

Category:

Documents


1 download

DESCRIPTION

NiagaraCQ: A Scalable Continuous Query System for Internet Databases. CS561 Presentation Xiaoning Wang. Presentation Outline. Why NiagaraCQ? What’s NiagaraCQ? Details Experiments. We need Continuous Queries. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

1

NiagaraCQ: A Scalable Continuous Query System for Internet Databases

CS561 Presentation

Xiaoning Wang

Page 2: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

2

Presentation Outline

Why NiagaraCQ?

What’s NiagaraCQ?

Details

Experiments

Page 3: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

3

We need Continuous Queries

Allows users to obtain new results from a database without having to issue the same query repeatedly.

Especially useful in an environment like the Internet comprised of large amounts of frequently changing information.

Need to be able to support millions of queries due to the scale of the Internet.

Page 4: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

4

What’s NiagaraCQ?

A distributed database system for querying distributed XML data sets using a query language like XML-QL.

It allows a very large number of users to be able to register continuous queries in a high-level query language such as XML-QL.

Page 5: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

5

Anything else?

Novelty:

query optimization strategy

Page 6: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

6

What’s the Approach

Group!!! Assumption

?

Page 7: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

7

Advantages of Grouping

Group optimization has the following benefits: Grouped queries can share computations. Common execution plans of grouped queries

can reside in memory, significantly saving on I/O costs compared to executing each query separately.

Grouping makes it possible to test the “firing” conditions of many continuous queries together, avoiding unnecessary invocations.

Page 8: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

8

Can this traditional grouping tech be applied

into Continuous Query directly?

NO!!!

Page 9: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

9

Why not?

Previous group optimization efforts focused on finding an optimal plan for a small number of similar queries.

Not applicable to a continuous query system for the following reasons:

Computationally too expensive to handle a large number of queries.

Not designed for an environment like the web.

Page 10: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

10

Incremental group optimization approach

Uniqueness of NiagaraCQ

Page 11: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

11

NiagaraCQ

Supports scalable continuous query processing over multiple, distributed XML files by deploying incremental group optimization.

Page 12: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

12

It’s a novel approach in which queries are grouped according to their signatures.

Groups are created for existing queries according to their signatures, which represent similar structures among the queries.

Groups allow the common parts of two or more queries to be shared.

Each individual query in a query group shares the results from the execution of the group plan.

Incremental group optimization

Page 13: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

13

Cont.

When a new query arrives, existing groups are considered as possible optimization choices instead of re-grouping all the queries in the system.

New query is merged into existing groups whose signatures match that of the query.

Existing queries are not re-grouped

Page 14: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

14

query-split scheme

Incremental group optimization scheme employs a query-split scheme.

After the signature of a new query is matched, the sub-plan corresponding to the signature is replaced with a scan of the output file produced by the matching group.

Optimization process then continues with the remainder of the query tree is a bottom-up fashion until the entire query has been analyzed.

If no group “matches” a signature of the new query, a new query group for this signature is created in the system.

Thus, each continuous query is split into several smaller queries such that inputs of each of these queries are monitored using the same techniques that are used for the inputs of user-defined continuous queries.

Page 15: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

15

Cont.

Advantages for query-split scheme: Main advantage is that it can be

implemented using a general query engine with only minor modifications.

Anther advantage is that the approach is very scalable.

Page 16: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

16

Cont.

Since queries are continuous being added and removed from groups, over time the quality of the group can deteriorate, leading to a reduction in the overall performance of the system.

One or more groups may require “dynamic regrouping” to re-establish their effectiveness.

Page 17: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

17

Types of Continuous Queries

Change-base continuous queries Fired as soon as new relevant data becomes

available. i.e. In stocks, day-traders would probably

want to know the desired price information immediately.

Provide better response time, but waste system resources when instantaneous answers are not really required.

Page 18: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

18

Types of Continuous Queries

Timer-based continuous queries Executed only as time intervals specified by the

submitting user. i.e. In stocks, long-term investors may be satisfied

being notified every hour. Can be supported more efficiently, thus query

systems that support these continuous queries should be more scalable.

However, since users can specify various overlapping time intervals for their continuous queries, grouping timer-based queries is much more difficult than grouping purely change-based queries.

Page 19: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

19

NiagaraCQ (cont.)

Considering only the changed portion of each updated XML file and not the entire file.

Monitor and detect data source changes using both push and pull models on heterogeneous sources.

Caching mechanism is used to obtain good performance with limited amounts of memory.

Page 20: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

20

Using Expression Signature

Represent the same syntax structure, but possibly different constant values, in different queries.

Specific implementation of the signature concept.

Page 21: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

21

Group Group signature

Quotes.Quote.Symbolin quotes.xml

constant

=

Page 22: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

22

Group Group constant table

…. ….INTC Dest.i

MSFT Dest.j

…. ….

Constant_value Destination_buffer

Page 23: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

23

Group plan Previous query plan

Quotes.xml Quotes.xml

File Scan File Scan

Select Symbol=“INT

C”

Select Symbol=“INT

C”

Trigger Action I Trigger Action J

Page 24: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

24

Group plan New plan

Quotes.xml Constant Table

File Scan File Scan

Join

Trigger Action I Trigger Action J

Symbol=Constant_value

Split

……

Page 25: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

25

Buffer design

The destination buffer for the split operator is needed.

Pipelined scheme

Intermediate scheme

Page 26: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

26

Pipelined scheme

Split

Operator

…….

buffer

Operator

buffer

Page 27: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

27

Disadvantage of pipeline

Such a scheme doesn’t work for grouping timer-based CQ’s. It’s difficult for a split operator to determine which tuple should be stored and how long they should be stored for.

The query structure is a directed graph, not a tree and hence the plan may be too complicated for a general XML-QL query engine to execute

The combine plan may be very large requires resources beyond the limits of system.

A large portion of the query plan may not need to be executed at each query invocation.

One query may block many other queries.

Page 28: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

28

Materialized Intermediate Files

Query Split with Materialized Intermediate Files

Split

Trig.Act.J

File Scan

File Scan

Trig.Act.J

….

File_i File_j

Page 29: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

29

Advantages for this design

Every query is scheduled independently. Only the necessary queries are executed.

Queries after the Split operator will be in standard, tree-structured query format and thus can be scheduled and executed by a general query engine.

Each query in the system is about the size of a common user query, so that it can be executed without consuming an unusual amount of system resources.

This approach handles intermediate files and original data files uniformly.

The potential bottleneck of pipelined approach is avoided.

Page 30: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

30

General Selection Predicates

Format: “Attribute op Constant” Attribute is a path expression without wildcards in it. Op includes “-”, “<”, “>”, because these formats

dominate in the selection queries.

Page 31: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

31

Problem for range-query groups

One potential problem for range-query groups is that the intermediate files may contain a large number of duplicated tuples because range predicated of the different queries might overlap.

Page 32: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

32

Solution: Virtual Intermediate Files

Split

Trig.Act.J

File Scan

File Scan

Trig.Act.J

….

Virtual Intermediate File_i VI File_i

Real Intermediate File

……

Only one

Page 33: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

33

Join Operators Join operators are usually expensive, sharing

common join operations can significantly reduce the amount of computation.

Page 34: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

34

Join first? Or selection first? Advantage

Disadvantage

Solution: choose a better one based on a cost model

Page 35: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

35

Timer-based Continuous Queries

Grouped in the same way as change-based queries except that the time information needs to be recorded at installation time.

Challenges Hard to monitor the timer events of those queries. Sharing the common computation becomes

difficult due to the various time intervals. Timer-based continuous queries fires at

specific times, but only if the corresponding input files have not been modified.

Page 36: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

36

Incremental Evaluation

Incremental evaluation allows queries to be invoked only on the changed data.

For each file, on which CQ’s are defined, NiagaraCQ keeps a “delta file” that contains recent changes.

Queries are run over the delta files whenever possible instead of their original files.

A time stamp is added to each tuple in the delta file. NiagaraCQ fetches only tuples that were added to the delta file since the query’s last firing time.

Page 37: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

37

Memory Caching

Caching is used to obtain good performance with a limited amount of memory.

Caches query plans, system data structures, and data files for better performance.

Page 38: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

38

What should be cached? Grouped query plans. We assume

that the number of query groups is relatively small.

Recently accessed file. The event list for monitoring the

timer-based events. But it can be large, so we keep only a “time window” of this list

Page 39: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

39

Experimental Results

Hardware Settings Conducted on a Sun Ultra 6000 with 1GB of

RAM, running JDK1.2 on Solaris 2.6. Data Sets

Database of stock information consisting of two XML files, “quotes.xml” and “companies.xml”.

“quotes.xml” contains stock information on about 5000 NASDAQ companies. The size of the file is about 2 MB.

“companies.xml” contains related company information. The size of the file is about 1 MB.

Page 40: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

40

Comparison

Page 41: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

41

Comparison

Page 42: NiagaraCQ:  A Scalable Continuous Query System for Internet Databases

42

System Status and Future Work

A prototype version of NiagaraCQ has been developed, which includes a Group Optimizer, Continuous Query Manager, Event Detector, and Data Manager.

Incremental group optimizer is still at a preliminary stage. However, incremental group optimization has been shown to be a promising way to achieve good performance and scalability.

Intend to extent incremental group optimization to queries containing operators other than selection and join (i.e. aggregation).