Querying Streaming XML Data

Post on 19-Dec-2015

TRANSCRIPT

Page 1: Querying Streaming XML Data. Layout of the presentation  Introduction  Common Problems faced  Solution proposed  Basic Building blocks of the solution

Querying Streaming XML Data

Layout of the presentation

- Introduction
- Common problems faced
- Solution proposed
- Basic building blocks of the solution
- How to build up a solution to a given query
- Features of the system

Streaming XML

- XML is a standard for information exchange.
- Some XML documents are only available in streaming format.
- Streaming is like reading data from a tape drive.
- Used in stock market, news, and network statistics feeds.
- Predecessor systems were used to filter documents.

Structure of an XPath Query

- Consists of a location path and an output expression (name).
- The location path consists of a closure axis (//), a node test (book), and a predicate (year>2000).
- e.g. //book[year>2000]/name
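As a rough illustration of that structure, the sketch below splits the example query into its location steps and output expression. The helper name `split_query` and the regular expression are assumptions made for this example; it handles only simple queries of the shape shown on the slide and is not a full XPath parser.

```python
import re

# Hypothetical helper, not part of any XPath library.
# One step = axis ("/" or "//"), a tag name, and an optional [predicate].
STEP = re.compile(r"(//|/)([A-Za-z_][\w.-]*)(?:\[([^\]]*)\])?")

def split_query(query):
    """Return (location_path, output_expression) for a simple query."""
    steps = [
        {
            "axis": "closure" if axis == "//" else "child",
            "node_test": tag,
            "predicate": pred or None,
        }
        for axis, tag, pred in STEP.findall(query)
    ]
    # On the slide, the final step (name) plays the role of the output expression.
    return steps[:-1], steps[-1]["node_test"]

location_path, output = split_query("//book[year>2000]/name")
print(location_path)
print(output)
```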

Features of our Approach

- Efficient
- Easy-to-understand design
- Design of the BPDT is tricky

Common Problems faced

 1. <root>
 2.   <pub>
 3.     <book id="1">
 4.       <price> 12.00 </price>
 5.       <name> First </name>
 6.       <author> A </author>
 7.       <price type="discount"> 10.00 </price>
 8.     </book>
 9.     <book id="2">
10.       <price> 14.00 </price>
11.       <name> Second </name>
12.       <author> A </author>
13.       <author> B </author>
14.       <price type="discount"> 12.00 </price>
15.     </book>
16.     <year> 2002 </year>
17.   </pub>
18. </root>

Query: /pub[year=2002]/book[price<11]/author

Evaluating the query as the stream arrives:

- Line 6, author A: the element satisfies the path.
- Line 4, price 12.00: failure? Not necessarily, since a later price may still pass.
- Line 7, price 10.00: test passed. But is year=2002?
- Lines 12-13: buffer both A and B.
- Line 15, end of book 2: failed price<11. Remove A and B from the buffer.
- Line 16, year 2002: test passed. Output A.
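The buffering behaviour in this walkthrough can be sketched with Python's standard `xml.sax` module. This is a minimal, hard-wired handler for the one query on the slide, not the transducer-based method the presentation goes on to describe. The class name `Evaluator` and its fields are assumptions made for this example, and the handler matches on element names only rather than checking the full path.

```python
import xml.sax

DOC = b"""<root><pub>
<book id="1"><price> 12.00 </price><name> First </name><author> A </author>
<price type="discount"> 10.00 </price></book>
<book id="2"><price> 14.00 </price><name> Second </name><author> A </author>
<author> B </author><price type="discount"> 12.00 </price></book>
<year> 2002 </year></pub></root>"""

class Evaluator(xml.sax.ContentHandler):
    """Hard-wired sketch for /pub[year=2002]/book[price<11]/author."""

    def __init__(self):
        super().__init__()
        self.text = ""          # text seen since the last start tag
        self.book_authors = []  # authors of the book currently open
        self.book_ok = False    # has a price < 11 been seen in this book?
        self.pending = []       # authors waiting for year=2002
        self.year_ok = False
        self.results = []

    def startElement(self, name, attrs):
        self.text = ""
        if name == "book":
            self.book_authors, self.book_ok = [], False
        elif name == "pub":
            self.pending, self.year_ok = [], False

    def characters(self, content):
        self.text += content

    def endElement(self, name):
        value = self.text.strip()
        if name == "author":
            self.book_authors.append(value)      # matches the path: hold it
        elif name == "price":
            if float(value) < 11:
                self.book_ok = True              # predicate can turn true late
        elif name == "year":
            if float(value) == 2002:
                self.year_ok = True
                self.results.extend(self.pending)  # release buffered authors
                self.pending.clear()
        elif name == "book" and self.book_ok:
            target = self.results if self.year_ok else self.pending
            target.extend(self.book_authors)
        # A book that never saw price < 11 is dropped when the next book starts.

handler = Evaluator()
xml.sax.parseString(DOC, handler)
print(handler.results)  # ['A']
```

Only the author of the first book is output: the second book's authors are buffered but discarded once both of its prices fail the test.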

Problems caused by closure axis

 1. <root>
 2.   <pub>
 3.     <book>
 4.       <name> X </name>
 5.       <author> A </author>
 6.     </book>
 7.     <book>
 8.       <name> Y </name>
 9.       <pub>
10.         <book>
11.           <name> Z </name>
12.           <author> B </author>
13.         </book>
14.         <year> 1999 </year>
15.       </pub>
16.     </book>
17.     <year> 2002 </year>
18.   </pub>
19. </root>

Query: //pub[year=2002]//book[author]//name

pub[year=2002]    book[author]
Line 2: True      Line 7: False
Line 2: True      Line 10: True
Line 9: False     Line 10: True

- Line 14, year 1999: the pub at line 9 fails year=2002.
- Line 17, year 2002: the pub at line 2 passes year=2002.

Problems caused by closure axis

 1. <root>
 2.   <pub>
 3.     <book>
 4.       <name> X </name>
 5.       <author> A </author>
 6.     </book>
 7.     <book>
 8.       <name> Y </name>
 9.       <author> B </author>
10.       <pub>
11.         <book>
12.           <name> Z </name>
13.           <author> B </author>
14.         </book>
15.         <year> 1999 </year>
16.       </pub>
17.     </book>
18.     <year> 2002 </year>
19.   </pub>
20. </root>

Query: //pub[year=2002]//book[author]//name

Let's add an author to the book at line 7 (new line 9). Result?

Handling XML Stream

- Input: a well-formed XML stream.
- Use the SAX API to parse the XML.
- Events belong to:
    Begin = {(a, attrs, d)}
    End = {(/a, d)}
    Text = {(a, text(), d)}
- XML stream: {e1, e2, …, ei, …} | ei ∈ Begin ∪ End ∪ Text
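The event model above maps fairly directly onto a SAX content handler. The sketch below uses Python's standard `xml.sax` module and assumes that `a` is the element name, `attrs` its attributes, and `d` the depth of the element with the root at depth 1. The slide does not spell these out, so treat them as assumptions. The class name `EventCollector` and the leading "begin"/"end"/"text" labels are also inventions of this example.

```python
import xml.sax

class EventCollector(xml.sax.ContentHandler):
    """Turn SAX callbacks into Begin, End and Text tuples."""

    def __init__(self):
        super().__init__()
        self.open = []    # names of the elements currently open
        self.events = []

    def startElement(self, name, attrs):
        self.open.append(name)
        # Begin event: (a, attrs, d)
        self.events.append(("begin", name, dict(attrs.items()), len(self.open)))

    def characters(self, content):
        # A real parser may deliver text in several chunks; this sketch
        # records each non-blank chunk as its own Text event: (a, text(), d).
        if content.strip():
            self.events.append(("text", self.open[-1], content.strip(), len(self.open)))

    def endElement(self, name):
        # End event: (/a, d)
        self.events.append(("end", "/" + name, len(self.open)))
        self.open.pop()

collector = EventCollector()
xml.sax.parseString(b'<pub><book id="1"><name>First</name></book></pub>', collector)
for event in collector.events:
    print(event)
```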

Grammar for XPath Queries

Q  → N+ [/O]
N  → [/ | //] tag [F]
F  → "[" FO [OP constant] "]"
FO → @attribute | tag [@attribute] | text()
O  → @attribute | text()
OP → > | ≥ | = | < | ≤ | ≠ | contains

XPath queries have the form N1N2…Nn/O.

Cannot handle reverse axes or positional functions.
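To make the predicate rules concrete, here is a small sketch that checks whether the text inside a predicate can be derived from the F, FO and OP rules and splits it into its parts. The function name `parse_filter` is hypothetical, and the ASCII spellings `>=`, `<=` and `!=` are assumed stand-ins for ≥, ≤ and ≠.

```python
import re

# Operators from the OP rule; longer spellings first so ">=" is not read as ">".
OPS = ["contains", ">=", "<=", "!=", ">", "<", "="]

# FO -> @attribute | tag [@attribute] | text()
FO = re.compile(r"^(?:@\w+|text\(\)|\w+(?:@\w+)?)$")

def parse_filter(body):
    """Split the inside of [...] into (FO, OP, constant).

    OP and constant are None for an existence test such as [author].
    Raises ValueError for anything the grammar cannot derive.
    """
    for op in OPS:
        left, sep, right = body.partition(op)
        if sep and FO.match(left.strip()):
            return left.strip(), op, right.strip()
    if FO.match(body.strip()):
        return body.strip(), None, None
    raise ValueError(f"not derivable from F: {body!r}")

print(parse_filter("year>2000"))
print(parse_filter("author"))
print(parse_filter("@id=1"))
```

A positional predicate such as `position()=1` is rejected, which is consistent with the restriction stated on the slide.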

Solution to Query

Query: /pub[year=2002]/book[price<11]/author

[Figure labels: PDA, PDT]

Basic PushDown Transducer (BPDT)

- Similar to a pushdown automaton
- Actions defined on transition arcs
- Finite set of states
    - A start state
    - A set of final states
- Set of input symbols
- Set of stack symbols

Building a BPDT

Query: /pub[year>2000]/book[author]/name/text()

Consider location step: /book[author]

- Book – Author: buffer for future: begin event of Author.
- Book – Author: remove from buffer: end event of Book.
- Book – Author: output result if predicates true: begin event of Author.

Basic Building Blocks

XPath Expression: /tag[child]

Buffer Operations needed

- Enqueue(x): adds x to the end of the queue.
- Clear(): removes all items from the queue.
- Flush(): outputs all items in the queue in FIFO order.
- Upload(): moves all items to the end of the queue of a parent BPDT.

No Dequeue operation is needed.
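The four operations can be sketched as a small Python class. The class name `Buffer`, the `parent` link, and the choice to have `flush()` return the items and leave the queue empty are assumptions made for this example; the slide only names the operations.

```python
class Buffer:
    """Queue of potential results held by one BPDT (illustrative sketch)."""

    def __init__(self, parent=None):
        self.items = []
        self.parent = parent  # buffer of the parent BPDT, if any

    def enqueue(self, x):
        """Add x to the end of the queue."""
        self.items.append(x)

    def clear(self):
        """Remove all items from the queue."""
        self.items = []

    def flush(self):
        """Hand back all items in FIFO order; assumed to empty the queue."""
        out, self.items = self.items, []
        return out

    def upload(self):
        """Move all items to the end of the parent's queue."""
        self.parent.items.extend(self.items)
        self.items = []

parent = Buffer()
child = Buffer(parent)
parent.enqueue("A")
child.enqueue("B")
child.enqueue("C")
child.upload()
print(parent.flush())  # ['A', 'B', 'C']
```

Because items only ever leave a queue all at once (cleared, flushed, or uploaded), a plain list is enough and no per-item dequeue is required.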

Basic Building Blocks

XPath Expression: /tag[@attr=val]

Basic Building Blocks

XPath Expression: /tag[text()=val]

Basic Building Blocks

XPath Expression: /tag[child@attr=val]

Basic Building Blocks

XPath Expression: /tag[child=val]

A sample BPDT

Query: /pub[year>2000]
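The slide's diagram did not survive in this transcript. As a stand-in, the sketch below shows one way a state machine for this query could react to stream events. The state names START, NA and TRUE, the event tuples, and the function `run_bpdt` are assumptions made for this example, not a reconstruction of the original figure. It ignores element depth and the stack for brevity, so it is only meaningful for non-nested `pub` elements.

```python
def run_bpdt(events, tag="pub", child="year", threshold=2000):
    """Return the state after each event for /tag[child > threshold]."""
    state = "START"
    trace = []
    for event in events:
        kind, name = event[0], event[1]
        if state == "START" and kind == "begin" and name == tag:
            state = "NA"      # inside the element, predicate not yet decided
        elif (state == "NA" and kind == "text" and name == child
              and float(event[2]) > threshold):
            state = "TRUE"    # predicate satisfied
        elif kind == "end" and name == "/" + tag:
            state = "START"   # element closed, start over
        trace.append(state)
    return trace

events = [
    ("begin", "pub"),
    ("begin", "year"),
    ("text", "year", "2002"),
    ("end", "/year"),
    ("end", "/pub"),
]
print(run_bpdt(events))  # ['NA', 'NA', 'TRUE', 'TRUE', 'START']
```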

Building a solution

HPDT for Query: //pub[year>2000]//book[author]//name/text()

HPDT Structure

Each BPDT in the HPDT has:

- Position: BPDT position (l, k), where l = depth of the BPDT in the HPDT and k = sequence number from right to left.
    - The BPDT at position (i-1, k) has a right child BPDT at position (i, 2k), connected to its NA state.
    - The BPDT at position (i-1, k) has a left child BPDT at position (i, 2k+1), connected to its TRUE state.
    - BPDT position (i, 2^i - 1) means the predicates in the higher-level BPDTs evaluate to true.
- Buffer: potential results.
- Stack: stack of element (SAX) events.
- Depth vector.
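The position numbering can be checked with a few lines of Python. This sketch assumes the root BPDT sits at position (0, 0), so that following only TRUE-state children reaches sequence number 2^i - 1 at level i. The function names are hypothetical.

```python
def children(position):
    """Positions of the two child BPDTs of the BPDT at (level, k)."""
    level, k = position
    return {
        "na": (level + 1, 2 * k),        # right child, attached to the NA state
        "true": (level + 1, 2 * k + 1),  # left child, attached to the TRUE state
    }

def all_true_position(level):
    """Position reached when every higher-level predicate evaluated to true."""
    return (level, 2 ** level - 1)

position = (0, 0)  # assumed root position
for _ in range(3):
    position = children(position)["true"]
print(position)              # (3, 7)
print(all_true_position(3))  # (3, 7)
```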

Example Query

 1. <root>
 2.   <pub>
 3.     <book>
 4.       <name> X </name>
 5.       <author> A </author>
 6.     </book>
 7.     <book>
 8.       <name> Y </name>
 9.       <pub>
10.         <book>
11.           <name> Z </name>
12.           <author> B </author>
13.         </book>
14.         <year> 1999 </year>
15.       </pub>
16.     </book>
17.     <year> 2002 </year>
18.   </pub>
19. </root>

Query: //pub[year=2002]//book[author]//name

root  pub  book  name
1     2    7     11
1     2    10    11
1     9    10    11

3 paths from $1 to $14
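The three rows can be reproduced by enumerating every way the closure-axis steps pub, book, name map onto the ancestors of the name element at line 11. The sketch below is a brute-force illustration using `itertools.combinations`, not the HPDT mechanism itself. The function name `matchings` and the (name, line) pairs are assumptions made for this example, and the root column is left out.

```python
from itertools import combinations

def matchings(chain, steps):
    """All ways to map closure-axis steps onto an ancestor chain.

    chain is a list of (element name, line number) from the root down to the
    candidate element; the last step must match the last element of the chain.
    """
    *ancestors, target = chain
    *outer, last = steps
    if target[0] != last:
        return []
    found = []
    # combinations() keeps document order, which is what "//" requires.
    for combo in combinations(ancestors, len(outer)):
        if [name for name, _ in combo] == outer:
            found.append([line for _, line in combo] + [target[1]])
    return found

chain = [("root", 1), ("pub", 2), ("book", 7), ("pub", 9), ("book", 10), ("name", 11)]
print(matchings(chain, ["pub", "book", "name"]))
# [[2, 7, 11], [2, 10, 11], [9, 10, 11]]
```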

System Features

Name | Support | Streaming | Multiple Predicates | Closure | Buffered Predicate Evaluation

XSQ-F XPath X X X X

XSQ-NC XPath X X X

XMLTK XPath X X

XQEngine XQuery X X

Galax XQuery X X

Joost STX X X

Reference

Feng Peng and Sudarshan S. Chawathe. XPath Queries on Streaming Data. In SIGMOD 2003.

Thank You

???