QED: A Novel Quaternary Encoding to Completely Avoid Re-labeling in XML Updates
Changqing Li, Tok Wang Ling
2
Outline
• Background and related work
• Our QED encoding
• Completely avoid re-labeling in XML updates based on our QED
• Experiments
• Conclusion
3
Background
• Three main categories of labeling schemes to process XML queries
– Containment labeling scheme [Zhang et al SIGMOD01 etc.]
– Prefix labeling scheme [Tatarinov et al SIGMOD02 etc.]
– Prime number labeling scheme [Wu et al ICDE04]
4
(1) Containment Scheme
• “start”, “end”, and “level”
• Determine ancestor-descendant and parent-child relationships based on the containment property
1,16,1
2,3,2 4,9,2 10,13,2 14,15,2
5,6,3 7,8,3 11,12,3
“5,6,3” is a descendant of “1,16,1” because interval [5,6] is contained in interval [1,16]
“5,6,3” is a child of “4,9,2” because interval [5,6] is contained in interval [4,9], and levels 3-2=1
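The two checks above can be sketched in a few lines of Python. This is only an illustration, not code from the talk: the `Label` type and the function names are assumptions made for the example.

```python
from typing import NamedTuple


class Label(NamedTuple):
    """A containment label: "start", "end", and "level"."""
    start: int
    end: int
    level: int


def is_ancestor(a: Label, d: Label) -> bool:
    # d is a descendant of a when [d.start, d.end] lies strictly inside [a.start, a.end].
    return a.start < d.start and d.end < a.end


def is_parent(p: Label, c: Label) -> bool:
    # A parent is an ancestor exactly one level above the child.
    return is_ancestor(p, c) and c.level - p.level == 1


root = Label(1, 16, 1)
node = Label(4, 9, 2)
leaf = Label(5, 6, 3)
print(is_ancestor(root, leaf))  # True
print(is_parent(node, leaf))    # True
print(is_parent(root, leaf))    # False
```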
5
(1) Containment Scheme, Containment is bad to process updates
• Need to re-label all the ancestor nodes and all the nodes after the inserted node in document order
1,16,1
2,3,2 4,9,2 10,13,2 14,15,2
5,6,3 7,8,3 11,12,3
6
(1) Containment Scheme, Containment is bad to process updates
• Need to re-label all the ancestor nodes and all the nodes after the inserted node in document order
1,18,1
2,3,2 4,9,2 10,11,2 12,15,2 16,17,2
5,6,3 7,8,3 13,14,3
• All the numbers shown in red on the slide (the root's “end” value and the labels after the inserted node) need to be changed, which is very expensive
7
(1) Containment Scheme, Approaches to solve the update problem
• Increase the interval size and leave some values unused [Li et al VLDB01]
– When the unused values are used up, nodes have to be re-labeled
• Use floating-point values [Amagasa et al ICDE03]
– A floating-point value is represented in a computer with a fixed number of bits
– Due to the limited floating-point precision, nodes eventually have to be re-labeled
• Neither approach can completely avoid re-labeling
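The precision limit can be illustrated with a small Python sketch. It is not the scheme of [Amagasa et al ICDE03]; it only assumes that each new value is taken as the midpoint of its two neighbors, and it counts how many insertions at one fixed place succeed before the midpoint is no longer strictly between them. The function name and the `limit` parameter are hypothetical.

```python
def count_skewed_insertions(lo: float, hi: float, limit: int = 10_000) -> int:
    """Count midpoint insertions just right of `lo` until precision runs out."""
    count = 0
    while count < limit:
        mid = (lo + hi) / 2
        if not (lo < mid < hi):
            # No representable value is left between the neighbors:
            # at this point existing labels would have to be re-labeled.
            break
        hi = mid
        count += 1
    return count


print(count_skewed_insertions(1.0, 2.0))
```

With 64-bit floats the loop stops after a few dozen insertions, which is why a fixed-width floating-point label cannot absorb unbounded skewed insertions.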
8
(2) Prefix Scheme
• Determine ancestor-descendant and parent-child relationships based on the prefix property
1 2 3 4
2.1 2.2 3.1
“2.1” is a descendant of the root, because the label of the root is empty which is a prefix of “2.1”
“2.1” is a child of “2” because “2” is an immediate prefix of “2.1”, i.e. after removing “2” from the left side of “2.1”, only one component remains.
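A minimal Python sketch of the prefix tests, assuming labels are written as dot-separated strings and the root label is the empty string, as in the example above. The function names are illustrative.

```python
def components(label: str) -> list[str]:
    """Split a prefix label such as "2.1" into its components; the root is empty."""
    return label.split(".") if label else []


def is_ancestor(a: str, d: str) -> bool:
    # a is an ancestor of d when a's components are a proper prefix of d's.
    ca, cd = components(a), components(d)
    return len(ca) < len(cd) and cd[:len(ca)] == ca


def is_parent(p: str, c: str) -> bool:
    # An immediate prefix is exactly one component shorter.
    return is_ancestor(p, c) and len(components(c)) == len(components(p)) + 1


print(is_ancestor("", "2.1"))  # True: the empty root label is a prefix of every label
print(is_parent("2", "2.1"))   # True
print(is_parent("", "2.1"))    # False: the root is two levels above "2.1"
```

Comparing component lists rather than raw strings avoids treating “2” as a prefix of a label such as “21”.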
9
(2) Prefix Scheme, Prefix is bad to process order-sensitive updates
• To maintain the document order when updates are performed ---- order-sensitive updates
• Need to re-label all the sibling nodes after the inserted node and all the descendants of these siblings
1 2 3 4
2.1 2.2 3.1
10
(2) Prefix Scheme, Prefix is bad to process order-sensitive updates
• To maintain the document order when updates are performed ---- order-sensitive updates
• Need to re-label all the sibling nodes after the inserted node and all the descendants of these siblings
1 2 3 4 5
2.1 2.2 4.1
• All the numbers shown in red on the slide (the siblings after the inserted node and their descendants) need to be changed, which is very expensive
11
(2) Prefix Scheme, Approaches to solve the update problem
• OrdPath [O'Neil et al SIGMOD04]
– At the beginning, use odd numbers only
1 3 5 7
3.1 3.3 5.1
12
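(2) Prefix Scheme, Approaches to solve the update problem
• OrdPath [O'Neil et al SIGMOD04]
– In insertion, use even numbers together with odd numbers
– Label of node a: “-1”
– Label of node b: “6.1”
– Label of node c: “6.3”
– Label of node d: “6.2.1”
• All of these nodes are at the same level, which their labels do not show; this is bad
[Figure: OrdPath-labeled tree with nodes 1, 3, 5, 7 and children 3.1, 3.3, 5.1, plus the inserted nodes a, b, c, and d]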
13
(2) Prefix Scheme, Problems of OrdPath
• Nodes a, b, and c are at the same level, but their labels “-1”, “6.1”, and “6.3” have different numbers of components, so the level cannot be read directly from a label; determining it takes more time and decreases the query performance
• Half of the numbers (the even numbers) are wasted, which increases the label size
• An even number between two odd numbers must be calculated on insertion, so the update cost is not cheap
• A fixed-length field stores the size of each label; this field will eventually overflow when a lot of nodes are inserted, so OrdPath cannot completely avoid re-labeling
14
(3) Prime scheme
• Based on a top-down approach, each node is given a unique prime number (self_label), and the label of each node is the product of its parent node’s label (parent_label) and its own self_label.
• Query
– Uses modular and division operations to determine the ancestor-descendant and ordering relationships, which are very expensive
• Update
– When nodes are inserted into the XML tree, the SC (simultaneous congruence) values that record the document order must be re-calculated, which is much more expensive than re-labeling
• Details can be found in [Wu et al ICDE04]
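The labeling and the ancestor test can be sketched as follows. This is a simplified illustration, not the implementation of [Wu et al ICDE04]: it assumes the root has label 1, picks the primes by hand, and leaves out the SC values used for ordering.

```python
def prime_label(parent_label: int, self_label: int) -> int:
    """Label = parent's label times the node's own prime self_label."""
    return parent_label * self_label


def is_ancestor(a_label: int, d_label: int) -> bool:
    # The descendant's label is a proper multiple of every ancestor's label.
    return a_label != d_label and d_label % a_label == 0


def is_parent(p_label: int, c_label: int, c_self_label: int) -> bool:
    # Dividing out the child's own prime leaves the parent's label.
    return c_label // c_self_label == p_label


root = 1                    # assumed root label
a = prime_label(root, 2)    # first child of the root
b = prime_label(root, 3)    # second child of the root
c = prime_label(a, 5)       # child of a

print(is_ancestor(a, c))    # True: 10 % 2 == 0
print(is_ancestor(b, c))    # False: 10 % 3 != 0
print(is_parent(a, c, 5))   # True: 10 // 5 == 2
```

Because labels are products of primes, they grow quickly with depth, so the modular and division operations act on large integers.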
15
Our QED encoding
• Dynamic Quaternary Encoding (QED)
• Four quaternary symbols “0”, “1”, “2”, and “3” are used, and each symbol is stored with two bits, i.e. “00”, “01”, “10”, and “11”.
• The symbol “0” is reserved as the separator, so only “1”, “2”, and “3” appear inside a QED code.
– QED codes are compared in lexicographical order
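The two-bit storage can be sketched in Python. Bit strings stand in for the packed bits here to keep the example readable, and the helper names are assumptions made for the illustration.

```python
def to_bits(code: str) -> str:
    """Store each symbol of a QED code in two bits and append the separator "0"."""
    if not code or set(code) - {"1", "2", "3"}:
        raise ValueError("a QED code uses only the symbols 1, 2 and 3")
    return "".join(format(int(symbol), "02b") for symbol in code) + "00"


def from_bits(bits: str) -> str:
    """Decode one stored code, dropping the trailing separator."""
    symbols = [str(int(bits[i:i + 2], 2)) for i in range(0, len(bits), 2)]
    return "".join(symbols).rstrip("0")


stored = to_bits("132")
print(stored)             # 01111000
print(from_bits(stored))  # 132
```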
16
Example about QED
• We show how to encode 16 numbers; we choose 16 because the containment-scheme example has 16 “start” and “end” values in total; this is only an example
• Any other number of values can be encoded by our QED
• Each step encodes the (1/3)th and (2/3)th numbers between two numbers
– “0” is the separator, and only “1”, “2”, and “3” appear in the QED codes, hence the (1/3)th and (2/3)th positions
1,16,1
2,3,2 4,9,2 10,13,2 14,15,2
5,6,3 7,8,3 11,12,3
17
Example about QED
Decimal number FixedLength VarLength QED Position
1 00001 1
2 00010 10
3 00011 11
4 00100 100
5 00101 101
6 00110 110 2 (1/3)th position = 6 = round(0+(17-0)/3)
7 00111 111
8 01000 1000
9 01001 1001
10 01010 1010
11 01011 1011 3 (2/3)th position = 11 = round(0+(17-0)*2/3)
12 01100 1100
13 01101 1101
14 01110 1110
15 01111 1111
16 10000 10000
Boundary positions (not encoded): 0 and 17
18
Example about QED
Decimal number FixedLength VarLength QED Position
1 00001 1
2 00010 10 12 (1/3)th position = 2 = round(0+(6-0)/3)
3 00011 11
4 00100 100 13 (2/3)th position = 4 = round(0+(6-0)*2/3)
5 00101 101
6 00110 110 2 (1/3)th position = 6 = round(0+(17-0)/3)
7 00111 111
8 01000 1000 22 (1/3)th position = 8 = round(6+(11-6)/3)
9 01001 1001 23 (2/3)th position = 9 = round(6+(11-6)*2/3)
10 01010 1010
11 01011 1011 3 (2/3)th position = 11 = round(0+(17-0)*2/3)
12 01100 1100
13 01101 1101 32 (1/3)th position = 13 = round(11+(17-11)/3)
14 01110 1110
15 01111 1111 33 (2/3)th position = 15 = round(11+(17-11)*2/3)
16 10000 10000
Boundary positions (not encoded): 0 and 17
19
Example about QED
Decimal number FixedLength VarLength QED Position
1 00001 1 112 (1/3)th position = 1 = round(0+(2-0)/3)
2 00010 10 12 (1/3)th position = 2 = round(0+(6-0)/3)
3 00011 11 122 (1/3)th position = 3 = round(2+(4-2)/3)
4 00100 100 13 (2/3)th position = 4 = round(0+(6-0)*2/3)
5 00101 101 132 (1/3)th position = 5 = round(4+(6-4)/3)
6 00110 110 2 (1/3)th position = 6 = round(0+(17-0)/3)
7 00111 111 212 (1/3)th position = 7 = round(6+(8-6)/3)
8 01000 1000 22 (1/3)th position = 8 = round(6+(11-6)/3)
9 01001 1001 23 (2/3)th position = 9 = round(6+(11-6)*2/3)
10 01010 1010 232 (1/3)th position = 10 = round(9+(11-9)/3)
11 01011 1011 3 (2/3)th position = 11 = round(0+(17-0)*2/3)
12 01100 1100 312 (1/3)th position = 12 = round(11+(13-11)/3)
13 01101 1101 32 (1/3)th position = 13 = round(11+(17-11)/3)
14 01110 1110 322 (1/3)th position = 14 = round(13+(15-13)/3)
15 01111 1111 33 (2/3)th position = 15 = round(11+(17-11)*2/3)
16 10000 10000 332 (1/3)th position = 16 = round(15+(17-15)/3)
Boundary positions (not encoded): 0 and 17
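The table can be reproduced with a short recursive sketch in Python. It is an illustration rather than the authors' implementation: positions 0 and n+1 are treated as boundaries with empty codes, each new code is derived from its two neighbors with the insertion rules that the GetInsertedCode slide gives later, and the function names are hypothetical.

```python
def insert_between(left: str, right: str) -> str:
    """Insertion rules of GetInsertedCode; "" marks an open boundary."""
    if len(left) < len(right):                      # Case (1)
        return right[:-1] + "12"
    if len(left) > len(right) and left[-1] == "2":  # Case (2)
        return left[:-1] + "3"
    return left + "2"                               # Cases (3) and (4)


def qed_encode(n: int) -> list[str]:
    """Return the QED codes for positions 1..n in document order."""
    codes = {0: "", n + 1: ""}

    def fill(lo: int, hi: int) -> None:
        gap = hi - lo
        if gap < 2:
            return
        # Integer forms of round(lo + gap/3) and round(lo + gap*2/3).
        p1 = lo + (gap + 1) // 3
        p2 = lo + (2 * gap + 1) // 3
        codes[p1] = insert_between(codes[lo], codes[hi])
        if p2 != p1:
            codes[p2] = insert_between(codes[p1], codes[hi])
        fill(lo, p1)
        fill(p1, p2)
        fill(p2, hi)

    fill(0, n + 1)
    return [codes[i] for i in range(1, n + 1)]


print(qed_encode(16))
# ['112', '12', '122', '13', '132', '2', '212', '22',
#  '23', '232', '3', '312', '32', '322', '33', '332']
```

The returned list is already in lexicographical order, so position i in the document maps to the i-th smallest code.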
20
Overflow problem of other methods
• On the previous slide, the FixedLength codes are stored with length 5, i.e. each code is 5 bits long
• When a lot of codes are inserted, 5 bits are no longer enough, and all the FixedLength codes need to be changed.
• For the VarLength codes, we also need to store the length of each code, e.g., the length of “10000” is 5. This 5 must itself be stored in a fixed number of bits (“101”; 3 bits), and the sizes of the other codes are stored in the same 3-bit field.
• When a lot of codes are inserted, the 3-bit size field is no longer large enough, and all the codes must be changed
• This is called the overflow problem.
21
Our QED uses “0” to separate different codes ---- it will never encounter the overflow problem
• The QED codes “112”, “12”, “122”, etc. in the table are separated with “0”
• They are stored as “11201201220”; based on the separator “0”, we can tell the different codes apart
• The separator “0” needs no size field, so it never encounters the overflow problem
• Our QED encoding can help to completely avoid the re-labeling
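A minimal Python sketch of storing codes with the “0” separator, using a string of quaternary symbols in place of the packed two-bit form. The function names are illustrative.

```python
def pack(codes: list[str]) -> str:
    """Concatenate QED codes, ending each one with the separator "0"."""
    return "".join(code + "0" for code in codes)


def unpack(stream: str) -> list[str]:
    """Split a stored stream back into codes; no length field is needed."""
    parts = stream.split("0")
    return parts[:-1]  # the final separator leaves one empty trailing part


stream = pack(["112", "12", "122"])
print(stream)          # 11201201220
print(unpack(stream))  # ['112', '12', '122']
```

A code of any length round-trips the same way, because the end of a code is marked by the separator rather than by a fixed-size length field.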
22
Lexicographical order for our QED
• Our QED compares codes based on the lexicographical order
• The QED codes in the table are lexicographically ordered from top to bottom.
– E.g., “132” < “2” lexicographically because the comparison is from left to right, and the 1st symbol of “132” is “1”, while the 1st symbol of “2” is “2”.
– Another example, “23” < “232” lexicographically because “23” is a prefix of “232”.
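The comparison rule can be written out explicitly. The sketch below is illustrative (the name `qed_less` is an assumption); for strings over “1”, “2”, and “3” it gives the same result as Python's built-in `<` on `str`.

```python
def qed_less(a: str, b: str) -> bool:
    """Return True when code a precedes code b lexicographically."""
    for x, y in zip(a, b):
        if x != y:
            return x < y      # the first differing symbol decides
    return len(a) < len(b)    # a proper prefix is smaller


print(qed_less("132", "2"))   # True: first symbols are "1" and "2"
print(qed_less("23", "232"))  # True: "23" is a prefix of "232"
print(qed_less("232", "23"))  # False
```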
23
(a) Applying QED encoding to the containment scheme
• Replace the “start” and “end” values “1” to “16” with our QED codes
• A QED encoding based on containment scheme is formed
• Compare labels based on lexicographical order
112,332
12,122 13,23 232,32 322,33
132,2 212,22 3,312
• Note that the level values are omitted from the figure for clarity
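With QED codes in place of integers, the containment test stays the same except that the comparisons are lexicographical. A small Python sketch using labels from the figure; the tuple layout and the function name are assumptions for the example.

```python
def is_ancestor(a: tuple[str, str], d: tuple[str, str]) -> bool:
    """Containment test on (start, end) pairs of QED codes."""
    a_start, a_end = a
    d_start, d_end = d
    # Python compares str values lexicographically, as QED requires.
    return a_start < d_start and d_end < a_end


root = ("112", "332")
node = ("13", "23")
leaf = ("132", "2")
other = ("232", "32")

print(is_ancestor(root, leaf))   # True
print(is_ancestor(node, leaf))   # True: "13" < "132" and "2" < "23"
print(is_ancestor(other, leaf))  # False
```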
24
(b) Applying QED encoding to the prefix scheme
• The root has 4 children. To encode 4 numbers based on our QED, the codes will be “12”, “2”, “3” and “32”.
• Similarly if there are 2 siblings, their self_labels (last component, e.g., “3” in “2.3” is the self_label) are “2” and “3”.
• If there is only 1 sibling, its self_label is “2”.
12 2 3 32
2.2 2.3 3.2
25
(b) Processing the delimiters of the prefix scheme based on our QED
• For the prefix scheme, the delimiter “.” cannot be stored together with the numbers in the implementation to separate different components.
• For our QED encoding, we process the delimiters as follows.
– We use one “0” as the delimiter to separate different components of a prefix label
• e.g. to separate “12” and “3” in “12.3”; the delimiter “0” is equivalent to the “.”; “12.3” is stored as “1203” in the implementation
– We use two consecutive separators “00” to separate different labels
• e.g. “1202001203” represents 2 labels, i.e. “1202” and “1203”.
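The two delimiter levels can be sketched in Python, again with strings of quaternary symbols standing in for the packed bits. Labels are modeled as lists of components; the function names are illustrative.

```python
def serialize(labels: list[list[str]]) -> str:
    """Join components with "0" and whole labels with "00"."""
    return "00".join("0".join(parts) for parts in labels)


def parse(stream: str) -> list[list[str]]:
    """Recover the labels and their components from a stored stream."""
    # Components are non-empty and never contain "0", so "00" can only
    # occur between two labels.
    return [label.split("0") for label in stream.split("00")]


stream = serialize([["12", "2"], ["12", "3"]])
print(stream)         # 1202001203
print(parse(stream))  # [['12', '2'], ['12', '3']]
```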
26
Algorithm for insertion based on QED
Algorithm: GetInsertedCode
Input: Left_Code, Right_Code
Output: Inserted_Code, such that Left_Code < Inserted_Code < Right_Code lexicographically.
get the sizes of Left_Code and Right_Code
if size(Left_Code) < size(Right_Code) //Case (1)
  then Inserted_Code = (the Right_Code with the last symbol changed to “1”) concatenate “2”
else if size(Left_Code) > size(Right_Code)
  if the last symbol of Left_Code is “2” //Case (2)
    then Inserted_Code = the Left_Code with the last symbol changed from “2” to “3”
  else if the last symbol of Left_Code is “3” //Case (3)
    then Inserted_Code = Left_Code concatenate “2”
else if size(Left_Code) = size(Right_Code) //Case (4)
  then Inserted_Code = Left_Code concatenate “2”
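A direct Python transcription of the four cases is sketched below. The snake_case names are an assumption, and the explicit error for a malformed left code is an addition for safety, since valid QED codes always end in “2” or “3”.

```python
def get_inserted_code(left_code: str, right_code: str) -> str:
    """Return a code strictly between left_code and right_code."""
    if len(left_code) < len(right_code):   # Case (1)
        # Right_Code with its last symbol changed to "1", then "2" appended.
        return right_code[:-1] + "1" + "2"
    if len(left_code) > len(right_code):
        if left_code[-1] == "2":           # Case (2)
            return left_code[:-1] + "3"
        if left_code[-1] == "3":           # Case (3)
            return left_code + "2"
        raise ValueError("a QED code must end with 2 or 3")
    return left_code + "2"                 # Case (4): equal sizes


print(get_inserted_code("23", "232"))    # 2312  (Case 1)
print(get_inserted_code("2312", "232"))  # 2313  (Case 2)
print(get_inserted_code("2", "3"))       # 22    (Case 4)
```

Every result ends in “2” or “3”, so the function can be applied again to any pair that includes a freshly inserted code.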
27
XML updates based on our QED–containment
• When we insert a node as shown in the figure below
• We should insert two QED codes between “23” and “232”
– First create the “start” value
• i.e. a code between “23” and “232”; the new code is “2312”
• see Case (1) of the GetInsertedCode algorithm
– Then create the “end” value
• i.e. a code between “2312” and “232”; the new code is “2313”
• see Case (2) of the GetInsertedCode algorithm
• “23” < “2312” < “2313” < “232” lexicographically, so we need not re-label any existing nodes.
112,332
12,122 13,23 [inserted: 2312,2313] 232,32 322,33
132,2 212,22 3,312
28
XML updates based on our QED – based on prefix scheme
• When we insert a node as shown in the figure below
• We should insert one QED code between “2” and “3”
– The new QED code between “2” and “3” is “22”
– see Case (4) of the GetInsertedCode algorithm
• “2” < “22” < “3” lexicographically, so we need not re-label any existing nodes, and the document order is still kept.
12 2 22 3 32
202 203 302
29
Experimental results – Experimental setup
• We mainly report the results on updates
• We select the Hamlet file from the Shakespeare’s plays dataset
• Intermittent updates
– The Hamlet file has 5 act elements, giving 6 insertion cases, i.e. before act[1], between act[1] and act[2], …, between act[4] and act[5], and after act[5].
• Uniformly frequent updates
– Insertions happen randomly at different places of the Hamlet file
• Skewed frequent updates
– Insertions always happen at a fixed place of the Hamlet file
30
Experimental results – intermittent updates
• Prime needs to re-calculate fewer SC values, but its re-calculation time is very large
• Theorem. Our QED never needs to re-label any existing nodes
• The update time of our QED is much smaller
• The update performance differences among OrdPath, Float-point, and our QED can be seen on the next slide
• Note that QED denotes both the QED encoding and the QED-containment scheme; QED-PREFIX denotes the scheme obtained by applying the QED encoding to the prefix scheme.
[Figure (a) Number of nodes to re-label: number of nodes to re-label in horizontal update (0 to 7000) for insertion cases 1 to 6; series: Prime, OrdPath1, OrdPath2, QED-PREFIX, Float-point, FixedLength, VarLength, QED]
[Figure (b) Time to re-label: horizontal update time in seconds (0 to 0.04) for insertion cases 1 to 6; same series; data labels printed on the chart: 3.616, 5.229, 6.148, 6.841, 2.093]
31
Experimental results – uniformly frequent updates
• When uniformly frequent updates are performed,
– The update time of OrdPath and Float-point is much larger (more than 386 times) than the time required by our QED approaches
• Our QED encoding only needs to modify the last 2 bits of the neighbor label, which is very cheap
• Neither OrdPath nor Float-point can completely avoid re-labeling
[Figure (a) OrdPath1&2 vs QED-PREFIX: update time for balanced tiny insertions in seconds (0 to 600) against number of nodes inserted (0 to 450000); series: OrdPath1, OrdPath2, QED-PREFIX]
[Figure (b) Float-point vs QED: update time for balanced tiny insertions in seconds (0 to 9000) against number of nodes inserted (0 to 450000); series: Float-point, QED]
32
Experimental results – skewed frequent updates
• When skewed frequent updates are performed,
– The update time of OrdPath and Float-point is much larger (more than 8126 times) than the time required by our QED approaches
• The very large update time makes OrdPath and Float-point unsuitable for answering queries in an environment with frequent insertions.
• Our QED still works best for answering queries in an environment where frequent insertions are executed
[Figure (a) OrdPath1&2 vs QED-PREFIX: update time with re-labeling for skewed tiny insertions in seconds (0 to 12) against number of nodes inserted (0 to 200); series: OrdPath1, OrdPath2, QED-PREFIX]
[Figure (b) Float-point vs QED: update time with re-labeling for skewed tiny insertions in seconds (0 to 150) against number of nodes inserted (0 to 200); series: Float-point, QED]
33
Conclusion
• We propose the QED encoding
• QED can be applied broadly to different labeling schemes
• QED can completely avoid re-labeling in XML updates