Bitmap Indexes For Relational XML Twig Query Processing
Kyong-Ha Lee and Bongki Moon, The University of Arizona
The slides I presented at CIKM'09
CIKM'09, Hong Kong 2
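XML Data and Queries
[Figure: an XML document tree of 16 elements. Each element has a region label (start, end, level) and a position number in document order:
a1 (1,32,1) 0; a2 (2,11,2) 1; b1 (3,4,3) 2; c1 (5,10,3) 3; d1 (6,7,4) 4; e1 (8,9,4) 5; a3 (12,21,2) 6; b2 (13,16,3) 7; e2 (14,15,4) 8; d2 (17,20,3) 9; c2 (18,19,4) 10; a4 (22,31,2) 11; b3 (23,28,3) 12; c3 (24,25,4) 13; d3 (26,27,4) 14; e3 (29,30,3) 15]
XML fragment: <a> <a> <b>t1</b> <c> <d>t2</d> <e>t3</e> </c> </a> <a> <b> <e>t4</e> </b> <d> <c>t5</c> </d> </a> . . . </a>
Example twig queries (drawn as tree patterns in the figure): //A/B/C, //A[//B]//C, //A[./B/C]//E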
XML Stored in RDB
NODE table
tagName  start  end  level  value  pathId
a1       1      32   1      -      0
a2       2      11   2      -      1
b1       3      4    3      t1     2
c1       5      10   3      -      3
d1       6      7    4      t2     4
e1       8      9    4      t3     5
a3       12     21   2      -      1
b2       13     16   3      -      2
e2       14     15   4      t4     6
. . .
e3       29     30   3      -      11
PATH table
pathId  pathString
0       A#
1       A##A#
2       B##A##A#
3       C##A##A#
4       D##C##A##A#
5       E##C##A##A#
6       E##B##A##A#
7       D##A##A#
8       C##D##A##A#
9       C##B##A##A#
10      D##B##A##A#
11      E##A##A#
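The (start, end, level) labels stored in the NODE table are enough to test structural relationships between two elements. The following is a minimal sketch in Python, assuming the usual region-encoding rules (an ancestor's interval strictly contains its descendant's, and a parent is an ancestor exactly one level up). The Node type and the helper names are illustrative and do not come from the slides.

```python
from typing import NamedTuple

class Node(NamedTuple):
    name: str
    start: int
    end: int
    level: int

def is_ancestor(a: Node, d: Node) -> bool:
    """True if a's (start, end) interval strictly contains d's interval."""
    return a.start < d.start and d.end < a.end

def is_parent(a: Node, d: Node) -> bool:
    """True if a is an ancestor of d and sits exactly one level above it."""
    return is_ancestor(a, d) and d.level == a.level + 1

# Rows copied from the NODE table.
a2 = Node("a2", 2, 11, 2)
c1 = Node("c1", 5, 10, 3)
e1 = Node("e1", 8, 9, 4)
a3 = Node("a3", 12, 21, 2)

print(is_ancestor(a2, e1), is_parent(a2, e1), is_ancestor(a3, e1))
```

With these rows, a2 is an ancestor but not the parent of e1, and a3 is unrelated to e1.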
Twig Join
• To answer a twig query
  • A twig pattern is decomposed into several path patterns.
  • Path solutions are joined together to compose a final result.
• Holistic Twig Join (HTJ) algorithm
  • A specialized multi-way sort-merge join.
  • Guarantees I/O optimality for a certain subset of XML queries.
  • The optimality depends on how the elements are partitioned.
  • Uses stacks and streams in which elements are kept sorted in a fixed order.
[Figure: element streams (a1 a2 a3 a4; b1 b2 b3; c1 c2 c3; d1 d2 d3; e1 e2 e3) and stacks, labeled SA, SB, SC, SE, for the query nodes of the example twig patterns.]
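The decomposition step can be illustrated in a few lines of Python. This sketch only shows how a twig pattern splits into root-to-leaf path patterns. It does not implement the stack-based join itself, and the nested-tuple representation (axis, tag, children) is an assumption made for the example.

```python
def root_to_leaf_paths(node, prefix=""):
    """Yield one path pattern per leaf of the twig."""
    axis, tag, children = node
    path = prefix + axis + tag
    if not children:
        yield path
    for child in children:
        yield from root_to_leaf_paths(child, path)

# The twig //A[./B/C]//E from the example queries.
twig = ("//", "A", [("/", "B", [("/", "C", [])]), ("//", "E", [])])
paths = list(root_to_leaf_paths(twig))
print(paths)
```

Each resulting path pattern is evaluated separately, and the path solutions are then joined on the shared branching node (A in this twig).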
Motivation
• Discrepancy between XML in RDB and conventional HTJ algorithms
  • Logical: streams vs. table
  • Physical: partitioned vs. record-oriented
  • Supporting actual data including a large volume of texts requires references to records.
  • How to feed tuples to the HTJ algorithm?
  • What's the best partitioning scheme for XML stored in RDB?
• Bitmap index, a conventional index in RDBMS
  • An efficient way to indicate tuples.
  • Efficient support for logical operations
  • Can we use the bitmap index for supporting HTJ?
HTJ on Different Partitioning Schemes
• Tag-based partitioning
  • Simple, and a skipping technique can be used to read useful elements only.
  • For a query node, only one stream is accessed.
• Tag+Level partitioning
  • Better I/O optimality, suitable for deep XML
  • Several streams may be accessed for a single query node.
• Path-based partitioning
  • Better I/O optimality, suitable for shallow XML
  • A path with //-axes may require accessing many streams for a single query node.
Bitmap Index
• How to partition tuples in the NODE table
  • By building a bitmap index on certain column(s) in the table.
  • bitTag for tagName, bitTag+ for (tagName, level), bitPath for the pathId column
• Determines I/O optimality of holistic twig join algorithms.
• During the twig join process, useful tuples are accessed via the bitmap index.
[Figure: bit-vectors pointing to tuples in disk blocks: A = 1100001000 . . ., B = 0010000100 . . ., E = 0000010000 . . .]
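Building such an index amounts to one pass over the table. The sketch below is a simplified illustration that uses Python strings as bit-vectors; the row list reproduces the (tag, level) pairs of the 16 example elements in document order, and the function name is hypothetical.

```python
from collections import defaultdict

# (tagName, level) of the 16 example elements, in document order.
NODES = [("A", 1), ("A", 2), ("B", 3), ("C", 3), ("D", 4), ("E", 4),
         ("A", 2), ("B", 3), ("E", 4), ("D", 3), ("C", 4),
         ("A", 2), ("B", 3), ("C", 4), ("D", 4), ("E", 3)]

def build_bitmap_index(rows, key):
    """Map each distinct key value to a bit-string with one bit per row."""
    index = defaultdict(lambda: ["0"] * len(rows))
    for pos, row in enumerate(rows):
        index[key(row)][pos] = "1"
    return {k: "".join(bits) for k, bits in index.items()}

bit_tag = build_bitmap_index(NODES, key=lambda r: r[0])      # like bitTag
bit_tag_level = build_bitmap_index(NODES, key=lambda r: r)   # like bitTag+

print(bit_tag["A"])
print(bit_tag_level[("E", 4)])
```

A bitPath index would be built the same way, with the pathId column as the key.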
Additional Indexes
• bitAnc: a bit-vector that represents the terminal elements corresponding to a certain path and all their ancestors.
• bitDesc: a bit-vector that represents the terminal elements corresponding to a certain path and all their descendants.
(a) bitPath, bitAnc, and bitDesc for PathId=2, i.e. /A/A/B (bit positions 0 to 15):
bitPath  0010000100001000
bitAnc   1110001100011000
bitDesc  0010000110001110
(b) A subtree covered by the left 3 bit-vectors: a1 (0), a2 (1), b1 (2), a3 (6), b2 (7), e2 (8), a4 (11), b3 (12), c3 (13), d3 (14)
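The three bit-vectors for /A/A/B can be recomputed from the example tree. This is a small sketch that assumes the tree is available as a parent array in document order (TAGS and PARENT are written out by hand from the example document, and the helper names are illustrative).

```python
TAGS = "AABCDEABEDCABCDE"   # tag of each element, positions 0 to 15
PARENT = [-1, 0, 1, 1, 3, 3, 0, 6, 7, 6, 9, 0, 11, 12, 12, 11]

def path_of(i):
    """Root-to-node path string such as '/A/A/B'."""
    steps = []
    while i != -1:
        steps.append(TAGS[i])
        i = PARENT[i]
    return "/" + "/".join(reversed(steps))

def to_bits(positions):
    return "".join("1" if i in positions else "0" for i in range(len(TAGS)))

def bit_vectors(path):
    """Return (bitPath, bitAnc, bitDesc) for one root-to-node path."""
    terminals = {i for i in range(len(TAGS)) if path_of(i) == path}
    anc = set(terminals)
    for i in terminals:
        while PARENT[i] != -1:      # walk up and mark every ancestor
            i = PARENT[i]
            anc.add(i)
    desc = set()
    for i in range(len(TAGS)):      # i is covered if it or an ancestor is a terminal
        j = i
        while j != -1:
            if j in terminals:
                desc.add(i)
                break
            j = PARENT[j]
    return to_bits(terminals), to_bits(anc), to_bits(desc)

bit_path, bit_anc, bit_desc = bit_vectors("/A/A/B")
print(bit_path, bit_anc, bit_desc)
```

The output matches the three bit-vectors shown in part (a) of the figure.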
Two Types of Indexes
• Basic index
  • Bit-vectors are built on a single column or a group of columns.
  • Requires labeled values and reading records.
• Hybrid index
  • A combination of two different indexes
  • descTag: bitDesc & bitTag
  • bitTwig: bitPath & bitAnc
  • Does not require labeled values to compute twig solutions.
Identifying Element Relationship with Bit-vectors
• For a query //A//B, can the pairs (a1, b1) and (a2, b2) be solutions?
[Figure: three bit-vectors over positions 0 to 15, with a1, a2, b1, and b2 marked:
P2: /A/A/B  1110001100011000
P0: /A      1000000000000000
P1: /A/A    1100001000010000
and the subtree a1 (0), a2 (1), b1 (2), a3 (6), b2 (7), a4 (11), b3 (12)]
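One possible reading of this check, sketched in Python: with positions in document order, an element a on path P can only be the ancestor of a later element d if no other element of P lies between them. This holds under two assumptions that the slide does not spell out: P is a proper prefix of d's path, and all elements of P sit at the same level. The function names are hypothetical, and the bitPath strings for /A and /A/A are the ones shown on the Advancing Cursors slide.

```python
def next_one(bits, pos):
    """Position of the first 1 strictly after pos, or None at end of vector."""
    nxt = bits.find("1", pos + 1)
    return None if nxt == -1 else nxt

def is_ancestor(anc_path_bits, pos_a, pos_d):
    """True if no other element of the ancestor's path lies in (pos_a, pos_d]."""
    if anc_path_bits[pos_a] != "1" or pos_a >= pos_d:
        return False
    nxt = next_one(anc_path_bits, pos_a)
    return nxt is None or nxt > pos_d

P0 = "1000000000000000"  # bitPath of /A
P1 = "0100001000010000"  # bitPath of /A/A

print(is_ancestor(P0, 0, 2))  # (a1, b1)
print(is_ancestor(P1, 1, 7))  # (a2, b2)
```

Under these assumptions (a1, b1) qualifies, while (a2, b2) does not because a3 at position 6 lies between a2 and b2.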
Advancing Cursors
• Choose the minimum position value among the current 1's as the current element for a query node.
• Check whether a 1 exists in the interval between pos(a) and pos(d) by looking ahead at the next 1.
[Figure: cursors for q : //A over positions 0 to 15:
P0 : /A    1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0  (Current1 = 0, Next1 = eov)
P1 : /A/A  0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0  (Current1 = 1, Next1 = 6)
Currq = 0, shown with the label (0,0,1)]
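The first rule can be sketched as a merge over several bit-vector cursors. The code below is an illustrative simplification (one Python generator per bit-vector, with hypothetical function names), not the authors' cursor implementation.

```python
def ones(bits):
    """Yield the positions of 1-bits in increasing order."""
    pos = bits.find("1")
    while pos != -1:
        yield pos
        pos = bits.find("1", pos + 1)

def stream_for_query_node(bit_vectors):
    """Repeatedly emit the smallest current position among all cursors."""
    cursors = [ones(b) for b in bit_vectors]
    current = [next(c, None) for c in cursors]
    while any(p is not None for p in current):
        i = min((p, k) for k, p in enumerate(current) if p is not None)[1]
        yield current[i]
        current[i] = next(cursors[i], None)   # None marks end of vector (eov)

P0 = "1000000000000000"  # /A
P1 = "0100001000010000"  # /A/A
positions = list(stream_for_query_node([P0, P1]))
print(positions)
```

For the query node q : //A this produces the elements of both paths as a single stream in position order.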
Optimizations
• Early detection with a bit-vector absence
• Condensing query nodes
  • For path-based partition
  • Reduces |INDEX| and |RECORD|
• Skipping reading obsolete records with advance(k)
  • For tag, (tag, level)-based partition
  • Reduces |RECORD|
• Moving cursors over compressed bit-vectors with no decompression
  • A composite cursor moving over a bit-vector compressed by a run-length encoding scheme
  • Reduces |INDEX|
[Figure: a twig with nodes A, B, C, E and its condensed form, P: //A/B/C; two bit-vectors 10000000000100000 and 00001000010000100 with cursors CA = 11 and CB = 4, so CB is moved by advance(11).]
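The advance(k) step from the figure can be sketched as follows, assuming that advance(k) moves a cursor to the first 1 at or after position k. The Python signature is an assumption, since the slide gives only the name.

```python
def advance(bits, k):
    """Return the position of the first 1 at or after k, or None if there is none."""
    pos = bits.find("1", k)
    return None if pos == -1 else pos

A_BITS = "10000000000100000"
B_BITS = "00001000010000100"

c_a = 11                       # cursor on A, as in the figure
c_b = 4                        # cursor on B lags behind the A cursor
c_b = advance(B_BITS, c_a)     # skip B elements that precede the current A
print(c_b)
```

The B elements at positions 4 and 9 are skipped without reading their records, which is how this optimization reduces |RECORD|.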
Compressed Bit-vector
(a) An original bit-vector with 8,000 bits:
000100000000100000000000000011 00000000000 . . . 00000000000000 0000000000000000000000000000001 00
(b) Grouping as a unit of 31 bits and merging identical groups: 31 bits | 256 * 31 bits | 31 bits | 2 bits
(c) Encoding each group as 1 word (4 bytes on a 32-bit machine):
000010…010…011   uncompressed word (31 literal bits)
100… 0100000000  compressed word (run-length is 256)
000…001          uncompressed word
000…000          remaining word
Cursor C = { C.position, // Integer position value (logical address)
             C.word,     // The current word C is located at
             C.bit,      // The position of the bit C is visiting, in C.word
             C.rest }    // The bit position in the remaining word
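Steps (b) and (c) can be sketched as a small encoder. The word layout used here (top bit as fill flag, next bit as fill value, low 30 bits as run length) is an assumption in the style of word-aligned run-length schemes; the slide does not give the exact layout, and run-length overflow is ignored.

```python
GROUP = 31  # bits per group

def compress(bits):
    """Return (words, rest): 32-bit words plus the trailing bits (< 31) left over."""
    full = len(bits) - len(bits) % GROUP
    words = []
    for i in range(0, full, GROUP):
        group = bits[i:i + GROUP]
        if group in ("0" * GROUP, "1" * GROUP):
            fill = int(group[0])
            last = words[-1] if words else 0
            # Extend the previous fill word when it has the same fill bit.
            if last >> 31 and (last >> 30) & 1 == fill:
                words[-1] += 1
            else:
                words.append((1 << 31) | (fill << 30) | 1)
        else:
            words.append(int(group, 2))   # literal word: flag bit 0 + 31 bits
    return words, bits[full:]

# An 8,000-bit vector shaped like the one in the figure.
first = "0001" + "0" * 25 + "11"
bits = first + "0" * (GROUP * 256) + "0" * 30 + "1" + "00"
words, rest = compress(bits)
print(len(bits), len(words), words[1] & 0x3FFFFFFF, rest)
```

The 8,000 bits shrink to three words plus a 2-bit remainder, with the middle word recording a run of 256 all-zero groups.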
Moving A Cursor over A Compressed Bit-vector
Words: 000010…010…011 | 100… 0100000000 (run-length is 256) | 000…001 | 000…000 (remaining word)
a) Get the position of the next 1
  C = {31, 0, 31, 0}
  Skip examining 31 * 256 bits
  C = {7998, 2, 31, 0}
b) Check a bit value at the position 3,000
  C = {31, 0, 31, 0} with distance to move 2,969 = (3000 - 31)
  Since 31 * 256 > 2,969, the bit we are looking for is within word 1.
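Both cursor operations can be sketched without decompressing anything. The code assumes the same hypothetical word layout as a word-aligned run-length scheme (top bit as fill flag, next bit as fill value, low 30 bits as run length), counts positions from 0, and leaves out the remaining word for brevity. If the slide counts positions from 1, its position 7,998 corresponds to 7997 here.

```python
GROUP = 31
FILL_FLAG = 1 << 31

def spans(words):
    """Yield (first_position, number_of_bits, word) for each word."""
    start = 0
    for w in words:
        n = (w & 0x3FFFFFFF) * GROUP if w & FILL_FLAG else GROUP
        yield start, n, w
        start += n

def bit_at(words, pos):
    """Value of the bit at 0-based position pos, without decompressing."""
    for start, n, w in spans(words):
        if pos < start + n:
            if w & FILL_FLAG:
                return (w >> 30) & 1
            return (w >> (GROUP - 1 - (pos - start))) & 1
    raise IndexError(pos)

def next_one(words, pos):
    """0-based position of the first 1 strictly after pos, or None."""
    for start, n, w in spans(words):
        if start + n <= pos + 1:
            continue                      # word lies entirely before the search start
        if w & FILL_FLAG:
            if (w >> 30) & 1:             # 1-fill: the first candidate position is a 1
                return max(start, pos + 1)
            continue                      # 0-fill: skip the whole run in one step
        for p in range(max(start, pos + 1), start + n):
            if (w >> (GROUP - 1 - (p - start))) & 1:
                return p
    return None

# Literal word, a 0-fill of 256 groups, and a literal word ending in 1.
WORDS = [(1 << 27) | 0b11, FILL_FLAG | 256, 1]
print(next_one(WORDS, 30), bit_at(WORDS, 2999))
```

The search for the next 1 crosses the 256-group fill in a single step, and the lookup near position 3,000 is answered from the fill word alone.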
Experiments
• Datasets
  • Synthetic: XMark
  • Real: DBLP, Treebank, Swiss-prot
• Query sets
Statistics of Dataset and Indexes
• The number of distinct paths varies widely across datasets
• The numbers of distinct tag names do not differ much
• Index build time is largely affected by attribute cardinality
• Index size is smaller than labeled value size in most cases
Query Execution Time
Input Data Size
Other Features on bitPath
• Merging the bit-vectors used for a path pattern with //-axes and putting the result into the bitmap index for the next time
  • For a given path //A//B: P:/A/A/B, P:/A/B
  • Acts like a pre-computed join index
  • A path pattern with //-axes can be represented by a single bit-vector.
• Logical operations: OR, NOT
  • Simply supported by bitwise logical operations: &, |, ^
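The merging idea can be sketched with a bitwise OR over every bitPath vector whose path matches the pattern. The regex translation and the function names are assumptions made for this example, and the small BIT_PATH dictionary holds a few paths of the example document written in root-to-leaf form.

```python
import re
from functools import reduce

BIT_PATH = {
    "/A":       "1000000000000000",
    "/A/A":     "0100001000010000",
    "/A/A/B":   "0010000100001000",
    "/A/A/E":   "0000000000000001",
    "/A/A/B/E": "0000000010000000",
    "/A/A/C/E": "0000010000000000",
}

def pattern_to_regex(pattern):
    """Translate a path pattern with / and // axes into a regex over path strings."""
    steps = re.findall(r"(//|/)([^/]+)", pattern)
    body = "".join(("(?:/[^/]+)*/" if axis == "//" else "/") + re.escape(tag)
                   for axis, tag in steps)
    return re.compile("^" + body + "$")

def merged_bit_vector(pattern, bit_path):
    """OR together the bitPath vectors of every path matching the pattern."""
    rx = pattern_to_regex(pattern)
    width = len(next(iter(bit_path.values())))
    merged = reduce(lambda acc, b: acc | int(b, 2),
                    (b for p, b in bit_path.items() if rx.match(p)), 0)
    return format(merged, f"0{width}b")

print(merged_bit_vector("//A//E", BIT_PATH))
```

Storing the merged vector under the pattern itself would let a later query with the same pattern reuse it, which is the sense in which it behaves like a pre-computed join index.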
Twig Queries with Logical Operations
• //A[./B/C or ./B/D]//E
  [Figure: twig A with branches B/(C|D) and E, rewritten with X in place of (C|D)]
  P//A, P//A//B//X ≡ P//A//B//C ∨ P//A//B//D, P//A//E
• //A[./B/not(C)]//E
  [Figure: twig A with branches B/¬C and E]
  P//A, P//A//E, P//A/B (Pⓧ //A/B ⊙A//A/B/C)
Conclusions
• We investigated the possibilities of bitmap indexes for XML query processing
  • Partitioning XML stored in RDB in various ways
  • Cursor movements do not require decompression of bit-vectors
• We devised a way to identify element relationship with only a bitmap index, bitTwig
• Our experiments showed that bitTwig was best for queries against shallow XML documents
• For deep XML documents, bitTag with advance(k) showed the best performance.
• Future work: evaluating our system with more HTJ algorithms and other indexes
Thanks! Questions?