Trie Indexes for Efficient XML Query Processing

Sofia Brenes, Yuqing Wu, Dirk Van Gucht, Pablo Santa Cruz

Indiana University, Bloomington
{sbrenesb, yuqwu, vgucht, psantacr}@cs.indiana.edu

XML and Queries – An Example

Query 1: //A/B/C
Query 2: //B/C
Query 3: //A/B[./D]/C
Query 4: //A[./B[./D]]/B/C

[Figure: example XML document tree with nodes A1, A2, B1-B5, C1-C4, D1]

Index and XML Query Evaluation

Challenges
Structure
◦ Data: containment relationship
◦ Query: pattern matching, (nested) predicates

Structural Indices for XML Data

Consider both value and structure

Index features: structural indices
Pure structural summaries: DataGuide, T-index
Local bi-similarity: A(k), UD(k,i), D(k), M(k)
Workload-aware: D(k), M(k), M*(k)
Encoded sequence: ViST, Index Fabric
Index chooser: XIST

Expected Features for an XML Index

Reasonable size
Easy to construct and adjust
Query evaluation
◦ Index-only plan for most queries

Outline

Introduction
Methodology
Partition induced by structural characteristics of XML
Partition induced by fragments of XPath Algebra
Coupling and Block Union Theorems
Trie Indices and Query Evaluation
Experimental Evaluation
Future Directions

Rewind – back to the world of RDB

RDBMS Theory

RDBMS Engineering Techniques

Our approach

Study the XML query language and its fragments
Study the indistinguishability of components in an XML document
Reason about existing XML indices
Design new XML indices

XML Data Model

Represent XML document D as a finite unordered node-labeled tree D = (V, Ed, r, λ)
Nodes: V
Edges: Ed
Root: r
Labels: λ

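Label Path

LP(m,n): the labels on the path from node m down to node n
◦ LP(m,n) = (A,B,C)
LP(n,k): the labels on the path of at most k edges that ends at node n
◦ LP(n,0) = (C)
◦ LP(n,1) = (B,C)
◦ LP(n,4) = (A,A,B,C)
◦ LP(n,7) = (A,A,B,C)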
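As a minimal sketch of the two label-path notions, the Python below assumes the example tree implied by the partition tables in these slides (A1 as root, with a child-to-parent map). The helper names `lp_k` and `lp_pair`, and the convention that a node's label is the first letter of its name, are illustrative assumptions rather than anything from the talk.

```python
# Child -> parent map for the example tree (an assumption inferred from the
# partition tables in the slides; A1 is the root).
PARENT = {
    "A2": "A1", "B1": "A1", "B4": "A1",
    "B2": "A2", "B3": "A2",
    "C1": "B1", "C2": "B2", "D1": "B2", "C3": "B3",
    "B5": "B4", "C4": "B5",
}


def label(node):
    """Label of a node: the leading letter of its name (illustrative convention)."""
    return node[0]


def lp_k(n, k):
    """LP(n, k): labels on the path of at most k edges that ends at node n."""
    labels = [label(n)]
    while k > 0 and n in PARENT:
        n = PARENT[n]
        labels.append(label(n))
        k -= 1
    return tuple(reversed(labels))


def lp_pair(m, n):
    """LP(m, n): labels from m down to n, or None if m is not an ancestor-or-self of n."""
    labels = [label(n)]
    while n != m:
        if n not in PARENT:
            return None  # reached the root without meeting m
        n = PARENT[n]
        labels.append(label(n))
    return tuple(reversed(labels))


if __name__ == "__main__":
    for k in (0, 1, 4, 7):
        print(k, lp_k("C2", k))
    print(lp_pair("A2", "C2"))
```

Because the walk stops at the root, any k at least as large as the depth of n yields the full root-to-node label path, which is why LP(n,4) and LP(n,7) coincide in the slide's example.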

N[k] Equivalence

Given an XML document and value k:

$n_1 \equiv_{N[k]} n_2 \iff LP(n_1, k) = LP(n_2, k)$

Examples: $B1 \equiv_{N[1]} B2$, but $B1 \not\equiv_{N[2]} B2$

N[k] Partition

$n_1 \equiv_{N[k]} n_2 \iff LP(n_1, k) = LP(n_2, k)$

N[1]-partition (label path: block)
(A): {A1}
(A,A): {A2}
(A,B): {B1, B2, B3, B4}
(B,B): {B5}
(B,C): {C1, C2, C3, C4}
(B,D): {D1}

N[1][(A,B)] = {B1, B2, B3, B4}
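The N[k]-partition can be computed by grouping nodes on LP(n, k). The following is a small sketch under the assumption that the example tree is the one implied by the partition tables in these slides; `n_partition` and `lp_k` are hypothetical helper names.

```python
from collections import defaultdict

# Child -> parent map for the example tree (an assumption inferred from the slides).
PARENT = {
    "A2": "A1", "B1": "A1", "B4": "A1",
    "B2": "A2", "B3": "A2",
    "C1": "B1", "C2": "B2", "D1": "B2", "C3": "B3",
    "B5": "B4", "C4": "B5",
}
NODES = ["A1"] + list(PARENT)


def lp_k(n, k):
    """LP(n, k): labels on the path of at most k edges ending at n."""
    labels = [n[0]]  # a node's label is the first letter of its name
    while k > 0 and n in PARENT:
        n = PARENT[n]
        labels.append(n[0])
        k -= 1
    return tuple(reversed(labels))


def n_partition(k):
    """Group nodes by LP(n, k); each group is one block of the N[k]-partition."""
    blocks = defaultdict(set)
    for n in NODES:
        blocks[lp_k(n, k)].add(n)
    return dict(blocks)


if __name__ == "__main__":
    for path, block in sorted(n_partition(1).items()):
        print(path, sorted(block))
```

Raising k refines the partition: with k = 2 the block for (A,B) splits into nodes reached by (A,B) and nodes reached by (A,A,B).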

P[k] Equivalence

Given an XML document and value k:

$(m_1, n_1) \equiv_{P[k]} (m_2, n_2) \iff LP(m_1, n_1) = LP(m_2, n_2)$, for pairs whose path has at most k edges

Examples: $(A1, C1) \equiv_{P[2]} (A2, C2)$, but $(A1, C2) \not\equiv_{P[3]} (A1, C4)$

P[k] Partition

P[1]-partition (label path: block)
(A): {(A1, A1), (A2, A2)}
(B): {(B1, B1), (B2, B2), (B3, B3), (B4, B4), (B5, B5)}
(C): {(C1, C1), (C2, C2), (C3, C3), (C4, C4)}
(D): {(D1, D1)}
(A,A): {(A1, A2)}
(A,B): {(A1, B1), (A2, B2), (A2, B3), (A1, B4)}
(B,B): {(B4, B5)}
(B,C): {(B1, C1), (B2, C2), (B3, C3), (B5, C4)}
(B,D): {(B2, D1)}

P[1][(A,A)] = {(A1, A2)}

P[k] Partition

P[2]-partition (label path: block)
(A), (B), (C), (D): self-pair blocks, e.g. {(A1, A1), (A2, A2)} for (A)
(A,A,B): {(A1, B2), (A1, B3)}
(A,B,B): {(A1, B5)}
(A,B,C): {(A1, C1), (A2, C2), (A2, C3)}
(A,B,D): {(A2, D1)}
(B,B,C): {(B4, C4)}

P[2][(A,B,C)] = {(A1, C1), (A2, C2), (A2, C3)}
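A P[k]-partition groups ancestor-descendant pairs (including self-pairs) by the label path between them. The sketch below is one way to enumerate those pairs, assuming the example tree implied by the tables in these slides; the function name `p_partition` is hypothetical.

```python
from collections import defaultdict

# Child -> parent map for the example tree (an assumption inferred from the slides).
PARENT = {
    "A2": "A1", "B1": "A1", "B4": "A1",
    "B2": "A2", "B3": "A2",
    "C1": "B1", "C2": "B2", "D1": "B2", "C3": "B3",
    "B5": "B4", "C4": "B5",
}
NODES = ["A1"] + list(PARENT)


def p_partition(k):
    """Group pairs (m, n), with m at most k edges above n, by the label path LP(m, n)."""
    blocks = defaultdict(set)
    for n in NODES:
        m, path = n, (n[0],)          # start with the self-pair (n, n)
        blocks[path].add((m, n))
        for _ in range(k):
            if m not in PARENT:       # reached the root
                break
            m = PARENT[m]
            path = (m[0],) + path     # extend the label path upward
            blocks[path].add((m, n))
    return dict(blocks)


if __name__ == "__main__":
    for path, block in sorted(p_partition(2).items(), key=lambda kv: (len(kv[0]), kv[0])):
        print(path, sorted(block))
```

Unlike the N[k]-partition, every label path of length 0 up to k gets its own block here, so the P[k]-partition keeps the start node of each path available for joins.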

XPath Algebra

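Path semantics

$\hat{\ell}(D) = \{(m, m) \mid m \in V \wedge \lambda(m) = \ell\}$

$E_1 \circ E_2(D) = \{(m, n) \mid \exists w : (m, w) \in E_1(D) \wedge (w, n) \in E_2(D)\}$

$\pi_1(E)(D) = \{(m, m) \mid \exists n : (m, n) \in E(D)\}$

[A further definition on this slide could not be recovered from the extracted text.]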

Node semantics

$E(D)[nodes] = \{n \mid \exists m : (m, n) \in E(D)\}$
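To make the two semantics concrete, here is a small evaluator sketch for a fragment of the algebra: label tests, the child step, composition, and the first projection used for predicates. It assumes the example tree implied by the tables in these slides, treats the child step ↓ as the edge relation, and reads a query such as //B/C as the composition "label B, child step, label C". The tuple-based expression encoding and the names `evaluate`, `nodes`, and `chain` are illustrative assumptions.

```python
# Child -> parent map for the example tree (an assumption inferred from the slides).
PARENT = {
    "A2": "A1", "B1": "A1", "B4": "A1",
    "B2": "A2", "B3": "A2",
    "C1": "B1", "C2": "B2", "D1": "B2", "C3": "B3",
    "B5": "B4", "C4": "B5",
}
NODES = ["A1"] + list(PARENT)
EDGES = {(parent, child) for child, parent in PARENT.items()}


def evaluate(expr):
    """Path semantics: the set of node pairs (m, n) selected by expr."""
    op = expr[0]
    if op == "label":   # label test: identity pairs on nodes with that label
        return {(m, m) for m in NODES if m[0] == expr[1]}
    if op == "down":    # child step (assumed to be the edge relation)
        return set(EDGES)
    if op == "comp":    # composition: join on the shared middle node
        left, right = evaluate(expr[1]), evaluate(expr[2])
        return {(m, n) for (m, w) in left for (w2, n) in right if w == w2}
    if op == "pi1":     # first projection: used to express predicates
        return {(m, m) for (m, _) in evaluate(expr[1])}
    raise ValueError(f"unknown operator: {op}")


def nodes(expr):
    """Node semantics: the end nodes of the selected pairs."""
    return {n for (_, n) in evaluate(expr)}


def chain(*steps):
    """Compose steps left to right."""
    expr = steps[0]
    for step in steps[1:]:
        expr = ("comp", expr, step)
    return expr


QUERY2 = chain(("label", "B"), ("down",), ("label", "C"))            # //B/C
HAS_D = ("pi1", chain(("down",), ("label", "D")))                    # [./D]
QUERY3 = chain(("label", "A"), ("down",), ("label", "B"), HAS_D,
               ("down",), ("label", "C"))                            # //A/B[./D]/C

if __name__ == "__main__":
    print(sorted(nodes(QUERY2)))
    print(sorted(evaluate(QUERY3)))
```

This direct evaluation touches the document itself; the indexes discussed in the talk aim to return the same answers from precomputed partition blocks instead.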

Fragments of XPath Algebra

D algebra: XPath algebra without ↑ and π1
D[ ] algebra: XPath algebra without ↑
D[k] algebra: D algebra up to length k
D[ ][k] algebra: D[ ] algebra up to length k

D[k] Equivalence

Given an XML document D, a value k, and (m1, n1), (m2, n2) in DownPairs(D):

$(m_1, n_1) \equiv_{D[k]} (m_2, n_2)$ if and only if, for any E in D[k],

$(m_1, n_1) \in E(D) \iff (m_2, n_2) \in E(D)$

Coupling Theorem

Let D be a document and k an integer.
◦ The P[k]-partition of D and the D[k]-partition of D are the same under the path semantics.
◦ The N[k]-partition of D and the D[k]-partition of D are the same under the node semantics.

k-Label-Path Set

The set of label paths of length k in an XML document that satisfy an XPath expression in algebra D.

Example: $LPS(E, 2) = \{(A,A,B), (A,B,B)\}$

Label-Union Theorem

Let D be a document, k an integer, and E a D[k] expression. Then there exists a class of partition blocks of the P[k]-partition (N[k]-partition) of D such that

$E(D) = \bigcup_{lp \in LPS(E, k)} P[k][lp]$

$E(D)[nodes] = \bigcup_{lp \in LPS(E, k)} N[k][lp]$

Query Evaluation Using Label-Union Theorem

N[2]-partition (label path: block)
(A): {A1}
(A,A): {A2}
(A,B): {B1, B4}
(A,A,B): {B2, B3}
(A,B,B): {B5}
(A,B,C): {C1, C2, C3}
(B,B,C): {C4}
(A,B,D): {D1}

Query 2: //B/C
LPS(E,2) = {(A,B,C), (B,B,C)}
Answer: N[2][(A,B,C)] ∪ N[2][(B,B,C)] = {C1, C2, C3, C4}
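The evaluation step above can be sketched in a few lines of Python. The sketch assumes the example tree implied by the tables in these slides and handles only simple child-axis queries of the form //l1/.../lj with at most k steps, for which the label-path set is taken to be the N[k] keys that end with the query's labels. That reading of LPS, and the names `lps` and `evaluate`, are assumptions made for illustration.

```python
from collections import defaultdict

# Child -> parent map for the example tree (an assumption inferred from the slides).
PARENT = {
    "A2": "A1", "B1": "A1", "B4": "A1",
    "B2": "A2", "B3": "A2",
    "C1": "B1", "C2": "B2", "D1": "B2", "C3": "B3",
    "B5": "B4", "C4": "B5",
}
NODES = ["A1"] + list(PARENT)


def lp_k(n, k):
    """LP(n, k): labels on the path of at most k edges ending at n."""
    labels = [n[0]]
    while k > 0 and n in PARENT:
        n = PARENT[n]
        labels.append(n[0])
        k -= 1
    return tuple(reversed(labels))


def n_partition(k):
    """Blocks of the N[k]-partition, keyed by label path."""
    blocks = defaultdict(set)
    for n in NODES:
        blocks[lp_k(n, k)].add(n)
    return dict(blocks)


def lps(query_labels, k):
    """N[k] keys that end with the query's label sequence (simple path queries only)."""
    q = tuple(query_labels)
    if len(q) - 1 > k:
        raise ValueError("query is longer than k steps; not answerable from N[k] alone")
    return {path for path in n_partition(k) if path[-len(q):] == q}


def evaluate(query_labels, k):
    """Answer the query as the union of the blocks selected by lps()."""
    partition = n_partition(k)
    result = set()
    for path in lps(query_labels, k):
        result |= partition[path]
    return result


if __name__ == "__main__":
    print(sorted(lps(("B", "C"), 2)))
    print(sorted(evaluate(("B", "C"), 2)))
```

No document node is visited at query time: the answer is assembled purely from precomputed blocks, which is the sense in which the plan is index-only.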

N[k]-Trie Index

Keep track of the N[k]-partitions
Use the reverse label path as key

[Figure: N[2]-trie over the example document, with blocks {A1}, {A2}, {B1, B4}, {B2, B3}, {B5}, {C1, C2, C3}, {C4}, {D1}]
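Keying the trie on the reversed label path means that a query's labels, read from last to first, form a prefix of every matching key. The sketch below illustrates that idea with a plain dictionary-based trie; it assumes the example tree implied by the tables in these slides and is not the TIMBER implementation. `TrieNode`, `build_trie`, and `lookup` are hypothetical names.

```python
from collections import defaultdict

# Child -> parent map for the example tree (an assumption inferred from the slides).
PARENT = {
    "A2": "A1", "B1": "A1", "B4": "A1",
    "B2": "A2", "B3": "A2",
    "C1": "B1", "C2": "B2", "D1": "B2", "C3": "B3",
    "B5": "B4", "C4": "B5",
}
NODES = ["A1"] + list(PARENT)


def n_partition(k):
    """Blocks of the N[k]-partition, keyed by label path."""
    blocks = defaultdict(set)
    for n in NODES:
        cur, labels = n, [n[0]]
        for _ in range(k):
            if cur not in PARENT:
                break
            cur = PARENT[cur]
            labels.append(cur[0])
        blocks[tuple(reversed(labels))].add(n)
    return dict(blocks)


class TrieNode:
    def __init__(self):
        self.children = {}   # label -> TrieNode
        self.block = set()   # partition block stored at this key, if any


def build_trie(partition):
    """Insert every block under its reversed label path."""
    root = TrieNode()
    for path, block in partition.items():
        node = root
        for lab in reversed(path):
            node = node.children.setdefault(lab, TrieNode())
        node.block |= block
    return root


def lookup(root, query_labels):
    """Descend along the reversed query labels, then union all blocks below that point."""
    node = root
    for lab in reversed(query_labels):
        node = node.children.get(lab)
        if node is None:
            return set()
    result, stack = set(), [node]
    while stack:
        cur = stack.pop()
        result |= cur.block
        stack.extend(cur.children.values())
    return result


if __name__ == "__main__":
    trie = build_trie(n_partition(2))
    print(sorted(lookup(trie, ("B", "C"))))
```

For //B/C the descent follows C then B, and the subtree below that point holds exactly the blocks for (A,B,C) and (B,B,C).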

Query Evaluation with N[k]-Trie Index

Query 1: //A/B/C
LPS(E,2) = {(A,B,C)}

Query Evaluation with N[k]-Trie Index

Query 2: //B/C
LPS(E,2) = {(A,B,C), (B,B,C)}

P[k]-Trie Index

Keep track of the P[k]-partitions
Use the reverse label path as key

P[2] (label path: block)
(A), (B), ...
(A,A,B): {(A1, B2), (A1, B3)}
(A,B,B): {(A1, B5)}
(A,B,C): {(A1, C1), (A2, C2), (A2, C3)}
(A,B,D): {(A2, D1)}
(B,B,C): {(B4, C4)}

Query Evaluation with P[k]-Trie Index

Query 1: //A/B/C

Query Evaluation with P[k]-Trie Index

Query 2: //B/C

Query Evaluation with P[k]-Trie Index

Query 3: //A/B[./D]/C
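Because P[k] blocks store (start, end) pairs, queries with predicates can be answered by joining blocks on shared nodes. The slides do not spell out the evaluation plan, so the decomposition below is only one plausible sketch: it assumes the example tree implied by the tables in these slides, uses a plain dictionary in place of the trie, and the names `p_partition`, `query3`, and `query4` are hypothetical.

```python
from collections import defaultdict

# Child -> parent map for the example tree (an assumption inferred from the slides).
PARENT = {
    "A2": "A1", "B1": "A1", "B4": "A1",
    "B2": "A2", "B3": "A2",
    "C1": "B1", "C2": "B2", "D1": "B2", "C3": "B3",
    "B5": "B4", "C4": "B5",
}
NODES = ["A1"] + list(PARENT)


def p_partition(k):
    """Blocks of the P[k]-partition: pairs (m, n) keyed by the label path LP(m, n)."""
    blocks = defaultdict(set)
    for n in NODES:
        m, path = n, (n[0],)
        blocks[path].add((m, n))
        for _ in range(k):
            if m not in PARENT:
                break
            m = PARENT[m]
            path = (m[0],) + path
            blocks[path].add((m, n))
    return dict(blocks)


def query3(p1):
    """//A/B[./D]/C from P[1] blocks: keep B nodes under an A that also have a D child."""
    b_under_a = {b for (_, b) in p1[("A", "B")]}
    b_with_d = {b for (b, _) in p1[("B", "D")]}
    kept_b = b_under_a & b_with_d
    return {c for (b, c) in p1[("B", "C")] if b in kept_b}


def query4(p2):
    """//A[./B[./D]]/B/C from P[2] blocks: filter (A,B,C) pairs by A nodes that start an (A,B,D) path."""
    a_ok = {a for (a, _) in p2[("A", "B", "D")]}
    return {c for (a, c) in p2[("A", "B", "C")] if a in a_ok}


if __name__ == "__main__":
    print(sorted(query3(p_partition(1))))
    print(sorted(query4(p_partition(2))))
```

Both plans read only index blocks. With k = 2, Query 4 needs a single join, because the nested predicate and the main path each fit inside one length-2 label path.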

Experimental Setup

Indices prototyped in the TIMBER system
Results reported on DBLP data
◦ 127M bytes
◦ 3.3M nodes

Index Sizes

Index Creation Time

Query Evaluation

//dblp/inproceedings/title/i/sub

Query Evaluation

//dblp/inproceedings[./title[./i]/sub]/ee

Conclusion

The P[k]-Trie index facilitates index-only plans for most queries, and it consistently and significantly outperforms the N[k]-Trie index and the A(k)-index.

A modest k value is sufficient for providing significant performance improvements.

Thanks!!

Questions?

Research Direction

Further study of query decomposition and inversion algorithms
Study workload-driven index creation
Develop other appropriate index structures
