Web Data Management
Indexes
In this lecture
• Indexes
– XSet
– Region algebras
– Indexes for Arbitrary Semistructured Data
– Dataguides
– T-indexes
– Index Fabric
Resources
• Index Structures for Path Expressions by Milo and Suciu, in ICDT'99
• XSet description: http://www.openhealth.org/XSet/
• Data on the Web by Abiteboul, Buneman, Suciu: section 8.2
The problem
• Input: large, irregular data graph
• Output: index structure for evaluating regular path expressions
The Data
Semistructured data instance = a large graph
The queries
Regular expressions (using Lorel-like syntax)
SELECT X FROM (Bib.*.author).(lastname|firstname).Abiteboul X
SELECT x FROM part._*.supplier.name x
Requires: traversing the data from the root and returning all nodes x reachable by a path matching the given path expression.
SELECT X FROM part._*.supplier: {name: X, address: “Philadelphia”}
Need index on values to narrow search to parts of the database that contain the string “Philadelphia”.
Analyzing the problem
• what kind of data
– tree data (XML): easier to index
– graph data: used in more complex applications
• what kind of queries
– restricted regular expressions (e.g. XPath): may be more efficient
– arbitrary regular expressions: rarely encountered in practice
XSet: a simple index for XML
• Part of the Ninja project at Berkeley
• Example XML data:
XSet: a simple index for XML
Each node = a hashtable
Each entry = list of pointers to data nodes (not shown)
XSet: Efficient query evaluation
• To evaluate R1, look for part in the root hash table h1, follow the link to table h2, then look for name.
• R4 – following part leads to h2; traverse all nodes in the index (corresponding to *), then continue with the path subpart.name.
• Thus, explore the entire subtree dominated by h2.
• Will be efficient if the index is small and fits in memory.
• R3 – the leading wildcard forces us to consider all nodes in the index tree, resulting in less efficient computation than for R4.
• Can index the index itself.
• Retrieve all hash tables that contain a supplier entry, then continue a normal search from there.
(R1) SELECT X FROM part.name X -yes
(R2) SELECT X FROM part.supplier.name X -yes
(R3) SELECT X FROM *.supplier.name X -maybe
(R4) SELECT X FROM part.*.subpart.name X -maybe
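The evaluation strategy for R1 to R4 can be sketched roughly as follows. This is a minimal illustration, not the actual XSet implementation: the class `HashNode`, the functions `build` and `evaluate`, and the toy data are all hypothetical, and `*` is assumed to match any sequence of zero or more labels.

```python
class HashNode:
    """One index node: a hash table from label to child index node."""
    def __init__(self):
        self.children = {}  # label -> HashNode
        self.extent = []    # data nodes reached by this label path

def build(index, subtree):
    """Index a tree given as {label: [child, ...]}; a child is a dict or a leaf value."""
    for label, children in subtree.items():
        child_index = index.children.setdefault(label, HashNode())
        for child in children:
            child_index.extent.append(child)
            if isinstance(child, dict):
                build(child_index, child)

def descendants(node):
    """Yield the node itself and every index node below it."""
    yield node
    for child in node.children.values():
        yield from descendants(child)

def evaluate(root, path):
    """Follow the labels in path; '*' expands to all index nodes in the subtree."""
    frontier = [root]
    for step in path:
        if step == "*":
            nxt = [d for n in frontier for d in descendants(n)]
        else:
            nxt = [n.children[step] for n in frontier if step in n.children]
        seen, frontier = set(), []
        for n in nxt:  # drop duplicates while keeping order
            if id(n) not in seen:
                seen.add(id(n))
                frontier.append(n)
    return [x for n in frontier for x in n.extent]

# Hypothetical toy data, not the example from the slides.
db = {"part": [{"name": ["bolt"],
                "supplier": [{"name": ["Acme"]}],
                "subpart": [{"name": ["nut"]}]}]}
h1 = HashNode()
build(h1, db)
r1 = evaluate(h1, ["part", "name"])
r3 = evaluate(h1, ["*", "supplier", "name"])
r4 = evaluate(h1, ["part", "*", "subpart", "name"])
```

In this sketch R1 touches two hash tables, while R3 and R4 enumerate a whole index subtree, which is why they are marked "maybe" above.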
Region Algebras
• structured text = text with tags (like XML)
• powerful indexing techniques [Baeza-Yates, Gonnet, Navarro, Salminen, Tompa, etc.]
• New Oxford English Dictionary
• critical limitation: ordered data only (like text)
• Assume: data given as an XML text file, and implicit ordering in the file.
• less critical limitation: restricted regular expressions
Region Algebras: Definitions
• data = sequence of characters [c1 c2 c3 …]
• region = segment of the text in a file
– representation (x,y) = [cx, cx+1, …, cy], x – start position, y – end position of the region
– example: <section> … </section>
• region set = a set of regions s.t. any two regions are either disjoint or one is included in the other
– example: all <section> regions (may be nested)
– tree data: each node defines a region and each set of nodes defines a region set
– example: region p2 consisting of the text under p2; the set {p2, s2, s1} is a region set with three regions
Representation of a region set
• Example: the <subpart> region set:
• region algebra = operators on region sets; s1 op s2 defines a new region set
Region algebra: some operators
• s1 intersect s2 = {r | r ∈ s1, r ∈ s2}
• s1 included s2 = {r | r ∈ s1, ∃ r´ ∈ s2, r ⊆ r´}
• s1 including s2 = {r | r ∈ s1, ∃ r´ ∈ s2, r ⊇ r´}
• s1 parent s2 = {r | r ∈ s1, ∃ r´ ∈ s2, r is a parent of r´}
• s1 child s2 = {r | r ∈ s1, ∃ r´ ∈ s2, r is a child of r´}
Examples:
<subpart> included <part> = { s1, s2, s3, s5}
<part> including <subpart> = {p2, p3}
<name> child <part> = {n1, n3, n12}
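The set-based operators can be written down almost literally from their definitions. The sketch below assumes regions are represented as (start, end) position pairs; the function names mirror the operator names, and the positions are made up for illustration rather than taken from the example data. The parent and child operators are left out because they additionally need the nesting structure.

```python
def intersect(s1, s2):
    """Regions that occur in both sets."""
    return sorted(set(s1) & set(s2))

def included(s1, s2):
    """Regions of s1 that lie inside some region of s2."""
    return [r for r in s1
            if any(q[0] <= r[0] and r[1] <= q[1] for q in s2)]

def including(s1, s2):
    """Regions of s1 that contain some region of s2."""
    return [r for r in s1
            if any(r[0] <= q[0] and q[1] <= r[1] for q in s2)]

# Hypothetical positions: two <part> regions and three <subpart> regions.
part = [(0, 100), (200, 300)]
subpart = [(10, 40), (20, 30), (150, 160)]

inside = included(subpart, part)      # subparts nested in some part
holders = including(part, subpart)    # parts that contain some subpart
```

This direct version compares every pair of regions, so it takes quadratic time; it is meant only to pin down the semantics.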
Efficient computation of Region Algebra Operators
Example: s1 included s2
s1 = {(x1,x1'), (x2,x2'), …}
s2 = {(y1,y1'), (y2,y2'), …}
(i.e. assume each consists of disjoint regions)
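Algorithm (both sets sorted by start position; start with i = j = 1 and repeat while both sets have regions left):
if xi < yj then i := i + 1
else if xi' > yj' then j := j + 1
otherwise: print (xi,xi'), do i := i + 1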
Can be done in sub-linear time when one of the region sets is very small
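A runnable sketch of the merge-style algorithm above, assuming both inputs are lists of (start, end) pairs that are pairwise disjoint and sorted by start position. The function name `included_merge` and the sample positions are illustrative only.

```python
def included_merge(s1, s2):
    """Return the regions of s1 that lie inside some region of s2.

    Assumes s1 and s2 each consist of disjoint regions sorted by start.
    """
    result = []
    i = j = 0
    while i < len(s1) and j < len(s2):
        x, x_end = s1[i]
        y, y_end = s2[j]
        if x < y:
            # s1[i] starts before s2[j], so no remaining region of s2 contains it
            i += 1
        elif x_end > y_end:
            # s1[i] ends after s2[j]; s2[j] cannot contain it or any later region
            j += 1
        else:
            # y <= x and x_end <= y_end: s1[i] is included in s2[j]
            result.append(s1[i])
            i += 1
    return result

s1 = [(1, 2), (5, 8), (12, 14), (18, 25), (30, 31)]
s2 = [(4, 10), (11, 15), (17, 20)]
hits = included_merge(s1, s2)
```

Each step advances i or j, so the loop runs in time linear in the total number of regions.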
From path expressions to region expressions
• Use region algebra operators to answer regular path expressions:
• Only restricted forms of regular path expressions can be translated into region algebra operators – expressions of the form R1.R2…Rn, where each Ri is either a label constant or the Kleene closure *.
Region expressions correspond to simple XPath expressions
Path expression          Region expression
part.name                name child (part child root)
part.supplier.name       name child (supplier child (part child root))
*.supplier.name          name child supplier
part.*.subpart.name      name child (subpart included (part child root))
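The four translations above follow a simple left-to-right pattern, which the following sketch makes explicit. It is an illustration inferred from those four examples, not a published translation procedure: `to_region_expr` is a hypothetical helper, and it assumes the path ends in a label rather than in `*`.

```python
def to_region_expr(path):
    """Translate a path R1.R2...Rn (labels or '*') into a region expression."""
    expr = "root"
    after_star = False
    for step in path.split("."):
        if step == "*":
            after_star = True
            continue
        if after_star and expr == "root":
            # a leading '*' places no constraint: every region lies under root
            expr = step
        else:
            op = "included" if after_star else "child"
            operand = f"({expr})" if " " in expr else expr
            expr = f"{step} {op} {operand}"
        after_star = False
    return expr

e1 = to_region_expr("part.name")
e3 = to_region_expr("*.supplier.name")
e4 = to_region_expr("part.*.subpart.name")
```

A label directly after another label becomes a `child` step, and a label after `*` becomes an `included` step.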
From path expressions to region expressions
• Answering more complex queries:
SELECT X FROM *.subpart: {name: X, *.supplier.address: “Philadelphia”}
• Translates into the following region algebra expression:
name child (subpart including (supplier parent (address intersect “Philadelphia”)))
• “Philadelphia” denotes a region set consisting of all regions corresponding to the word “Philadelphia” in the text.
• Such a region set can be computed dynamically using a full-text index.
• Region expressions correspond to simple XPath expressions
Indexes for Arbitrary Semistructured Data
• A semistructured data instance that is a DAG
Indexes for Arbitrary Semistructured Data
• The data represents employees and projects in a company.
• Two kinds of employees – programmers and statisticians
• Three kinds of links to projects – leads, workson, consultants
• Index graph – reduced graph that summarizes all paths from root in the data graph
• Example: node p1 – paths from root to p1 labeled with the following five sequences:
Project
Employee.leads
Employee.workson
Programmer.employee.leads
Programmer.employee.workson
• Node p2 – paths from root to p2 labeled by the same five sequences
• p1 and p2 are language-equivalent
Indexes for Arbitrary Semistructured Data
• For each node x in the data graph,
Lx = {w | ∃ a path from the root to x labeled w}
∀ x, y: x ≡ y ⟺ Lx = Ly
[x] = {y | x ≡ y}
The index I consists of:
Nodes(I) = {[x] | x ∈ nodes(G)}
Edges(I) = {[x] -a-> [y] | x ∈ [x], y ∈ [y], x -a-> y}
Indexes for Arbitrary Semistructured Data
• We have the following equivalences:
e1 ≡ e2
e3 ≡ e4 ≡ e5
p1 ≡ p2
p3 ≡ p4
p5 ≡ p6 ≡ p7
Indexes for Arbitrary Semistructured Data
• Computing path expression queries
– Compute query on I and obtain set of index nodes
– Compute union of all extents
SELECT X FROM statistician.employee.(leads|consults): X
• Returns nodes h8, h9.
• Their extents are [p5, p6, p7] and [p8], respectively;
• result set = [p5, p6, p7, p8]
• Always: size(I) ≤ size(G)
• Efficient when I can be stored in main memory
• Checking x ≡ y is expensive.
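The two evaluation steps (run the query on I, then take the union of the extents) might look like the sketch below. Only the index nodes h8 and h9 and their extents come from the example; the remaining node names (`root`, `hs`, `he`, `hp`) and the edges are hypothetical stand-ins, and the query is assumed to be given as one set of allowed labels per path step.

```python
def eval_on_index(index_edges, extents, root, steps):
    """Evaluate a path query on the index graph and return the union of extents.

    steps is a list of label sets, one per path step, e.g. {"leads", "consults"}
    for the alternation (leads|consults).
    """
    frontier = {root}
    for labels in steps:
        frontier = {t for (s, a, t) in index_edges
                    if s in frontier and a in labels}
    result = []
    for h in sorted(frontier):
        result.extend(extents.get(h, []))
    return result

# Hypothetical fragment of an index graph; only h8, h9 and their extents
# are taken from the example.
index_edges = [
    ("root", "statistician", "hs"),
    ("root", "programmer", "hp"),
    ("hs", "employee", "he"),
    ("he", "leads", "h8"),
    ("he", "consults", "h9"),
]
extents = {"h8": ["p5", "p6", "p7"], "h9": ["p8"]}

answer = eval_on_index(
    index_edges, extents, "root",
    [{"statistician"}, {"employee"}, {"leads", "consults"}],
)
```

The traversal touches only index nodes; the data graph is consulted solely through the stored extents.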
Indexes for Arbitrary Semistructured Data
Use bisimulation ≈b instead of ≡. Fact: ∀ x, y: x ≈b y ⟹ x ≡ y
Use the same construction, but [u] now refers to ≈b instead of ≡.
Bisimulation: Let DB be a data graph. A relation ≈ is a bisimulation on the reversed graph (i.e. all edges have their direction reversed) if the following conditions hold:
1. If x ≈ y and x is a root, then so is y.
2. Conversely, if x ≈ y and y is a root, then so is x.
3. If x ≈ y, then for any edge x´ -a-> x there exists an edge y´ -a-> y, s.t. x´ ≈ y´.
4. Conversely, if x ≈ y, then for any edge y´ -a-> y there exists an edge x´ -a-> x, s.t. x´ ≈ y´.
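One common way to compute such a relation is naive partition refinement: start with roots separated from non-roots and keep splitting blocks until incoming edges no longer distinguish any two nodes in a block. The sketch below assumes a single root and edges given as (source, label, target) triples; it is a simple quadratic illustration, not the efficient algorithm used in practice, and the toy graph is made up.

```python
def backward_bisimulation(nodes, edges, root):
    """Return a dict node -> block id for the coarsest bisimulation on the reversed graph."""
    block = {n: (n == root) for n in nodes}  # conditions 1 and 2: split off the root
    while True:
        # signature = current block plus the (label, block) pairs of incoming edges
        sig = {
            n: (block[n],
                frozenset((a, block[x]) for (x, a, y) in edges if y == n))
            for n in nodes
        }
        ids = {}
        new_block = {n: ids.setdefault(sig[n], len(ids)) for n in nodes}
        if len(set(new_block.values())) == len(set(block.values())):
            return new_block  # no block was split: conditions 3 and 4 hold
        block = new_block

# Hypothetical toy graph.
nodes = ["root", "e1", "e2", "p1", "p2", "p3"]
edges = [
    ("root", "employee", "e1"),
    ("root", "employee", "e2"),
    ("e1", "leads", "p1"),
    ("e2", "leads", "p2"),
    ("e1", "workson", "p3"),
]
blocks = backward_bisimulation(nodes, edges, "root")
```

The blocks play the role of the classes [u]; nodes in one block end up in the same index node.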
DataGuides
• Goldman & Widom [VLDB 97]
– graph data
– arbitrary regular expressions
DataGuides
Definition
given a semistructured data instance DB, a DataGuide for DB is a graph G s.t.:
- every path in DB also occurs in G
- every path in G occurs in DB
- every path in G is unique
Dataguides
Example:
DataGuides
• Multiple DataGuides for the same data:
DataGuides
Definition
Let w, w’ be two words (i.e. word queries) and G a graph
w ≡G w’ if w(G) = w’(G)
Definition
G is a strong dataguide for a database DB if ≡G is the same as ≡DB
DataGuides
Example:
• G1 is a strong dataguide
• G2 is not strong
person.project !DB dept.project
person.project !G2 dept.project
DataGuides
• Constructing the strong DataGuide G:
Nodes(G) = {{root}}
Edges(G) = ∅
while changes do
    choose s in Nodes(G), a in Labels
    add s’ = {y | x in s, (x -a-> y) in Edges(DB)} to Nodes(G) (if s’ is non-empty)
    add (s -a-> s’) to Edges(G)
• Use hash table for Nodes(G)
• This is precisely the powerset automaton construction.
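The construction above can be sketched as a worklist version of the powerset construction. The function name `strong_dataguide`, the edge-triple representation, and the small person/dept/project database are assumptions made for illustration; a Python set of frozensets stands in for the hash table on Nodes(G).

```python
from collections import deque

def strong_dataguide(edges, root):
    """Build the strong DataGuide of a data graph given as (x, label, y) triples."""
    out = {}  # x -> label -> set of targets
    for x, a, y in edges:
        out.setdefault(x, {}).setdefault(a, set()).add(y)

    start = frozenset([root])
    nodes = {start}          # each DataGuide node is a set of data nodes
    guide_edges = set()
    queue = deque([start])
    while queue:
        s = queue.popleft()
        by_label = {}
        for x in s:
            for a, ys in out.get(x, {}).items():
                by_label.setdefault(a, set()).update(ys)
        for a, ys in by_label.items():
            t = frozenset(ys)            # s' = {y | x in s, x -a-> y}
            guide_edges.add((s, a, t))   # edge s -a-> s'
            if t not in nodes:
                nodes.add(t)
                queue.append(t)
    return nodes, guide_edges

# Hypothetical toy database.
db_edges = [
    ("root", "person", "p1"),
    ("root", "person", "p2"),
    ("root", "dept", "d1"),
    ("p1", "project", "j1"),
    ("p2", "project", "j2"),
    ("d1", "project", "j1"),
]
g_nodes, g_edges = strong_dataguide(db_edges, "root")
```

In this toy database person.project reaches {j1, j2} while dept.project reaches only {j1}, so the two paths lead to different DataGuide nodes.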
DataGuides
• How large are the dataguides?
– if DB is a tree, then size(G) <= size(DB)
• why? answer: every node is in exactly one extent of G
• here: dataguide = XSet
– How many nodes does the strong dataguide have for this DB?
20 nodes (least common multiple of 4 and 5)
Dataguides usually fail on data with cyclic schemas, like: