Download - Web Data Management Indexes. In this lecture Indexes –XSet –Region algebras –Indexes for Arbitrary Semistructured Data –Dataguides –T-indexes –Index Fabric

Web Data Management

Indexes

In this lecture

• Indexes

– XSet

– Region algebras

– Indexes for Arbitrary Semistructured Data

– Dataguides

– T-indexes

– Index Fabric

Resources

• Index Structures for Path Expressions by Milo and Suciu, in ICDT'99

• XSet description: http://www.openhealth.org/XSet/

• Data on the Web by Abiteboul, Buneman, Suciu: section 8.2

The problem

• Input: large, irregular data graph

• Output: index structure for evaluating regular path expressions

The Data

Semistructured data instance = a large graph

The queries

Regular expressions (using Lorel-like syntax)

SELECT X
FROM (Bib.*.author).(lastname|firstname).Abiteboul X

SELECT x
FROM part._*.supplier.name x

Requires: traversing the data from the root and returning all nodes x reachable by a path that matches the given path expression.
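The traversal described above can be sketched as a breadth-first search over pairs (data node, position in the path expression). This is only an illustrative sketch under simplifying assumptions: the path expression is a list of steps, each step is either a label or the wildcard `_*`, and the names `evaluate`, `graph`, and `steps` are hypothetical, not taken from Lorel.

```python
from collections import deque

ANY_PATH = "_*"  # assumed wildcard: matches any sequence of edges, including none


def evaluate(graph, root, steps):
    """Return all nodes reachable from root by a path matching steps.

    graph: dict mapping a node to a list of (label, child) pairs.
    steps: list whose items are either a label or ANY_PATH.
    """
    n = len(steps)

    def closure(state):
        # A wildcard step may match the empty path, so it can be skipped.
        states = {state}
        while state < n and steps[state] == ANY_PATH:
            state += 1
            states.add(state)
        return states

    start = [(root, s) for s in closure(0)]
    seen = set(start)
    queue = deque(start)
    result = set()
    while queue:
        node, state = queue.popleft()
        if state == n:  # every step has been matched
            result.add(node)
            continue
        for label, child in graph.get(node, []):
            if steps[state] == ANY_PATH:
                next_states = closure(state)  # consume one edge, stay in the wildcard
            elif steps[state] == label:
                next_states = closure(state + 1)
            else:
                continue
            for s in next_states:
                if (child, s) not in seen:  # the seen set also guards against cycles
                    seen.add((child, s))
                    queue.append((child, s))
    return result
```

Without an index, this search may visit the whole graph, which is the cost the index structures in this lecture try to avoid.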

SELECT X
FROM part._*.supplier: {name: X, address: “Philadelphia”}

Needs an index on values to narrow the search to the parts of the database that contain the string “Philadelphia”.

Analyzing the problem

• what kind of data
– tree data (XML): easier to index
– graph data: used in more complex applications

• what kind of queries
– restricted regular expressions (e.g. XPath): may be more efficient
– arbitrary regular expressions: rarely encountered in practice

XSet: a simple index for XML

• Part of the Ninja project at Berkeley

• Example XML data:

XSet: a simple index for XML

Each node = a hashtable

Each entry = list of pointers to data nodes (not shown)

XSet: Efficient query evaluation

• To evaluate R1, look for part in the root hash table h1, follow the link to table h2, then look for name.

• R4: following part leads to h2; traverse all nodes in the index (corresponding to *), then continue with the path subpart.name.

• Thus, explore the entire subtree dominated by h2.

• Will be efficient if the index is small and fits in memory.

• R3: the leading wildcard forces us to consider all nodes in the index tree, resulting in less efficient computation than for R4.

• Can index the index itself.

• Retrieve all hash tables that contain a supplier entry, then continue a normal search from there.

(R1) SELECT X FROM part.name X -yes

(R2) SELECT X FROM part.supplier.name X -yes

(R3) SELECT X FROM *.supplier.name X -maybe

(R4) SELECT X FROM part.*.subpart.name X -maybe
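A minimal sketch of an XSet-like index, assuming the structure described above (a tree of hash tables whose entries point to data nodes). The class and function names, the tuple encoding of the data tree, and the sample document are all hypothetical; this is not the actual XSet implementation.

```python
class IndexNode:
    """One hash table of the index tree."""

    def __init__(self):
        self.children = {}  # label -> IndexNode
        self.extent = []    # data nodes reached by the label path to this table


def build_index(data_root):
    """data_root: (node_id, [(label, child), ...]) describing a data tree."""
    index_root = IndexNode()

    def visit(data_node, index_node):
        node_id, edges = data_node
        index_node.extent.append(node_id)
        for label, child in edges:
            visit(child, index_node.children.setdefault(label, IndexNode()))

    visit(data_root, index_root)
    return index_root


def _match(node, steps):
    if not steps:
        return set(node.extent)
    head, rest = steps[0], steps[1:]
    if head == "*":
        # '*' matches any label path: try the rest here and in every descendant table.
        found = _match(node, rest)
        for child in node.children.values():
            found |= _match(child, steps)
        return found
    child = node.children.get(head)
    return _match(child, rest) if child else set()


def lookup(index_root, path):
    """Evaluate a dotted path such as 'part.*.subpart.name' on the index."""
    return sorted(_match(index_root, path.split(".")))


# Hypothetical sample document: two parts, one with a supplier and a subpart.
doc = ("root", [
    ("part", ("p1", [
        ("name", ("n1", [])),
        ("supplier", ("s1", [("name", ("n2", []))])),
        ("subpart", ("p2", [("name", ("n3", []))])),
    ])),
    ("part", ("p3", [("name", ("n4", []))])),
])
index = build_index(doc)
print(lookup(index, "part.name"))            # R1: two hash-table lookups
print(lookup(index, "part.*.subpart.name"))  # R4: explores the subtree below part
```

Queries without wildcards cost one hash lookup per step, while each `*` forces a walk over the index subtree, which matches the yes/maybe annotations on R1 to R4.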

Region Algebras

• structured text = text with tags (like XML)

• powerful indexing techniques [Baeza-Yates, Gonnet, Navarro, Salminen, Tompa, etc.]

• New Oxford English Dictionary

• critical limitation: ordered data only (like text)

• Assume: data given as an XML text file, and implicit ordering in the file.

• less critical limitation: restricted regular expressions

Region Algebras: Definitions

• data = sequence of characters $[c_1 c_2 c_3 \ldots]$

• region = segment of the text in a file
– representation $(x, y) = [c_x, c_{x+1}, \ldots, c_y]$, where x is the start position and y the end position of the region
– example: <section> … </section>

• region set = a set of regions s.t. any two regions are either disjoint or one is included in the other
– example: all <section> regions (may be nested)
– tree data: each node defines a region, and each set of nodes defines a region set
– example: region p2 consists of the text under p2; the set {p2, s2, s1} is a region set with three regions

Representation of a region set

• Example: the <subpart> region set:

• region algebra = operators on region sets; s1 op s2 defines a new region set

Region algebra: some operators

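• s1 intersect s2 = $\{r \mid r \in s_1,\ r \in s_2\}$

• s1 included s2 = $\{r \mid r \in s_1,\ \exists r' \in s_2,\ r \subseteq r'\}$

• s1 including s2 = $\{r \mid r \in s_1,\ \exists r' \in s_2,\ r \supseteq r'\}$

• s1 parent s2 = $\{r \mid r \in s_1,\ \exists r' \in s_2,\ r \text{ is a parent of } r'\}$

• s1 child s2 = $\{r \mid r \in s_1,\ \exists r' \in s_2,\ r \text{ is a child of } r'\}$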

Examples:

<subpart> included <part> = { s1, s2, s3, s5}

<part> including <subpart> = {p2, p3}

<name> child <part> = {n1, n3, n12}
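The operator definitions can be written down almost literally as set comprehensions. The sketch below is a naive quadratic version meant only to pin down the semantics; it assumes regions are (start, end) tuples and that `parent` and `child` are decided relative to a hypothetical `universe` of all regions in the document. The positions in the usage lines are invented and do not correspond to the figure.

```python
def contains(outer, inner):
    """True if region inner lies within region outer (both are (start, end))."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def intersect(s1, s2):
    return {r for r in s1 if r in s2}


def included(s1, s2):
    return {r for r in s1 if any(contains(q, r) for q in s2)}


def including(s1, s2):
    return {r for r in s1 if any(contains(r, q) for q in s2)}


def is_parent(p, c, universe):
    """p directly encloses c: no other region of the universe lies in between."""
    if p == c or not contains(p, c):
        return False
    return not any(
        m not in (p, c) and contains(p, m) and contains(m, c) for m in universe
    )


def parent(s1, s2, universe):
    return {r for r in s1 if any(is_parent(r, q, universe) for q in s2)}


def child(s1, s2, universe):
    return {r for r in s1 if any(is_parent(q, r, universe) for q in s2)}


# Invented positions: a part containing a name and a subpart with its own name.
root, part, name1, subpart, name2 = (0, 100), (1, 50), (2, 5), (10, 40), (11, 15)
universe = {root, part, name1, subpart, name2}
print(child({name1, name2}, {part}, universe))  # only the name directly under part
print(included({name1, name2}, {part}))         # both names lie inside part
```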

Efficient computation of Region Algebra Operators

Example: s1 included s2

s1 = {(x1,x1'), (x2,x2'), …}

s2 = {(y1,y1'), (y2,y2'), …}

(i.e. assume each consists of disjoint regions)

Algorithm (repeat while both lists have regions left):

if xi < yj then i := i + 1

else if xi' > yj' then j := j + 1

otherwise: print (xi,xi'), do i := i + 1

Can do in sub-linear time when one of the region sets is very small
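The merge above translates directly into a short loop. This sketch assumes, as the slide does, that each input is a list of disjoint regions sorted by start position; the function name and the sample positions are made up for illustration.

```python
def included_merge(s1, s2):
    """Regions of s1 that lie inside some region of s2.

    Both inputs are lists of (start, end) pairs, sorted by start,
    and the regions within each list are pairwise disjoint.
    """
    i = j = 0
    result = []
    while i < len(s1) and j < len(s2):
        x_start, x_end = s1[i]
        y_start, y_end = s2[j]
        if x_start < y_start:
            i += 1  # s1[i] begins before s2[j]: it cannot lie inside s2[j] or any later region
        elif x_end > y_end:
            j += 1  # s1[i] reaches past s2[j]: move on to the next candidate container
        else:
            result.append((x_start, x_end))
            i += 1
    return result


subparts = [(2, 3), (5, 6), (12, 14), (20, 25)]
parts = [(1, 7), (10, 15)]
print(included_merge(subparts, parts))
```

Each step advances one of the two cursors, so the loop runs in time linear in the combined length of the lists. The sub-linear variant mentioned above would presumably replace the step-by-step advance with a binary search over the longer list.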

From path expressions to region expressions

• Use region algebra operators to answer regular path expressions:

• Only restricted forms of regular path expressions can be translated into region algebra operators – expressions of the form R1.R2…Rn, where each Ri is either a label constant or the Kleene closure *.

Region expressions correspond to simple XPath expressions

part.name              →  name child (part child root)
part.supplier.name     →  name child (supplier child (part child root))
*.supplier.name        →  name child supplier
part.*.subpart.name    →  name child (subpart included (part child root))
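The four translations above follow a simple pattern: a label that directly follows another label becomes a `child` step, and a label that follows `*` becomes an `included` step. The sketch below encodes that pattern; it is inferred from these four examples only, the function name is hypothetical, and a trailing `*` is deliberately left unsupported.

```python
def to_region_expression(path):
    """Translate a path R1.R2...Rn (labels or '*') into a region expression string."""
    steps = path.split(".")
    if steps[-1] == "*":
        raise ValueError("a trailing * is not handled in this sketch")
    expr = "root"
    after_star = False
    for step in steps:
        if step == "*":
            after_star = True
            continue
        if after_star and expr == "root":
            # A leading '*' means the label may occur anywhere in the document.
            expr = step
        else:
            op = "included" if after_star else "child"
            inner = f"({expr})" if " " in expr else expr
            expr = f"{step} {op} {inner}"
        after_star = False
    return expr


for p in ["part.name", "part.supplier.name", "*.supplier.name", "part.*.subpart.name"]:
    print(p, "->", to_region_expression(p))
```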

From path expressions to region expressions

• Answering more complex queries:

SELECT X
FROM *.subpart: {name: X, *.supplier.address: “Philadelphia”}

• Translates into the following region algebra expression:

name child (subpart including (supplier parent (address intersect “Philadelphia”)))

• “Philadelphia” denotes a region set consisting of all regions corresponding to the word “Philadelphia” in the text.

• Such a region set can be computed dynamically using a full text index.

• Region expressions correspond to simple XPath expressions

Indexes for Arbitrary Semistructured Data

• A semistructured data instance that is a DAG


• The data represents employees and projects in a company.

• Two kinds of employees: programmers and statisticians

• Three kinds of links to projects: leads, workson, consults

• Index graph: a reduced graph that summarizes all paths from the root in the data graph

• Example: node p1 is reached from the root by paths labeled with the following five sequences:

Project
Employee.leads
Employee.workson
Programmer.employee.leads
Programmer.employee.workson

• Node p2: the paths from the root to p2 are labeled by the same five sequences

• p1 and p2 are language-equivalent


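• For each node x in the data graph,

$L_x = \{w \mid \exists \text{ a path from the root to } x \text{ labeled } w\}$

$\forall x, y:\ x \equiv y \iff L_x = L_y$

$[x] = \{y \mid x \equiv y\}$

The index graph I consists of:

$\mathrm{Nodes}(I) = \{[x] \mid x \in \mathrm{nodes}(G)\}$

$\mathrm{Edges}(I) = \{[x] \xrightarrow{a} [y] \mid x' \in [x],\ y' \in [y],\ x' \xrightarrow{a} y'\}$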


• We have the following equivalences: $e1 \equiv e2$, $e3 \equiv e4 \equiv e5$, $p1 \equiv p2$, $p3 \equiv p4$, $p5 \equiv p6 \equiv p7$


• Computing path expression queries
– Compute the query on I and obtain a set of index nodes
– Compute the union of all their extents

• Example:

SELECT X
FROM statistician.employee.(leads|consults): X

• Returns nodes h8, h9.

• Their extents are [p5, p6, p7] and [p8], respectively

• result set = [p5, p6, p7, p8]

• Always: size(I) $\leq$ size(G)

• Efficient when I can be stored in main memory

• Checking $x \equiv y$ is expensive.
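The two-step evaluation (run the query on I, then take the union of the extents) can be sketched as follows. Only the nodes h8 and h9 and their extents come from the example; the rest of the index graph, including the node names `h_root`, `h_stat`, `h_emp`, `h_prog`, `h_emp2`, `h7` and the extent of `h7`, is invented for illustration because the figure is not reproduced here.

```python
def query_index(index_edges, extents, root, steps):
    """Evaluate a sequence of steps on the index, then collect the extents.

    index_edges: dict index node -> list of (label, target index node).
    extents: dict index node -> list of data nodes.
    steps: list of sets of allowed labels, e.g. {"leads", "consults"}.
    """
    frontier = {root}
    for allowed in steps:
        frontier = {
            target
            for node in frontier
            for label, target in index_edges.get(node, [])
            if label in allowed
        }
    result = set()
    for index_node in frontier:
        result |= set(extents.get(index_node, []))
    return frontier, result


# Hypothetical fragment of an index graph (only h8, h9 and their extents are from the slide).
index_edges = {
    "h_root": [("statistician", "h_stat"), ("programmer", "h_prog")],
    "h_stat": [("employee", "h_emp")],
    "h_prog": [("employee", "h_emp2")],
    "h_emp": [("leads", "h8"), ("consults", "h9")],
    "h_emp2": [("leads", "h7")],
}
extents = {"h8": ["p5", "p6", "p7"], "h9": ["p8"], "h7": ["p1", "p2"]}
steps = [{"statistician"}, {"employee"}, {"leads", "consults"}]
nodes, result = query_index(index_edges, extents, "h_root", steps)
print(sorted(nodes), sorted(result))
```

The query touches only index nodes; the data graph is consulted solely through the stored extents, which is why the approach pays off when I fits in main memory.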


Use bisimulation $\approx_b$ instead of $\equiv$. Fact: $\forall x, y:\ x \approx_b y \Rightarrow x \equiv y$

Use the same construction, but [u] now refers to $\approx_b$ instead of $\equiv$.

Bisimulation: Let DB be a data graph. A relation $\approx$ is a bisimulation on the reversed graph (i.e. all edges have their direction reversed) if the following conditions hold:

1. If $x \approx y$ and x is a root, then so is y.

2. Conversely, if $x \approx y$ and y is a root, then so is x.

3. If $x \approx y$, then for any edge $x' \xrightarrow{a} x$ there exists an edge $y' \xrightarrow{a} y$ s.t. $x' \approx y'$.

4. Conversely, if $x \approx y$, then for any edge $y' \xrightarrow{a} y$ there exists an edge $x' \xrightarrow{a} x$ s.t. $x' \approx y'$.
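One way to compute the coarsest relation satisfying these four conditions is naive partition refinement: start from the split root / non-root and keep splitting blocks until nodes in the same block have incoming edges with the same labels from the same blocks. The sketch below assumes the graph is given as a list of (source, label, target) triples; the function names and the sample graph are hypothetical, and faster algorithms exist than this simple loop.

```python
def backward_bisimulation(nodes, edges, root):
    """Return a dict node -> block id for the coarsest bisimulation on the reversed graph."""
    incoming = {n: [] for n in nodes}
    for source, label, target in edges:
        incoming[target].append((label, source))

    # Conditions 1 and 2: the root may only be equivalent to a root.
    block = {n: (n == root) for n in nodes}
    while True:
        # Conditions 3 and 4: compare the labeled incoming edges, block by block.
        signature = {
            n: (block[n], frozenset((label, block[p]) for label, p in incoming[n]))
            for n in nodes
        }
        ids = {}
        new_block = {n: ids.setdefault(signature[n], len(ids)) for n in nodes}
        if len(set(new_block.values())) == len(set(block.values())):
            return new_block  # no block was split: the partition is stable
        block = new_block


def extents(block):
    """Group the data nodes by block; each group is the extent of one index node."""
    groups = {}
    for node, b in block.items():
        groups.setdefault(b, set()).add(node)
    return list(groups.values())


# Hypothetical data graph: two employees, each leading a project, one also working on a third.
nodes = ["root", "e1", "e2", "p1", "p2", "p3"]
edges = [
    ("root", "employee", "e1"), ("root", "employee", "e2"),
    ("e1", "leads", "p1"), ("e2", "leads", "p2"), ("e1", "workson", "p3"),
]
print(extents(backward_bisimulation(nodes, edges, "root")))
```

Because each signature includes the node's current block, every round only splits blocks, so the loop stops as soon as the number of blocks stops growing.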

DataGuides

• Goldman & Widom [VLDB 97]
– graph data
– arbitrary regular expressions

DataGuides

Definition

Given a semistructured data instance DB, a DataGuide for DB is a graph G s.t.:

- every path in DB also occurs in G

- every path in G occurs in DB

- every path in G is unique

DataGuides

Example:

DataGuides

• Multiple DataGuides for the same data:

DataGuides

Definition

Let w, w’ be two words (i.e. word queries) and G a graph

$w \equiv_G w'$ if $w(G) = w'(G)$

Definition

G is a strong DataGuide for a database DB if $\equiv_G$ is the same as $\equiv_{DB}$

DataGuides

Example:

• G1 is a strong dataguide

• G2 is not strong

person.project $\not\equiv_{DB}$ dept.project

person.project $\equiv_{G2}$ dept.project

DataGuides

• Constructing the strong DataGuide G:

    Nodes(G) = {{root}}
    Edges(G) = ∅
    while changes do
        choose s in Nodes(G), a in Labels
        add s’ = {y | x in s, (x -a-> y) in Edges(DB)} to Nodes(G), if s’ is non-empty
        add (s -a-> s’) to Edges(G)

• Use a hash table for Nodes(G)

• This is precisely the powerset automaton construction.
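A runnable sketch of the construction above, using a worklist instead of the "while changes" loop and a Python set of frozensets as the hash table for Nodes(G). The adjacency-list encoding, the function name, and the sample database are assumptions made for this example.

```python
from collections import deque


def strong_dataguide(db_edges, root):
    """Build the strong DataGuide of a data graph.

    db_edges: dict node -> list of (label, target) pairs.
    Returns (nodes, edges): nodes is a set of frozensets of data nodes (the extents),
    edges maps (source set, label) to the target set.
    """
    start = frozenset([root])
    nodes = {start}  # hash table of the target sets created so far
    edges = {}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        targets_by_label = {}
        for x in s:
            for label, y in db_edges.get(x, []):
                targets_by_label.setdefault(label, set()).add(y)
        for label, targets in targets_by_label.items():
            t = frozenset(targets)
            edges[(s, label)] = t
            if t not in nodes:  # each distinct target set is expanded only once
                nodes.add(t)
                queue.append(t)
    return nodes, edges


# Hypothetical database: two persons and one department pointing to projects.
db = {
    "root": [("person", "p1"), ("person", "p2"), ("dept", "d1")],
    "p1": [("project", "j1")],
    "p2": [("project", "j2")],
    "d1": [("project", "j1")],
}
guide_nodes, guide_edges = strong_dataguide(db, "root")
print(len(guide_nodes))
```

In this sample, person.project and dept.project reach different sets of data nodes, so they lead to different DataGuide nodes. As with the powerset construction for automata, the number of target sets can grow well beyond the size of the database when the data is not a tree.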

DataGuides

• How large are the dataguides?

– if DB is a tree, then size(G) <= size(DB)
    • why? answer: every node is in exactly one extent of G
    • here: dataguide = XSet

– How many nodes does the strong dataguide have for this DB?

20 nodes (least common multiple of 4 and 5)

Dataguides usually fail on data with cyclic schemas, like:
