
Highly Heterogeneous XML Collections:

How to retrieve precise results?

Ismael Sanz (1), Marco Mesiti (2), Giovanna Guerrini (3), Rafael Berlanga Llavori (1)

(1) Universitat Jaume I, Castellón, Spain - {berlanga,Ismael.Sanz}@uji.es
(2) Università di Milano, Italy - [email protected]
(3) Università di Genova, Italy - [email protected]

Abstract. Highly heterogeneous XML collections are thematic collections exploiting different structures: the parent-child or ancestor-descendant relationships are not preserved and vocabulary discrepancies in the element names can occur. In this setting current approaches return answers with low precision. By means of similarity measures and semantic inverted indices we present an approach for improving the precision of query answers without compromising performance.

1 Introduction

Handling the heterogeneity of structure and/or content of XML documents for the retrieval of information is being widely investigated. Many approaches [1, 2, 9, 15] have been proposed to identify approximate answers to queries that allow structure and content condition relaxation. Answers are ranked according to quality and relevance scores and only the top-k results are returned. Current approaches consider the presence of optional and repeatable elements, the lack of required elements and some forms of relaxation of the parent-child relationship among elements. However, they do not cope with all the forms of heterogeneity that can occur in practice due to structure and content heterogeneity. Suppose we have a heterogeneous collection of documents about books. In the collection, information is organized either around authors (i.e., for each author the books he/she wrote) or around books themselves (i.e., for each book the list of its authors). Current approaches fail to find relevant solutions in this collection because they can relax structural constraints (i.e., book/author becomes book//author) but they are not able to invert the relationship (i.e., book/author cannot become author//book). In addition, few approaches [11, 13, 17] exploit an ontology or a string edit function for relaxing the exact tag name identification and thus allow the substitution of lexemes author and book with synonyms (e.g., volume and composition for book, writer and creator for author) or with a similar string (e.g., mybook for book and theAuthor for author).

In this paper we deal with structural queries (named patterns) and we model them as graphs in which different kinds of constraints on the relationships among nodes (parent-child/ancestor-descendant/sibling) and on node tags (syntactic and semantic tag similarity) can be enforced or relaxed. The identification of the answers in the XML collection (target) is realized by the identification of regions in the target in which nodes are similar enough to the pattern, and similarity measures that evaluate and rank the obtained regions. This change of approach has the effect that the efficiency of traditional approaches is compromised, but the precision is increased. Since performance is very relevant, we introduce indexing structures for easily identifying regions. A semantic inverted index is built on the target whose entries are the stems of the tags in the target.

Ranked tree-matching approaches have been proposed for dealing with the NP-completeness of tree inclusion problems [7]. These approaches, instead of generating all (exponentially many) candidate subtrees, return a ranked list of “good enough” matches. In [14] query results are ranked according to a cost function through a dynamic programming algorithm, in [1] intermediate query results are filtered dynamically during evaluation through a data pruning algorithm, while ATreeGrep [16] is based on an exact matching algorithm, but a fixed number of “differences” in the result is allowed. Our approach also returns a ranked list of “good enough” subtree matches, but it is highly flexible since it allows choosing the most appropriate structural similarity measure according to the application semantics. Moreover, our approach also includes approximate label matching, which allows dealing with heterogeneous tag vocabularies. Starting from the user query, approaches for structural and content scoring of XML documents [1, 2, 9, 15] generate query relaxations that preserve the ancestor-descendant relationships. These relaxations do not consider the actual presence of such a structure in the documents. In our approach, by exploiting the indexing structures, only the variations to the pattern that occur in the target are considered. Moreover, the ancestor-descendant relationships imposed by the query can be reversed. Work from the Information Retrieval (IR) area is also concerned with XML document retrieval [8], by representing documents with known IR models. These approaches mainly focus on the textual content of documents, disregarding their structure. Our approach extends in different directions the work proposed in [11] where patterns are expressed as trees stating a preferred hierarchical structure for the answers to be retrieved. In this paper, by contrast, patterns are graphs in which different kinds of constraints on the relationships among nodes and on node tags can be expressed. Graph based flexible XML queries have been proposed in [4]. However, no ad-hoc structure for evaluating them efficiently is devised there.

In the remainder, Section 2 introduces patterns, targets, and regions. Section 3 discusses similarity evaluation while region construction is presented in Section 4. Experiments are discussed in Section 5. Section 6 concludes.

2 Patterns, Targets, and Regions

We first define flexible queries, then the tree representation of XML document collections and finally the document portions that match a pattern.

Patterns as Labelled Graphs. When dealing with heterogeneous collections, users need to express their information requests through a wide range of approximate queries. In our approach, patterns are provided for the specification of approximate structural and tag conditions that documents or portions of them must satisfy as much as possible. In the simplest form, a pattern is a set of labels for which a “preference” is specified on the hierarchical or sibling order in which such labels or similar labels should occur. Fig. 1(a) shows an example of this simplest form of pattern. Document portions that match such a pattern can contain elements with similar tags in a different order and bound by different descendant/sibling relationships. Some of the required elements can be missing.

Fig. 1. Pattern with (a) no constraints (P1), (b) constraints on tags (P2), (c) constraints on tags and structure (P3). Each pattern contains the nodes book, author, name, and editors.

Tighter constraints can be specified by setting stricter conditions on element names and on relationships among nodes. For example, Fig. 1(b) shows the pattern in Fig. 1(a) in which the book and author elements in the document portions should exactly be book and author, that is, synonyms are not allowed (plain ovals). As another example, Fig. 1(c) shows a combination of structural and tag constraints. The author element must be a child of the book element (plain arrow from book to author), author must be an ancestor of the name element (double line arrow) and right sibling of the editor element (double line arrow tagged s).

These are only a few examples of the patterns that a user can create by specifying our constraints. Constraints are categorized as follows (a graphical representation is in Table 1): (i) descendant constraints DC, described in Table 1 (rows 1 to 5); (ii) same level constraints SC, described in Table 1 (rows 6 to 10); (iii) tag constraints TC, described in Table 1 (rows 11 and 12).

Definition 1. (Pattern). A pattern $P = (V, E_D, E_S, C_{E_D}, C_{E_S}, C_V, label)$ is a directed graph where $V$ is a set of nodes, $label$ is the node labelling function, $E_D$ and $E_S$ are two sets of edges representing the descendant or same level relationship between pairs of nodes, respectively, and $C_{E_D}$, $C_{E_S}$, and $C_V$ are functions that associate with edges in $E_D$, edges in $E_S$, and nodes in $V$, respectively, a constraint in DC, SC, and TC.

#    Edge/Node repr.          Constraint description
1    edge u, v                dc1(u, v) = true
2    edge u, v                dc2(u, v) = true if u is father of v or v is father of u
3    edge u, v                dc3(u, v) = true if u is father of v
4    edge u, v                dc4(u, v) = true if u is ancestor of v or v is ancestor of u
5    edge u, v                dc5(u, v) = true if u is ancestor of v
6    edge u, v tagged s       sc1(u, v) = true
7    edge u, v tagged s       sc2(u, v) = true if u is sibling of v
8    edge u, v tagged s       sc3(u, v) = true if u is left (right) sibling of v
9    edge u, v tagged s       sc4(u, v) = true if u is in the same level of v
10   edge u, v tagged s       sc5(u, v) = true if u precedes (follows) v in the same level
11   node v with label l      tc1(v) = true if v is labelled by l or a label similar to l
12   node v with label l      tc2(v) = true if v is labelled exactly by l

Table 1. Pattern constraints

Target and its Semantic Inverted Index. The target is a collection of heterogeneous XML documents, conveniently represented as a labelled tree¹ whose root is labelled db and whose subelements are the documents in the collection. An example of target is shown in Fig. 2(a). The distance Dist(u, v) between two nodes u, v in a tree is specified as the number of nodes traversed moving from u to v in the pre-order traversal. The nearest common ancestor nca(u, v) between two nodes u, v in a tree is the common ancestor of u and v whose distance to u (and to v) is smaller than the distance to u of any other common ancestor of u and v. Two labels are similar if they are identical, or synonyms relying on a given Thesaurus, or syntactically similar relying on a string edit function [18]. Let l1, l2 be two labels, l1 ≃ l2 iff: (1) l1 = l2, or (2) l1 is a synonym of l2, or (3) l1 and l2 are syntactically similar. Given a label l and a set of labels L, we introduce the operator similarly belongs, ∝: l ∝ L iff ∃n ∈ L such that l ≃ n.

¹ A tree T = (V, E) is a structure s.t. V = V(T) is a finite set of nodes, root(T) ∈ V is the tree root and E is a binary relation on V with the known restrictions on E that characterize a tree.

A semantic inverted index is coupled with the target. The index entries are the stems of the element tags occurring in the documents ordered according to their pre-order traversal. Each entry contains the list of tags syntactically or semantically similar to the entry stem. For each node v, the list contains the 4-tuple (pre(v), post(v), level(v), P(v)), representing the pre/post order ranking and level of v in the tree and its parent node. A node is identified by pre(v). Fig. 2(b) depicts the inverted index for the target in Fig. 2(a). For the sake of graphical readability, the parent of each vertex is not represented (only a · is reported). We remark that the two elements pages and page belong to the same entry because they share the stem (page); elements author and writer belong to the same entry because they are semantically similar relying on WordNet.
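The construction of such an index can be sketched in a few lines of Python. The tree literal `TARGET` is an assumption: it is reconstructed from the (pre, post, level) tuples that appear in Fig. 4, with the db root given pre rank 0 and excluded from the index. The function `stem` is a toy stand-in (plural stripping plus a one-entry synonym table) for a real stemmer and a WordNet lookup.

```python
from collections import defaultdict

# Assumed shape of the target in Fig. 2(a), written as (label, children).
TARGET = ("db", [
    ("book", [("author", [("name", [])]), ("title", []), ("editor", [])]),
    ("author", [("name", []), ("book", [("title", []), ("pages", [])])]),
    ("dblp", [("name", []),
              ("writer", [("book", [("title", []), ("page", [])]),
                          ("article", [])])]),
])

SYNONYMS = {"writer": "author"}  # hypothetical stand-in for WordNet


def stem(label):
    """Toy stemmer: drop a plural 's', then map synonyms to one entry."""
    base = label[:-1] if label.endswith("s") and len(label) > 3 else label
    return SYNONYMS.get(base, base)


def build_index(tree):
    """Map each entry stem to a pre-ordered list of (pre, post, level, parent)."""
    index = defaultdict(list)
    counters = {"pre": 0, "post": 1}

    def visit(node, level, parent_pre):
        label, children = node
        pre = counters["pre"]
        counters["pre"] += 1
        for child in children:
            visit(child, level + 1, pre)
        post = counters["post"]       # assigned once all descendants are done
        counters["post"] += 1
        if level > 0:                 # the db root itself is not indexed
            index[stem(label)].append((pre, post, level, parent_pre))

    visit(tree, 0, None)
    return {entry: sorted(nodes) for entry, nodes in index.items()}


INDEX = build_index(TARGET)
```

Under these assumptions `INDEX["page"]` holds both the pages and the page element, and `INDEX["author"]` holds the two author elements together with writer.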

Fig. 2. (a) Tree representation of a collection of documents and (b) its inverted index, with entries article, author, book, dblp, name, page, title, and editor

Regions. Regions are the target portions containing an approximate match for the pattern. Different interpretations can be given to constraint violations that lead to different strategies for region construction.

Strategy 1: regions that do not meet one of the specified constraints should be eliminated. In this case, when we detect during region construction that a constraint is violated, the corresponding region is pruned. This reduces the number of regions to be checked.

Strategy 2: constraints in regions can be violated; violations, however, penalize the region in similarity evaluation. This approach requires considering constraints after having determined regions.

Choosing one strategy over the other depends on the specific characteristics of the target application. In our approach we support both, and let the system designer decide which is most appropriate in each case. In what follows, we formalize the second strategy, since it is more flexible.

Regions in the collection that are similar to a pattern are identified in two steps. First, ignoring the constraints specified in the pattern and exploiting the target inverted index, we identify the subtrees of the target, named fragments, in which nodes with labels similar to those in P appear. Nodes in fragments are those nodes in the target with labels similar to those in the pattern. Two nodes u, v belong to the same fragment F for a pattern P iff their labels, as well as the label of their nearest common ancestor, similarly belong to the labels of the pattern.

Definition 2. (Fragment). A fragment $F$ of a target $T = (V_T, E_T)$ for a pattern $P$ is a subtree $(V_F, E_F)$ of $T$ for which the following properties hold:

– $V_F$ is the maximal subset of $V_T$ such that $root(T) \notin V_F$ and $\forall u, v \in V_F$, $label(u), label(v), label(nca(u, v)) \propto label(V(P))$;
– For each $v \in V_F$, $nca(root(F), v) = root(F)$;
– $(u, v) \in E_F$ if $u$ is an ancestor of $v$ in $T$, and there is no node $w \in V_F$, $w \neq u, v$, such that $w$ is in the path from $u$ to $v$.

Several fragments can be identified in a target. Fragments might be combined together when they are close and their combination can lead to a subtree more closely meeting the constraints expressed by the pattern. Regions are thus constructed by introducing (when required) a common unmatching ancestor.

Fig. 3. Identification of different mappings between the pattern (book, author, name, editor) and the regions R1, R2, and R3

Example 1. Consider as T the tree in the target in Fig. 2(a) whose label is dblp. Its left subtree contains the element name, whereas the right subtree contains the elements writer and book. T could have a higher similarity with the pattern tree in Fig. 1(a) than its left or right subtrees.

Definition 3. (Regions). Let $F_P(T)$ be the set of fragments of a target $T$ for a pattern $P$. The corresponding set of regions $R_P(T)$ is defined as follows.

– $F_P(T) \subseteq R_P(T)$;
– For each $F = (V_F, E_F) \in F_P(T)$ and for each $R = (V_R, E_R) \in R_P(T)$ such that $label(nca(root(F), root(R))) \neq db$, $S = (V_S, E_S) \in R_P(T)$, where: $root(S) = nca(root(F), root(R))$, $V_S = V_F \cup V_R \cup \{root(S)\}$, $E_S = E_F \cup E_R \cup \{(root(S), root(F)), (root(S), root(R))\}$.

Fig. 3 contains the three regions R1, R2, R3 obtained for the pattern in Fig. 1(a). Regions R1 and R2 are simple fragments, whereas region R3 is a combination of two fragments. We remark that the three regions present a different structure with respect to the one specified in the pattern.

3 Structural and Constraint based Similarity Evaluation

In this section we first identify a mapping between the nodes in the pattern and the nodes in the region having similar labels. Then, by means of a similarity measure, the hierarchical structure of nodes and the pattern constraints are used to rank the resulting regions.

Mapping between a Pattern and a Region. A mapping between a pattern and a region is a relationship among their elements that takes the tags used in the documents into account. Our definition relies on our second strategy and only requires that the element labels are similar.

         SimM                            SimL                              SimD
         author  book  editor  name      author   book  editor  name      author   book  editor  name
P1/R1    1       1     1       1         1        1     1       1         1        1     1       1
P1/R2    1       1     0       1         1/3      1/3   0       1/3       1/4      1/4   0       1/4
P1/R3    1 − δ   1     0       1         1/3 − δ  1/3   0       1/3       3/4 − δ  1/4   0       1/4

Table 2. Different node similarities between pattern P1 and the regions in Fig. 3

Definition 4. (Mapping M). Let $P$ be a pattern, and $R$ a region subtree of a target $T$. A mapping $M$ is a partial injective function between the nodes of $P$ and those of $R$ such that $\forall x_p \in V(P)$, $M(x_p) \neq \bot \Rightarrow label(x_p) \simeq label(M(x_p))$.

The three patterns in Fig. 1 lead to an analogous mapping to the target in Fig. 2(a). The difference in score is determined by the following similarity measures.

Similarity between Matching Nodes. Three approaches have been devised. In the first approach, the similarity depends on node labels and on the presence of tag constraints in the pattern. Similarity is 1 if labels are identical, whereas a pre-fixed penalty δ is applied if labels are only similar and the tag constraint specified in the pattern is verified by the matching node in the region. If the tag constraint is not verified, similarity is 0. In the second approach, the match-based similarity is combined with the evaluation of the level at which xp and M(xp) appear in the pattern and region structure, respectively. Whenever they appear in the same level, their similarity is equal to the similarity computed by the first approach. Otherwise, their similarity linearly decreases as the number of levels of difference increases. Since two nodes can be in the same level, but not in the same position, a third approach is introduced. Relying on the depth-first traversal of the pattern (where descendant edges are traversed before sibling edges) and the region, the similarity is computed by taking the distance of nodes xp and M(xp) with respect to their roots into account. Thus, in this case, the similarity is the highest only when the two nodes are in the same position in the pattern/region.

Definition 5. (Similarity between Matching Nodes). Let $P$ be a pattern, $R$ be a region in a target $T$, $x_p$ a node of $P$, $tc$ a tag constraint associated with $x_p$, and $x_r = M(x_p)$. Their similarity can be computed as follows:

1. Match-based similarity:
$$Sim_M(x_p, x_r) = \begin{cases} 1 & \text{if } label(x_p) = label(x_r) \\ 1 - \delta & \text{if } label(x_p) \simeq label(x_r),\ tc(x_r) = true \\ 0 & \text{otherwise} \end{cases}$$

2. Level-based similarity:
$$Sim_L(x_p, x_r) = Sim_M(x_p, x_r) - \frac{|level_P(x_p) - level_R(x_r)|}{\max(level(P), level(R))}$$

3. Distance-based similarity:
$$Sim_D(x_p, x_r) = Sim_M(x_p, x_r) - \frac{|d_P(x_p) - d_R(x_r)|}{\max(d^{max}_P, d^{max}_R)}$$

In the last two cases the similarity is 0 if the obtained value is below 0.
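The three measures of Definition 5 translate directly into code. The sketch below is illustrative only: the function and parameter names are invented here, the default penalty `delta=0.1` is an arbitrary choice, and the outcome of the label similarity test and of the tag constraint are passed in as booleans rather than computed.

```python
def sim_m(label_p, label_r, labels_similar, tag_constraint_ok, delta=0.1):
    """Match-based similarity: 1, 1 - delta, or 0."""
    if label_p == label_r:
        return 1.0
    if labels_similar and tag_constraint_ok:
        return 1.0 - delta
    return 0.0


def sim_l(base, level_p, level_r, depth_p, depth_r):
    """Level-based similarity: penalise the level gap, normalised by the
    larger of the pattern and region depths; never below 0."""
    return max(0.0, base - abs(level_p - level_r) / max(depth_p, depth_r))


def sim_d(base, d_p, d_r, dmax_p, dmax_r):
    """Distance-based similarity: same idea, using the distance of each
    node from its root in the depth-first traversal; never below 0."""
    return max(0.0, base - abs(d_p - d_r) / max(dmax_p, dmax_r))
```

For instance, an exact label match whose levels differ by one, in a pattern of depth 3 and a region of depth 2, gets `sim_l(1.0, 2, 1, 3, 2)`, that is 1 − 1/3.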

       P1                            P2                       P3
       SimM      SimL      SimD      SimM   SimL   SimD      SimM   SimL   SimD
R1     1         1         1         1      1      1         1      1      1
R2     3/4       1/4       3/16      3/4    1/4    3/16      2/3    1/3    7/24
R3     (3−δ)/4   (1−δ)/4   (3−δ)/16  2/4    1/6    1/8       1/3    1/9    1/12

Table 3. Evaluation of similarity among patterns and regions

Example 2. Table 2 reports the similarity of nodes of pattern P1 in Fig. 1(a) with the corresponding nodes in the three regions, relying on the proposed similarity measures.

Similarity of a Region w.r.t. a Pattern. Once the similarity has been evaluated on the basis of the matching nodes and the tag constraints in the pattern, the constraints on the ancestor-descendant and sibling edges are considered as specified in the following definition. In order to obtain an evaluation of the mapping in the range [0,1] we first add to the evaluation of the nodes the number of edge constraints specified in the pattern that are met by the region. The obtained value is divided by the sum of the number of nodes in the pattern and the number of edge constraints specified in the pattern.

Definition 6. (Evaluation of a Mapping M). Let $M$ be a mapping between a pattern $P = (V, E_D, E_S, C_{E_D}, C_{E_S}, C_V, label)$ and a region $R = (V_R, E_R)$, and $Sim$ one of the similarity measures of Definition 5. Let

– $M_V = \sum_{x_p \in V : M(x_p) \neq \bot} Sim(x_p, M(x_p))$ be the evaluation of the mapping nodes between the two structures,
– $EC = \{(x_p, x_q) \in E_D \cup E_S \mid M(x_p) \neq \bot, M(x_q) \neq \bot\}$ be the set of edges in the pattern for which the nodes of the edges occur in the region,
– $VEC = \{(x_p, x_q) \in EC \mid Con \in C_{E_D} \cup C_{E_S},\ Con((x_p, x_q))(M(x_p), M(x_q)) = true\}$² be the edges in $EC$ for which the corresponding constraints are verified.

The evaluation of $M$ is:
$$Eval(M) = \frac{M_V + |VEC|}{|V| + |EC|}$$

Once mappings have been evaluated, the similarity between a pattern and a region can be defined as the maximal evaluation so obtained.

Definition 7. (Similarity between a Pattern and a Region). Let $\mathcal{M}$ be the set of mappings between a pattern $P$ and a region $R$. Their similarity is defined as:
$$Sim(P, R) = \max_{M \in \mathcal{M}} Eval(M)$$

Example 3. Consider the situation described in Example 2. The similarities between the three patterns and the three regions computed according to the node similarity measures of Definition 5 are in Table 3.
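As a small check of Definitions 6 and 7, the following sketch computes Eval(M) with exact fractions. The function names are invented for illustration, and the example assumes that no edge constraint is counted for P1 (so |EC| = |VEC| = 0). Under that assumption, the node similarities listed for P1/R2 in Table 2 yield the P1/R2 entries of Table 3.

```python
from fractions import Fraction


def eval_mapping(node_sims, n_pattern_nodes, n_edges_mapped=0, n_edges_verified=0):
    """Eval(M) = (M_V + |VEC|) / (|V| + |EC|).

    node_sims holds Sim(x_p, M(x_p)) for the mapped pattern nodes; unmapped
    nodes simply contribute nothing to the sum.
    """
    m_v = sum(node_sims, Fraction(0))
    return (m_v + n_edges_verified) / (n_pattern_nodes + n_edges_mapped)


def pattern_region_similarity(evaluations):
    """Sim(P, R): the best evaluation over all candidate mappings."""
    return max(evaluations)


# P1 has four nodes (author, book, editor, name); values taken from Table 2.
sim_m_p1_r2 = eval_mapping([1, 1, 0, 1], 4)
sim_l_p1_r2 = eval_mapping([Fraction(1, 3), Fraction(1, 3), 0, Fraction(1, 3)], 4)
sim_d_p1_r2 = eval_mapping([Fraction(1, 4), Fraction(1, 4), 0, Fraction(1, 4)], 4)
```

The three results are 3/4, 1/4, and 3/16. When edge constraints are present, each mapped edge adds one to the denominator and each verified one adds one to the numerator, so the value stays within [0,1].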

² $Con((x_p, x_q))$ identifies a constraint $c$ associated with the edge $(x_p, x_q)$; $c$ is applied to the corresponding edge in the region.

Fig. 4. (a) Pattern index, (b) fragments, and (c) a generated region. Nodes are annotated with (pre, post, level) and with their constraints:
(a) author(2,2,2), author(6,10,1), writer(13,16,2): tc2, dc3, dc5, sc5; book(1,5,1), book(8,9,2), book(14,14,3): tc1, dc3; editor(5,4,2): tc1, sc5; name(3,1,3), name(7,6,2), name(12,11,2): tc1, dc5.
(b) F1: book(1,5,1), author(2,2,2), name(3,1,3), editor(5,4,2); F2: author(6,10,1), name(7,6,2), book(8,9,2); F3: name(12,11,2); F4: writer(13,16,2), book(14,14,3).
(c) R: dblp(11,17,1), name(12,11,2), writer(13,16,2), book(14,14,3).

4 Region Construction

There are two main challenges in region construction. First, only the information contained in the target index should be exploited and all the operations should be performed in linear time. Second, the evaluation of each pattern constraint should be computed in constant time. Region construction is realized in two steps: fragment construction and fragment merging. Fragment construction (detailed in the remainder of the section) is realized through the use of an index, named pattern index, that is computed on the fly on the basis of a pattern P and the semantic inverted index of a target T. Fragment merging (detailed in [12]) in a single region is performed when, relying on the adopted similarity function, the similarity of the pattern with the region is higher than the similarity with the individual fragments. In these evaluations the pattern constraints are considered and the following heuristic principle is exploited: only adjacent fragments can be merged, since only in this case can the region have a higher similarity than each single fragment.
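One way to obtain constant-time constraint checks from the index tuples alone is the standard pre/post-order containment test: u is an ancestor of v exactly when pre(u) < pre(v) and post(u) > post(v). The sketch below applies it to the constraints of Table 1. It is an illustration under stated assumptions, not the paper's implementation: the names are invented, "left sibling" is read as "any preceding sibling", and only the left/precedes variants of sc3 and sc5 are shown.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    pre: int
    post: int
    level: int
    parent: Optional[int]  # pre rank of the parent node


def is_ancestor(u: Node, v: Node) -> bool:
    return u.pre < v.pre and u.post > v.post


def is_father(u: Node, v: Node) -> bool:
    return v.parent == u.pre


def is_sibling(u: Node, v: Node) -> bool:
    return u.pre != v.pre and u.parent == v.parent


# Each check uses a fixed number of integer comparisons.
CONSTRAINTS = {
    "dc1": lambda u, v: True,
    "dc2": lambda u, v: is_father(u, v) or is_father(v, u),
    "dc3": is_father,
    "dc4": lambda u, v: is_ancestor(u, v) or is_ancestor(v, u),
    "dc5": is_ancestor,
    "sc1": lambda u, v: True,
    "sc2": is_sibling,
    "sc3": lambda u, v: is_sibling(u, v) and u.pre < v.pre,      # left variant
    "sc4": lambda u, v: u.pre != v.pre and u.level == v.level,
    "sc5": lambda u, v: u.level == v.level and u.pre < v.pre,    # precedes variant
}

# Nodes of the first document, with tuples as in Fig. 4 (parents assumed).
book = Node(1, 5, 1, 0)
author = Node(2, 2, 2, 1)
name = Node(3, 1, 3, 2)
editor = Node(5, 4, 2, 1)
```

For example, `CONSTRAINTS["dc3"](book, author)` and `CONSTRAINTS["dc5"](author, name)` both hold, while author precedes editor among the children of book, so a constraint requiring author to be a right sibling of editor would be violated in this document.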

Pattern Index. Given a pattern P, for every node v in P, all occurrences of nodes u in the target tree such that label(v) ≃ label(u) are retrieved, and organized level by level in a pattern index. Each node in the pattern index is coupled with the tuple (TC, DC, SC, VTC, VDC, VSC), where TC, DC, SC are sets that contain, respectively, the tag, descendant, and same level constraints specified for the matching node in the pattern (note that constraints on edges are reported in both nodes). Moreover, VTC, VDC, VSC are sets that will contain the verified constraints. In the creation of the pattern index the TC and VTC sets can be filled in, whereas the other constraints should be evaluated during fragment and region construction. The number of levels of the pattern index depends on the levels in T in which nodes occur with labels similar to those in the pattern. For each level, nodes are ordered according to the pre-order rank. Fig. 4(a) contains the pattern index for the pattern in Fig. 1(c) evaluated on the target in Fig. 2(a). Tag constraints have been evaluated and those verified circled, whereas the violated ones have been overlined.

Topic         DIST   PCH    SIGN    SIBL
Weather       1.0    100%   0%      93%
Stock         0.9    74%    11.6%   67%
Address       0.89   63%    0%      59%
Credit Card   0.5    75%    0%      59%

Table 4. Heterogeneity Measures for the selected ASSAM topics

Identification of Fragments from the Pattern Index. Once the pattern index is generated, the fragments are obtained through a visit of the target structure. Moreover, for each node presenting a constraint, the constraint is checked. Each node v in the first level of the pattern index is the root of a fragment because, considering the way we construct the pattern index, no other nodes can be the ancestor of v. Possible descendants of v can be identified in the underlying levels whereas sibling nodes can be adjacent in the same level. Given a generic level l of the pattern index, a node v can be a root of a fragment iff for none of the nodes u in previous levels, v is a descendant of u. If v is a descendant of a set of nodes U, v is considered the child of the node u ∈ U such that Dist(v, u) is minimal. Descendant and same level constraints can be evaluated when a node is attached to a fragment.

Our algorithm visits each node in the pattern index only once by marking in each level the nodes already included in a fragment. Its complexity is thus linearly proportional to the number of nodes in the pattern index. Fig. 4(b) illustrates fragments F1, . . . , F4 obtained from the pattern index of Fig. 4(a). Fragments F3 and F4 are then merged in the region R in Fig. 4(c).
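The level-by-level rule can be sketched as follows, using the (pre, post, level) tuples shown in Fig. 4(a). This is a deliberately simple quadratic version written for clarity, not the linear single-pass algorithm described above; the function name and the output format (root pre rank mapped to the member pre ranks) are choices made here. It relies on the observation that, among the matching ancestors of a node, the one at minimal pre-order distance is the one with the largest pre rank.

```python
# Pattern index nodes as (label, pre, post, level), read from Fig. 4(a).
PATTERN_INDEX = [
    ("book", 1, 5, 1), ("author", 6, 10, 1),
    ("author", 2, 2, 2), ("editor", 5, 4, 2), ("name", 7, 6, 2),
    ("book", 8, 9, 2), ("name", 12, 11, 2), ("writer", 13, 16, 2),
    ("name", 3, 1, 3), ("book", 14, 14, 3),
]


def build_fragments(nodes):
    """Group pattern-index nodes into fragments, keyed by the root's pre rank."""
    placed, parent_of = [], {}
    # Visit levels top-down and, inside a level, in pre-order.
    for node in sorted(nodes, key=lambda n: (n[3], n[1])):
        _, pre, post, _ = node
        ancestors = [a for a in placed if a[1] < pre and a[2] > post]
        if ancestors:
            # Attach to the closest matching ancestor (largest pre rank).
            parent_of[pre] = max(a[1] for a in ancestors)
        placed.append(node)  # nodes without a matching ancestor are roots

    fragments = {}
    for _, pre, _, _ in nodes:
        root = pre
        while root in parent_of:
            root = parent_of[root]
        fragments.setdefault(root, []).append(pre)
    return {root: sorted(members) for root, members in fragments.items()}


FRAGMENTS = build_fragments(PATTERN_INDEX)
```

On this input the sketch yields four groups rooted at the nodes with pre ranks 1, 6, 12, and 13, which correspond to F1, F2, F3, and F4 in Fig. 4(b).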

5 Preliminary Experiments

To evaluate our approach we developed a prototype using the Berkeley DB library and tested the system with patterns expressed in the simplest form (that is, elements only bound by the ancestor-descendant relationship and tag similarity allowed). The performance was tested using synthetic collections with varying degrees of structural variations, ranging from $10^5$ to $10^7$ nodes and queries yielding result sets ranging between 7500 and $3 \times 10^5$ nodes; in every case the performance was shown to be linearly dependent on the size of the result set.

To evaluate the impact of constraint inclusion, as well as the different similarity measures for patterns and regions, we have tested our approach with the ASSAM dataset (http://moguntia.ucd.ie/repository/datasets/). This is a small collection (117 XML documents and 778 terms) that contains the conceptual schemas of a series of available public Web Services. These schemas present a high heterogeneity in both tag names and schema structures. For the evaluation, we have selected four topics from this dataset, namely: weather, stock index, address specifications and credit card information. For each topic, we manually assessed the sets of relevant results in order to compute precision and recall values. For each topic we have generated a set of patterns randomly created by specifying a set of parameters (number of nodes, number of constraints, probability to identify regions for such a pattern).

Fig. 5. Precision/Recall results for the selected topics wrt. the number of constraints included: (a) Weather, (b) Stock, (c) Address, (d) Card. In each panel the horizontal axis is the number of extra restrictions (0 to 9) and the vertical axis is Precision/Recall (0.0 to 1.0). The error bars are drawn at a 95% confidence level

To show the relevance of the proposed patterns in our context, we characterize each pattern with an estimate of the degree of heterogeneity of the result set. For all the parent/child relations (u, v) of the pattern generation tree, we calculate the global average distance (DIST) between u and v in all the database regions where they occur. We also calculate the percentage of these regions where (u, v) are actually in a parent/child relationship (PCH), and the percentage of regions where (u, v) inverts its direction (SIGN). For all the sibling relations (u, v) of the pattern generation tree, we calculate the percentage of fragments where they are actually siblings (SIBL). Table 4 shows these measures for the evaluation topics. Notice that some average distances (DIST) can be less than one. This is because some pairs of pattern labels can appear together in the same target tag (i.e., distance 0). Fig. 5 shows the impact of constraint inclusion in patterns when using the similarity function SimL. In the experiments we have measured the precision and recall considering the assigned target regions in the generation pattern tree. Notice that despite the heterogeneity degree of the selected patterns, we obtain good results for precision with respect to the case with 0 constraints. The exception is the topic “Address”, which includes very ambiguous terms that appear frequently in other topics (e.g., zip, street, etc.). This issue could be solved by using a more semantic-aware label distance measure.

6 Conclusions and Future Work

In this paper an approach for structure-based retrieval in highly heterogeneous XML collections has been devised. The peculiarity of the approach is that it does not rely on the document hierarchical organization. In order to deal with the high number of possible matchings, a semantic indexing structure and a similarity measure for pruning irrelevant results are exploited. Moreover, search patterns can contain constraints on element tags, descendant and sibling relationships. This allows the user to be aware of some (but not all) structural constraints in documents. Preliminary experimental results show the effectiveness of the proposed patterns to retrieve relevant document portions, though further research is required to define new similarity functions to handle the hardest cases.

As future work, we plan to include content conditions in our framework. In a heterogeneous environment like the one we focus on, content conditions can be specified both on the leaves and the internal nodes of the pattern. Moreover, we wish to consider the application of our techniques for the computation of approximate structural joins in highly heterogeneous XML collections and for subtree identification in heterogeneous XML Schema collections.

References

1. Amer-Yahia, S., et al.: Tree Pattern Relaxation. EDBT. (2002) 496–513.
2. Amer-Yahia, S., et al.: Structure and Content Scoring for XML. VLDB. (2005).
3. Buneman, P., et al.: Adding Structure to Unstructured Data. ICDT. (1997).
4. Damiani, E., Tanca, L.: Blind Queries to XML Data. DEXA. (2000) 345–356.
5. Grust, T.: Accelerating XPath Location Steps. SIGMOD. (2002) 109–120.
6. Kanza, Y., Sagiv, Y.: Flexible Queries Over Semistructured Data. PODS. (2001).
7. Kilpeläinen, P.: Tree Matching Problems with Applications to Structured Text Databases. PhD thesis, University of Helsinki (1992).
8. Luk, R.W., et al.: A Survey in Indexing and Searching XML Documents. JASIS 53 (2002) 415–438.
9. Marian, A., et al.: Adaptive Processing of Top-k Queries in XML. ICDE. (2005).
10. Nierman, A., Jagadish, H.V.: Evaluating Structural Similarity in XML Documents. WebDB. (2002) 61–66.
11. Sanz, I., et al.: Approximate Subtree Identification in Heterogeneous XML Documents Collections. Xsym. LNCS(3671) (2005) 192–206.
12. Sanz, I., et al.: Highly Heterogeneous XML Collections: How to find “good” results?. TR University of Genova, 2006.
13. Schenkel, R., et al.: Ontology-Enabled XML Search. LNCS(2818) (2003) 119–131.
14. Schlieder, T., Naumann, F.: Approximate Tree Embedding for Querying XML Data. In: ACM SIGIR Workshop on XML and IR. (2000).
15. Schlieder, T.: Schema-Driven Evaluation of Approximate Tree Pattern Queries. EDBT. LNCS(2287). (2002) 514–532.
16. Shasha, D., et al.: ATreeGrep: Approximate Searching in Unordered Trees. In: 14th Conf. on Scientific and Statistical Database Management. (2002) 89–98.
17. Theobald, A., Weikum, G.: The Index-Based XXL Search Engine for Querying XML Data with Relevance Ranking. EDBT. LNCS(2287). (2002) 477–495.
18. Wagner, R.A., Fischer, M.J.: The String-to-string Correction Problem. J. of the ACM 21 (1974) 168–173.