
A Cluster-Based Approach to XML Similarity Joins

Leonardo A. Ribeiro∗
University of Kaiserslautern, Germany
AG DBIS, Department of Computer Science
[email protected]

Theo Härder
University of Kaiserslautern, Germany
AG DBIS, Department of Computer Science
[email protected]

Fernanda S. Pimenta†
UFRGS – Institute of Informatics
Porto Alegre, Brazil
[email protected]

ABSTRACT

A natural consequence of the widespread adoption of XML as a standard for information representation and exchange is the redundant storage of large amounts of persistent XML documents. Compared to relational data tables, data represented in XML format can potentially be even more sensitive to data quality issues because structure, besides textual information, may cause variations in XML documents representing the same information entity. Therefore, correlating XML documents that are similar in content and structure is a fundamental operation. In this paper, we present an effective, flexible, and high-performance XML-based similarity join framework. We exploit structural summaries and clustering concepts to produce compact and high-quality XML document representations: our approach outperforms previous work both in terms of performance and accuracy. In this context, we explore different ways to weigh and combine evidence from textual and structural XML representations. Furthermore, we address user interaction, when the similarity framework is configured for a specific domain, and updatability of clustering information, when new documents enter the datasets under consideration. We present a thorough experimental evaluation to validate our techniques in the context of a native XML DBMS.

Categories and Subject Descriptors

H.2.m [Database Management]: Miscellaneous—Entity Resolution; I.5 [Pattern Recognition]: Clustering—Similarity measures

General Terms

Design, Experimentation

∗ Supported by CAPES/Brazil under grant BEX1129/04-0.
† Work done at the University of Kaiserslautern, Germany.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
IDEAS 2009, September 16-18, Cetraro, Calabria [Italy]
Editor: Bipin C. DESAI
Copyright © 2009 ACM 978-1-60558-402-7/09/09 $5.00.

Keywords

similarity joins, XML, similarity measures, entity resolution, xml databases, clustering

1. INTRODUCTION

The quality of data stored in databases inevitably degrades over time. Common reasons for this behavior include data entry errors, such as typographical errors and misspellings, incomplete information, and varying conventions during the integration of multiple data sources storing overlapping information. A common consequence of poor data quality is the appearance of so-called fuzzy duplicates, i.e., multiple and non-identical representations of a real-world entity. Complications caused by such redundant information abound in common business practices, e.g., misleading downstream analysis due to erroneously inflated statistics, inability to correlate information related to the same customer, and repeated mailing operations. The problem of identifying fuzzy duplicates is commonly referred to as the entity resolution (ER) problem (also known as record linkage or near-duplicate identification).

As a natural consequence of the widespread adoption of XML as a standard for information representation and exchange, large amounts of persistent XML documents are redundantly stored in various places. Therefore, data represented in XML format potentially influences and jeopardizes the data quality delivered to applications even more than in the relational world, because structure, besides textual information, may cause variations in XML documents representing the same information entity. Due to the much greater modeling flexibility of the XML model, even documents containing the same information and following the same schematic rule (i.e., a DTD or XML Schema) may be arranged in quite different structures. Moreover, because mapping flexibility and support of schema evolution are prime motivations for using XML, structural changes are likely to happen very often in many application domains.

Example 1. Consider the illustration in Figure 1, which shows an XML tree containing patient demographic information from a hospital. Although subtrees a) and b) refer to the same patient, it would be extremely difficult for an ER application to correctly identify them as fuzzy duplicates. The data in subtree a) is arranged by patient, while the data of subtree b) is arranged by study. Further, there is the extra element relatives in subtree a). Moreover, there are typos, and different abbreviation conventions are used in the content of the two subtrees.

[Figure 1: Heterogeneous XML data collection. Two subtrees below a hospital root: in subtree a), the data is arranged by patient (name "Bob", mother "Alice" under relatives, and a study with id "232" and description "Image CT"); in subtree b), it is arranged by study (id "222", description "CT Image", and a patient with name "Rob" and mother "Alic").]

Similarity join is a fundamental operation to support ER applications. This particular kind of join operation employs a predicate of type sim(r, s) ≥ γ, where r and s are the operand objects to be matched, sim is a similarity function, and γ is a constant threshold. In typical ER processes, pairs satisfying a similarity join can be classified as potential duplicates and selected for later analysis.

1.1 Specific XML-related Requirements

To perform similarity joins on XML datasets, the similarity predicate should address the hierarchical structure of the documents. For such kinds of similarity matching in XML tree structures, a large body of work is already available [3, 2, 6, 12, 17, 25, 33]. However, as far as identification of fuzzy duplicates is concerned, all previous work suffers from some combination of the following problems:

• Limited support for textual similarity: Many approaches focus on structural similarity only; text nodes are either stripped away before the matching process is initiated, or only simple equality operations on text values are considered. While structural information certainly conveys useful semantic information, most of the discriminating power of real-life XML data, i.e., the items of identifying information that allow documents in a collection to be distinguished from each other (e.g., keys), is assumed to be represented by text nodes. As a result, the resulting similarity notion is often too "loose" for ER applications or ineffective for datasets having poor textual quality.

• Common Schema Assumption: Some methods require that the documents to be matched follow the same schema, e.g., to locate the textual description information of objects. However, large parts of today's XML datasets are non-schematic or have multiple or evolving schemas. In these common scenarios, the effort of a unifying pre-step for structure alignment, when heterogeneous and ad-hoc XML documents are present, is prohibitively time-consuming.

• Scalability: Several of the proposed similarity functions have an expensive runtime complexity or, even worse, do not easily lend themselves to effective optimizations such as the derivation of bounds or the minimization of pairwise comparisons (candidate pruning). This limitation excludes such functions from being used in operations on large datasets.

An alternative way to overcome the first drawback is to decouple textual and structural information by considering them as distinct document representations. Such a separation is also important because text and structure may have different requirements regarding their units of representation, i.e., their tokens, including aspects of feature identification and of weighting their relative importance. The use of multiple document representations brings up the issue of combining similarity evaluation results, which is commonly referred to as the combination-of-evidence problem.

The conceptual separation of text and structure for XML document representation does not exclude, however, the exploitation of structure to further improve the quality of textual similarity results. To this end, a common approach in XML retrieval is to qualify each textual unit by the XPath context, i.e., the root-to-leaf navigational path, in which it appears [7]. For ER applications, such context segmentation is crucial for the following reasons: first, to prevent token comparison among unrelated concepts and, second, to restrict the comparison of fields to those which are important for the resolution process; for example, while author and title are very significant in an XML database about publications, year_of_publication is expected to be much less informative. Moreover, it avoids distortion of feature statistics, e.g., a feature can appear in several fields but with different frequencies.

The ability to approximately match navigational paths is very important for dealing with structural heterogeneity. In this context, the similarity evaluation should account not only for tokens appearing under identical paths, but also for the case where these paths are similar according to some similarity notion. A commonly used heuristic for the comparison of hierarchically structured data is to exploit the nesting level and to assign higher scores to elements matching at lower nesting depths. Finally, performance optimizations need to address two aspects to deal with large datasets. First, repeated path similarity evaluations should be avoided for tree patterns appearing several times in documents. Second, the set of tokens derived from XML trees can be potentially large, requiring substantial overhead for pairwise comparisons.

1.2 Our Contribution

This paper makes the following contributions. We present an effective, flexible, and high-performance XML-based similarity join framework addressing all the issues above. We start by defining a set-overlap-based similarity function for XML paths. By using the nesting level of path elements in a monotonically decreasing way, we are able to obtain meaningful results for frequent types of path variations. We then apply this function for clustering the structural summary of all paths in a target collection of XML documents. The resulting set of path clusters is used to equip the original structural summary with so-called Path Cluster Identifiers (PCIs), which identify all the paths belonging to the same cluster. Using PCIs, we are able to easily generate textual and structural representations of a document while keeping fine-granular feature statistics constrained at the cluster level. In this context, we explore approaches to weight structural tokens and to combine different representations of a document, namely techniques for combination of evidence at the feature and score levels. We also devise a cluster prototype for each path cluster, i.e., a structure subsuming all paths contained in a cluster. This structure is then used to match (partial) paths against its cluster prototype. The benefit of cluster prototypes is two-fold: users can specify which part of the documents will constitute their textual representation with only vague knowledge about the underlying structure, and existing clusters can be updated as operations such as the insertion of a new document add a new path to the structural summary. We quantify the accuracy and runtime performance of our approaches in the context of a native XML database management system.

The rest of this paper is organized as follows. After having outlined the related work in Section 2, Section 3 gives the preliminaries. In Section 4, we introduce our approach to XML path clustering used to derive token-based representations from documents, whereas we present the corresponding similarity functions in Section 5. Aspects of the decomposition of a document into structural and textual representations are covered in Section 6. Performance and accuracy results for this novel approach are described in Section 7, before we wrap up with the conclusions in Section 8.

2. RELATED WORK

Entity resolution is normally conducted in a very complex process involving several stages, including data preparation, candidate generation, data matching, and merging of results [36]. Similarity joins can play different roles in this process [21]. Besides providing direct input to some classification model, similarity joins can, for example, be used to reduce the number of candidate pairs considered by other, usually much more expensive similarity measures—an operation known as blocking. As another example, similarity joins can be used to support training-set construction for machine learning approaches [5].

Weis and Naumann [35] address ER in the context of XML datasets. Relevant content of each target object—either predefined or heuristically identified—is extracted from the XML data and stored in a relation (object description), and string similarity is applied to identify duplicates. This approach assumes that all XML documents comply with the same schema; thus, its usage in ad-hoc or "schema-later" settings is impractical.

A very large body of work deals with XML structural similarity [12, 17, 25]. Most of these contributions address the problem of identifying documents valid for a similar DTD and have some of the drawbacks mentioned in Section 1.1. The tree edit distance [33]—and its many variants, e.g., [25]—is a popular measure for comparing trees. While there are algorithms that compute the edit distance between ordered trees in polynomial time, those for unordered trees are known to be NP-complete [37]. In data-centric XML, however, sibling order is unimportant and, therefore, such documents are better modeled as unordered trees. Tackling this problem, Augsten et al. [2] present the concept of windowed pq-grams. Informally, the pq-grams of a tree are its subtrees having a specific shape [3]. Windowed pq-grams extend pq-grams to deal with unordered trees. Another alternative is based on the concept of path shingles [6]. The path-shingles approach generates tokens by prepending to each element node in a tree the full path from the root to this node. Both techniques convert a tree into a set of tokens and could be used in our similarity join framework. We compare our method with these approaches in Section 7.3.

Guha et al. [13] presented a framework for XML similarity joins based on the tree edit distance. To improve scalability, the authors optimize and limit distance computations by using lower and upper bounds for the tree edit distance as filters and a pivot-based approach for the partitioning of the metric space. However, these computations are still in O(n²), and the reduction of pairwise distance calculations heavily depends on a good choice of document "pivots" (reference set, in their nomenclature).

Here, we assume that element tag labels are drawn from a common vocabulary. This assumption is often reasonable for real-world heterogeneous XML stores, where structure can be arbitrarily arranged, but there is some understanding of a set of interesting tags [23]. When this is not the case, the underlying vocabularies have to be standardized before our similarity join algorithm takes place. But vocabulary or ontology integration is beyond the scope of this paper.

Exploitation of structure has been extensively investigated in XML search [20]. In this context, the user may have only limited knowledge about the underlying structure of an XML collection. Therefore, similarly to our context, text nodes have to be located approximately, i.e., the path to the specific node does not need to exactly match the structural constraints expressed in the query. In [7], fuzzy and partial matching of term contexts (root-to-leaf paths) is supported. A shortcoming of this approach is that path similarity has to be computed every time a query is evaluated. In our approach, path similarity is computed only once, in a pre-processing step, as described in Section 4.2.

Clustering of tree-structured data is another active research topic [22, 19, 1, 11]. Common approaches compute the similarity within a cluster by exploiting additional knowledge sources (e.g., a thesaurus) and context information [22] or frequent structural patterns in a data collection [1]. Our work is similar in spirit to the bag-of-tree-paths model proposed in [19], which represents tree structures by a set of paths. Moreover, approaches combining structure and content to achieve a specific clustering of XML documents have been proposed [11]. Here, we do not cluster XML documents directly; rather, we cluster the (much smaller) structural summaries to derive compact document representations. Furthermore, because we aim at supporting ER processes, we have to support approximate matching on strings to account for textual deviations.

3. BACKGROUND

Before we introduce our approach, we make the underlying assumptions explicit and review the most important definitions of the subject area.

3.1 XML Data Model

XML documents are modeled as unordered labeled trees. We distinguish between element nodes and text nodes, but not between element nodes and attribute nodes; each attribute is a child of its owning element. Disregarding other node types such as Comment, we consider only data of string type. Given a node n, δ(n) corresponds to the label of n, if n is an element node, or to n's value, if n is a text node. Henceforth, when no confusion arises, we use n to represent an element (text) node as well as its label (value).

3.2 Textual XML Representation

A common approach in IR is to characterize a document by a profile, i.e., a compact representation of its content. This process is known as indexing and we refer to the units of representation as tokens. In this paper, our representation of XML documents comprises textual and structural information. In the following, we define textual units in isolation (ignoring context) and defer the definition of structural units and context-aware textual representation to Section 5.1 and Section 5.2, respectively.

In IR, tokens are typically identified with words. For ER systems, however, similarity matching has to be applied to deal with text deviations and, therefore, more fine-granular representation units are necessary. The concept of q-grams is widely used in this context. Informally, a q-gram of a string s is a substring of s of size q. The q-gram profile, denoted by Iq(s), is the set of q-grams obtained from a string s. The common procedure used to produce a q-gram profile consists of "sliding" a window of size q step by step over the characters of s. It is useful to (conceptually) extend each string s by prefixing and suffixing it with q − 1 null characters (a null character is a special symbol not present in any string). Thus, each character contained in s participates in exactly q (of a total of size(s) + q − 1) q-grams.
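To make the token-generation step concrete, here is a minimal sketch of q-gram profile construction with the conceptual null-character extension (the padding symbol `#` and the function name are our own illustration, not part of the paper):

```python
def qgram_profile(s: str, q: int = 2, pad: str = "#") -> set[str]:
    # Conceptually extend s with q-1 null characters on both sides,
    # then slide a window of size q over the extended string.
    ext = pad * (q - 1) + s + pad * (q - 1)
    return {ext[i:i + q] for i in range(len(ext) - q + 1)}

# Each character of "mother" participates in exactly q q-grams:
print(qgram_profile("mother"))  # {'#m', 'mo', 'ot', 'th', 'he', 'er', 'r#'}
```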

3.3 Indexing and Weighting

Many experiments have consistently revealed that not all tokens are equally important for similarity evaluation [30, 8]. The process of quantifying this importance is referred to as weighting. In particular, tokens that appear very frequently in a collection of documents contribute less to discriminating among them, whereas rare tokens carry more content information and, thus, are more discriminative. For this reason, the inverse document frequency (idf) is normally used as token weight. The idf of a token t is inversely proportional to the total number of documents ft in which t appears in a collection of N documents. A typical idf weight is given by idf(t) = ln(1 + N/ft). The term frequency (tf), i.e., the frequency of a token in a document, is also used for weighting. Given a token t appearing fd,t times in a document, its tf weight can be computed as tf(t) = 1 + ln(fd,t). The product of both term statistics constitutes the well-known tf-idf weighting scheme.

The utility of the tf weight for string-to-string similarity comparison has been questioned [14]. In IR, a user-formulated query is compared to normally much larger documents, and it is reasonable to consider the frequency of a query token in a document as an indication that the document is relevant to the query. For string-to-string comparison, however, the intuition is just the opposite: frequency discrepancy of the same token present in two strings should decrease their similarity. Moreover, ER systems usually perform comparisons on much shorter strings, e.g., author name fields. A simple solution consists of converting a profile I, composed of a bag of tokens, into a profile Ia, composed of a set of annotated tokens, by appending an ordinal number to each occurrence of a token. For example, the profile I = {a, b, b} is converted to Ia = {(a, 1), (b, 1), (b, 2)}. Annotated tokens have fd,t = 1 and their weights are determined by the idf weight only. Furthermore, divergence in the term frequencies of a token will penalize the similarity of the related strings more strongly, because subsequent occurrences of the referenced token have a lower document frequency in the collection (and, therefore, a higher idf weight).
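The following sketch illustrates this conversion and the idf weight; the helper names and sample values are ours:

```python
import math
from collections import Counter

def annotate(tokens):
    # Turn a bag of tokens into a set of annotated tokens by appending
    # an ordinal to each occurrence: [a, b, b] -> {(a,1), (b,1), (b,2)}.
    seen, result = Counter(), set()
    for t in tokens:
        seen[t] += 1
        result.add((t, seen[t]))
    return result

def idf(ft: int, N: int) -> float:
    # idf(t) = ln(1 + N/ft): rare tokens receive higher weights.
    return math.log(1 + N / ft)

print(annotate(["a", "b", "b"]))  # {('a', 1), ('b', 1), ('b', 2)}
```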

3.4 Set-Overlap-Based Similarity Functions

Set-overlap-based similarity functions evaluate the similarity of two objects based on the number of tokens they have in common. It is well known that similarity functions are highly domain-dependent: even functions performing well on average may perform poorly on some datasets. Therefore, an important property of this class of functions is versatility: it is possible to obtain diverse notions of similarity by measuring the overlap differently [31]. Next, we formally define the weighted version of the well-known Jaccard similarity, which will be used as a building block for other similarity functions defined further in this paper; the weighted Jaccard similarity has been shown to provide competitive effectiveness compared with several other popular (and often more computationally expensive) measures [8].

Definition 1. Let R be a set of tokens and let w_R(t) be the weight of token t relative to R according to some weighting scheme. Let the weight of R be given by wt(R) = Σ_{t∈R} w_R(t). Similarly, consider a set S. The weighted Jaccard similarity (WJS) of R and S is defined as follows:

    WJS(R, S) = wt(R ∩ S) / (wt(R) + wt(S) − wt(R ∩ S))    (1)

where the weighted set overlap of R and S is given by:

    wt(R ∩ S) = Σ_{t∈R∩S} min(w_R(t), w_S(t)).
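Definition 1 translates directly into code when weighted sets are represented as token-to-weight dictionaries (a minimal sketch, not the authors' implementation):

```python
def wjs(R: dict, S: dict) -> float:
    # Weighted overlap: sum of minimum weights over the common tokens.
    overlap = sum(min(R[t], S[t]) for t in R.keys() & S.keys())
    return overlap / (sum(R.values()) + sum(S.values()) - overlap)
```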

3.5 Set Similarity Joins

In the database community, there are two main processing models for the efficient evaluation of joins employing set-overlap-based similarity functions. The first uses an unnested representation of sets in which each set element is represented together with the corresponding object identifier. Query processing commonly relies on relational database machinery: equi-joins supported by clustered indexes are used to identify all pairs sharing tokens, or elements of set surrogates, the so-called signature schemes [9], whereas grouping and aggregation operators together with UDFs are used for the final evaluation [9]. In the second model, an index is built for mapping tokens to the list of objects containing that token [31, 4]—for self-joins, such an index can be built dynamically as the query is processed. The index is then probed for each object to generate the set of candidates, which will later be evaluated against the overlap constraint. Previous work has shown that approaches based on indexes consistently outperform relational-based approaches [4] (see also [14] for selection queries). The primary reason is that a query processing model based on indexes provides superior optimization opportunities. A major method for this is to use index reduction techniques [4]; such techniques significantly minimize the number of tokens to be indexed as well as the number of tokens that need to remain indexed as the set similarity join is processed [29].
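The index-based model can be sketched as follows for a self-join; this skeleton omits the signature and index-reduction optimizations cited above and uses our own names:

```python
from collections import defaultdict

def candidate_pairs(sets: dict) -> set:
    # sets: object id -> token set. The index is built on the fly, so
    # each probe only sees previously indexed objects (self-join).
    index = defaultdict(list)        # token -> ids of objects containing it
    candidates = set()
    for oid, tokens in sets.items():
        for t in tokens:
            for other in index[t]:   # any earlier object sharing a token
                candidates.add((other, oid))
            index[t].append(oid)
    return candidates  # later verified against the overlap constraint
```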

3.6 Combination of Evidence

In the ER context, similarity matching results obtained from different combinations of factors such as document representation, similarity function, and weighting scheme can be viewed as multiple pieces of evidence that two objects are duplicates or not. Such pieces of evidence have to be merged in a meaningful manner before being used as input to some classification model. This is also a concern for document retrieval systems, which very commonly employ more than one retrieval and representation model for improving effectiveness or have to combine results from several search engines—the latter is known as the meta-search problem. In general, solutions proposed for the ER context are applicable to the IR context, and vice versa. Common approaches consist of training classifiers—based on methods such as decision trees, Support Vector Machines (SVMs), and logistic regression—to learn a combination model [5, 10], or of applying simpler functions such as the simple average on the (normalized) similarity values or on an induced ranking to obtain a conciliated result [10]. When using a single similarity function and weighting scheme, it is possible to combine different document representations at the token level. Using single collection statistics over all document representations, a linear combination weighted by the corresponding representations can be employed to form token weights [26]. Alternatively, token weights can be directly obtained from token collection statistics relative to each document representation in isolation [7]. Finally, multiple sources of evidence can be combined at token generation time. For example, one can produce tokens that jointly capture structural and textual properties of an XML tree [28].

4. XML PATH CLUSTERING

In this section, we first define a similarity function for navigational XML paths, which is used in a clustering method to group the paths of a structural summary. We then employ these clusters to derive representations of XML documents.

4.1 Path Similarity Function

We consider paths formed by element nodes. Token sets can be derived from paths by converting each element node in a given path into an annotated token. The motivation for using annotated tokens is different from that discussed in Section 3.3. Here, token annotation serves to deal with paths containing element recursion, i.e., paths containing multiple occurrences of the same label. For example, consider two paths p1 = /a/b/a and p2 = /a/b and their corresponding profiles of annotated tokens Ia(p1) = {(a, 1), (b, 1), (a, 2)} and Ia(p2) = {(a, 1), (b, 1)}. Therefore, we have Ia(p1) ∩ Ia(p2) = {(a, 1), (b, 1)}. Note that, in this way, the matching of tokens derived from recursive labels is done from low to high nesting levels.

Further, a weighting scheme can be applied to express the relative node significance in a path. In hierarchically structured data, more general concepts of the corresponding domain are normally placed at lower nesting depths. Mismatches between two paths on such low-level concepts may suggest that the information contained in them is semantically more "distant". Therefore, an intuitive heuristic is to assign higher importance to nodes at lower nesting depths. Finally, given two paths represented as weighted sets, their similarity can be assessed using a set-overlap-based similarity function as defined in Section 3.4. Next, we formally define these concepts.

Definition 2. Let p = (n0, . . . , nn) be a path of element nodes, where node ni is at nesting level i. The level-based weighting scheme assigns a weight to each path element, i.e., LWS(p) = (n0, w0; . . . ; nn, wn), where:

    wi = e^(βi)    (2)

and β ≤ 0 is a decay rate constant.

Definition 3. Let p1 and p2 be two paths. The weighted path similarity (WPS) between p1 and p2 is given by:

    WPS(p1, p2) = WJS(LWS(p1), LWS(p2))    (3)

using the weighted Jaccard similarity defined in Equation 1.

Example 2. Consider p1 = patient/relatives/mother and p2 = study/patient/mother, two (partial) paths appearing in the document shown in Figure 1. Applying LWS with decay rate β = −0.1, we obtain the following weighted sets: LWS(p1) = {(p, 1), (r, 0.904), (m, 0.818)} from p1, and LWS(p2) = {(s, 1), (p, 0.904), (m, 0.818)} from p2. The two sets share the tokens patient and mother; the weight value of 0.904 for the token patient is returned by the minimum function. Thus, the weighted overlap is 1.723. The weighted norm of both LWS(p1) and LWS(p2) is 2.723. Thus, WPS(p1, p2) = 1.723 / (2.723 + 2.723 − 1.723) = 0.462.
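The numbers of Example 2 can be reproduced with a few lines combining Definitions 2 and 3 with Equation 1 (a sketch under the paper's definitions; `wjs` is the weighted Jaccard sketched in Section 3.4):

```python
import math

def lws(path, beta=-0.1):
    # Annotate repeated labels and weight the node at level i with e^(beta*i).
    weights, seen = {}, {}
    for i, label in enumerate(path):
        seen[label] = seen.get(label, 0) + 1
        weights[(label, seen[label])] = math.exp(beta * i)
    return weights

def wps(p1, p2):
    return wjs(lws(p1), lws(p2))

print(wps(("patient", "relatives", "mother"),
          ("study", "patient", "mother")))  # ~0.462
```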

We observe that, owing to the strictly decreasing weighting rule of LWS, WPS may yield non-intuitive results when applied to long paths. For example, depending on the value of the decay rate β, two long paths sharing a smaller portion of element nodes at higher nesting levels, but having a larger part of unrelated nodes at lower levels, will have a higher similarity according to WPS. One solution consists of making the weights constant from some level on; in this way, the weighted norms in the denominator are kept sufficiently large to decrease the final result. On the other hand, empirical studies provide evidence that the vast majority of XML data worldwide has less than 8 levels [10] and, therefore, the above effect is hardly an issue in practice.

4.2 XML Representation based on Path Cluster Identifiers

We now exploit WPS to cluster path classes, i.e., paths that occur at least once in at least one document of a collection. All path classes of an XML dataset are typically maintained by index structures called structural summaries (also known as 1-Index [24] or path synopsis [15]; see [24] for a formal definition). For example, the structural summary of the document shown in Figure 1 is depicted in Figure 2 (disregard the node identifiers and the accompanying table for the moment). Path summaries are instrumental for efficient XPath query evaluation;¹ therefore, they are pervasively supported by XML DBMSs. Here, our goal is to exploit path summaries to derive compact structural surrogates of XML trees: we use a clustering method to group similar path classes and then use the resulting cluster information to generate the representation of documents.

In the following, we describe the path clustering process in detail. Given a structural summary of a document collection, we start by specifying a target label tg(l), corresponding to the entity to be matched (e.g., entry in Figure 1). Let P^tg(l) be the set of all path classes relative to tg(l), i.e., paths in the structural summary having tg(l) as root label. In case of nested occurrences of tg(l), we consider only the paths rooted by the topmost occurrence. We then use WPS to generate a proximity matrix containing all pairwise similarity values of P^tg(l). This similarity matrix is the input for a clustering method that generates a set of path clusters (partitions) on P^tg(l).

¹ In reference [15], path summaries are exploited for designing space-economic storage models.

[Figure 2: Structural summary equipped with PCIs. The structural summary of the collection of Figure 1, with each path class labeled by its PCI (1–4); the box on the left lists, per PCI, the similarity value at which the cluster was formed (0.633 and 0.745) relative to the cutting threshold θ.]

In this paper, we use the UPGMA Agglomerative Hierarchical Clustering method with a user-specified threshold as the cutting point in the dendrogram [18].² We denote the set of path clusters generated from P^tg(l) at cutting threshold θ by PC^tg(l)_θ. A Path Cluster Identifier (PCI) is used to distinguish the path clusters in PC^tg(l)_θ. Finally, we identify all path classes p ∈ PC in the structural summary by assigning to them the corresponding PCI. For ease of notation, let i be the PCI of a path cluster pc_i ∈ PC^tg(l)_θ. Figure 2 illustrates this for tg(l) = entry, θ = 0.6, and a decay rate of β = −0.1 for LWS. The values in the box on the left are the similarity values at which the clusters were formed.

² Note that the specific clustering method is orthogonal to all the methods used in this paper. Of course, other techniques such as K-means and DBSCAN are applicable.

After having equipped a structural summary with PCIs, we are able to automatically derive a token-based structural representation of XML documents. For this purpose, we decompose a document into a set of paths and, for each path p, the corresponding PCI is used as a structural token, i.e., i such that p ∈ pc_i. As a consequence, the structure of the document is represented as a set of PCIs, each token denoting the appearance of a path cluster element in the document. In Figure 2, note that subtrees a) and b) are represented by the same PCI token set {1, 2, 3, 4}; as a result, they would have maximum similarity according to any set-overlap-based similarity function, even though they have no path in common. This latter observation highlights a salient feature of our PCI-based representation: equality matching of single tokens incorporates similarity matching of whole paths for free. The actual path comparison is done only once, during the clustering process, thereby avoiding repeated path similarity computations when evaluating the similarity join operation. Next, we describe our PCI-based similarity functions in detail.
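As a minimal sketch of the clustering step just described: SciPy's average-linkage implementation corresponds to UPGMA; the conversion from WPS similarities to distances via 1 − WPS and the helper names are our choices, not prescribed by the paper:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_paths(paths, wps, theta=0.6):
    # paths: path classes as tuples of labels; wps: pairwise similarity.
    n = len(paths)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = 1.0 - wps(paths[i], paths[j])
    # UPGMA = average linkage; cut the dendrogram at distance 1 - theta.
    Z = linkage(squareform(dist), method="average")
    labels = fcluster(Z, t=1.0 - theta, criterion="distance")
    return dict(zip(paths, labels))  # path class -> PCI
```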

5. XML TREE SIMILARITY

In this section, we first define the subsets of PC^tg(l)_θ that are used to produce token sets for structure and content; then we describe the corresponding similarity functions. These functions are combined to define similarity functions for XML trees. (In the following, we omit the target label and the cutting threshold in the notation for a set of path clusters to avoid clutter, i.e., PC^tg(l)_θ = PC.)

Definition 4. Let C be a collection of XML documents and let PC be a cluster set generated from the structural summary of C. We decompose PC into two disjoint sets, PCs and PCt, such that PCs ∪ PCt = PC.³ PCt and PCs do not form a partition, however, because one or the other set can be empty. We call PCs the set of structural path clusters and PCt the set of textual path clusters.

³ Details of this decomposition are given in Section 6.

5.1 Structural Similarity

Given an XML document d from C, let P^s_d be the set of paths of d where each path is a member of some path cluster in PCs, i.e., P^s_d = {p : p ∈ pc and pc ∈ PCs}. Given P^s_d, we use the corresponding set of PCIs to obtain the profile of annotated tokens I^s_a(d) of d. PCI tokens are generated as described in Section 4.2.

Definition 5. Let d1 and d2 be two XML documents and let I^s_a(d1) and I^s_a(d2) be their respective structural profiles. The structural similarity function (PCI-S) is given by:

    PCI-S(d1, d2) = WJS(I^s_a(d1), I^s_a(d2))    (4)

We now discuss the weighting scheme used for structural tokens. In the definition of PCI-S, we use annotated tokens, which rules out tf-based weighting schemes. However, a different rationale is possible for structural tokens than for textual tokens (Section 3.3). For example, because data collections associated with an XML schema may carry loose constraints on the data, e.g., allowing repeating and optional elements, node frequency discrepancies can naturally emerge among documents. In this sense, the tf weighting component, as defined in Section 3.3, or any other weighting formula that dampens the effect of token frequency in a document, provides a good compromise. However, while this particular notion of similarity is reasonable when dealing with the identification of XML documents valid for a similar DTD [12, 17, 25], it is not clear whether or not a similarity evaluation based on such an approach would produce meaningful results in the ER context. We empirically evaluate different weighting schemes for structural tokens in Section 7.3.

5.2 Textual Similarity

Given a document d from C, let t_i be a text node of d appearing in a path cluster pc_i ∈ PCt (or, more precisely, in a path p ∈ pc_i). Let I^q_a(t_i) be the annotated set of q-grams of t_i as defined in Section 3.2. We convert I^q_a(t_i) to PCI^q(I^q_a(t_i), i), the set of so-called pci-grams of t_i, by appending i, the PCI value of the path cluster pc_i, to each q-gram in I^q_a(t_i). Further, let T^t_d be the set of text nodes t_i of d appearing in some path cluster pc_i ∈ PCt. The profile of annotated pci-grams of d, denoted as PCI^q_d, is given by PCI^q_d = ∪_{t_i ∈ T^t_d} PCI^q(I^q_a(t_i), i).

Definition 6. Let d1 and d2 be two XML documents, and let PCI^q_d1 and PCI^q_d2 be their respective textual profiles. The textual similarity function, denoted as PCI-T, is given by:

    PCI-T(d1, d2) = WJS(PCI^q_d1, PCI^q_d2)    (5)

Note that, besides being used to compose the textual representation of a document, PCIs also serve to approximately locate text nodes. For example, in Figure 1, we are able to compare the text value of mother in subtrees a) and b) because their respective paths have the same associated PCI = 4 (see Figure 2). Further, we consider only the idf weighting scheme for textual tokens. The collection statistics of a token t_i are constrained to the path cluster pc_i, i.e., f_ti corresponds to the number of documents in which token t appears in pc_i.
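The construction of the textual profile can be sketched as follows, reusing the `annotate` helper from the Section 3.3 sketch (the function names are ours; `text_nodes` is assumed to yield (PCI, text value) pairs for text nodes located in clusters of PCt):

```python
def qgrams(s: str, q: int = 2, pad: str = "#") -> list[str]:
    # Bag of q-grams, including padded ones (multiplicities preserved).
    ext = pad * (q - 1) + s + pad * (q - 1)
    return [ext[i:i + q] for i in range(len(ext) - q + 1)]

def pci_gram_profile(text_nodes, q: int = 2) -> set:
    # Extend each annotated q-gram with the PCI of its path cluster,
    # so that q-grams from unrelated contexts can never match.
    profile = set()
    for pci, text in text_nodes:
        for gram, ordinal in annotate(qgrams(text, q)):
            profile.add((gram, ordinal, pci))
    return profile
```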

5.3 Similarity Combination

We now consider ways to combine textual and structural similarity. In this paper, we examine two approaches: score-level combination (SLC) and token-level combination (TLC). The first is the linear combination of the resulting scores of PCI-S and PCI-T. In this approach, the similarity functions are evaluated independently, possibly using different weighting schemes, and their results are combined using weights that can be either hand-tuned or obtained from some learning model. The second approach performs the union of the structural and textual profiles of a document, thereby obtaining a unified representation. In this approach, we use the same (idf) weighting scheme for both textual and structural tokens and a single similarity function. Note that textual tokens would have lower frequencies than structural tokens (and therefore greater idf weights). Next, we formally define both strategies.

Definition 7. Let d1 and d2 be two XML documents. Let λt and λs be weights such that λt + λs = 1. The score-level similarity combination (SLC) between d1 and d2 is given by:

    SLC(d1, d2) = λs × PCI-S(d1, d2) + λt × PCI-T(d1, d2)    (6)

Definition 8. Let d1 be an XML document. The combined representation of d1, denoted as C^d1_a, is given by C^d1_a = I^s_a(d1) ∪ PCI^q_d1. Similarly, consider a document d2. The token-level similarity combination (TLC) between d1 and d2 is given by:

    TLC(d1, d2) = WJS(C^d1_a, C^d2_a)    (7)
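Both strategies reduce to a few lines on top of the profiles defined above (a sketch; the profile arguments are assumed to be idf-weighted token-to-weight dictionaries for I^s_a(d) and PCI^q_d, and `wjs` is the weighted Jaccard sketched in Section 3.4):

```python
def slc(struct1, struct2, text1, text2, lam_s=0.5, lam_t=0.5):
    # Score-level combination: PCI-S and PCI-T are evaluated
    # independently and their scores are combined linearly.
    return lam_s * wjs(struct1, struct2) + lam_t * wjs(text1, text2)

def tlc(struct1, text1, struct2, text2):
    # Token-level combination: unite both profiles into one weighted
    # set and apply a single similarity function (single idf scheme).
    return wjs({**struct1, **text1}, {**struct2, **text2})
```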

6. DECOMPOSITION OF TEXTUAL AND STRUCTURAL REPRESENTATIONS

So far, we have described how we accomplish XML path clustering and how we compute textual and structural similarity. We are now ready to discuss the process of determining the sets PCt and PCs. Because this decomposition of the original set of clusters defines which parts of an XML document will deliver textual or structural tokens, it is a key design decision. As previously mentioned, textual information generally has more discriminating power than structural information in XML documents. However, simply using all textual information available will hardly be effective. Issues such as lack of independence are known to negatively affect the quality of duplicate detection results when more fields than needed are used [36]. Moreover, textual tokens such as q-grams significantly increase the space overhead, thereby directly having a negative impact on performance. Therefore, the ability to select specific parts of the textual information from XML documents is a fundamental requirement in our similarity join framework.

The above considerations about restrictions on the use of textual information raise similar concerns about structural information. The first question is whether to use it at all. The role of structure has been intensively explored in the context of XML retrieval. There exists empirical evidence that using structure in queries improves precision at low recall levels, but hurts performance at high recall levels [20]. In the ER context, the number of relevant answers is expected to be substantially smaller in comparison to XML retrieval.⁴ In fact, because common real-world datasets contain small amounts of fuzzy duplicates, most XML trees have either no duplicates or only a small number of them. Therefore, the fraction of relevant elements for which the use of structure has been shown to enhance precision in XML retrieval (first retrieved elements) matches the typical recall range in the ER context. Furthermore, from a semantic viewpoint, structure has special importance regardless of its discriminative power. For example, two XML documents having a completely different structure would certainly be classified as non-duplicates. Finally, by representing the structure by a set of PCIs, we are able to derive very small token sets, causing only a little performance penalty.

⁴ A common approach to evaluate the effectiveness of ER algorithms consists in considering a document from the dataset as a query and the number of true duplicates of this document as the set of relevant answers. Therefore, the use of standard IR evaluation measures is straightforward. We use this approach in this paper (see Section 7.2).

The definition of the best decomposition configuration critically depends on the application domain. Hence, we let users specify the PCt set by issuing so-called PCt selection queries. Such a query consists of one or more simple path expressions that are approximately matched against the set of clusters. The top-k answers, i.e., the k cluster sets with the highest similarity scores, are selected to constitute PCt; the remaining cluster sets form PCs. Because users are likely to have only vague knowledge about the underlying structure of the data collection, we support exploratory queries for cluster decomposition before the actual similarity join evaluation takes place. Currently, we return several pieces of statistical information about the top-k results, such as cluster frequency in the collection and the mean text value length of text nodes appearing in the cluster set. The user can then reformulate the query if, for example, the returned cluster set appears only in a few subtrees or is related to a part of the document dominated by lengthy free text. We plan to enhance user guidance on text-cluster set selection with information-theoretic techniques [32].

To avoid the burden of comparing path queries with all path classes of a cluster, we employ a path cluster prototype, a structure subsuming all elements of a cluster. Path queries are then matched with cluster prototypes rather than with each path class. To this end, we adopt a simple but effective level-based structure in which all labels appearing at the same level are kept together. Figure 3a shows the prototype of the path cluster with PCI 4 in our running example. The comparison between a PCt selection query and a prototype is done using the WPS function, similarly to that of regular paths. The main difference is that only a single label match is allowed per level. For example, in the PCt selection query shown in Figure 3a, the study component cannot be matched with the cluster prototype because of the previous match of patient at level 1.
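A possible reading of this matching rule is sketched below: the prototype is simplified to a list of label sets per level, query components are scanned top-down, and each level may contribute at most one label match. Only the weighted overlap is computed here; the full WPS would normalize it as in Equation 1:

```python
import math

def prototype_overlap(query, prototype, beta=-0.1):
    # prototype[level]: set of labels kept at that level of the cluster
    # prototype; at most one label match is allowed per level.
    score, next_level = 0.0, 0
    for label in query:
        for level in range(next_level, len(prototype)):
            if label in prototype[level]:
                score += math.exp(beta * level)  # LWS weight of the level
                next_level = level + 1
                break
    return score
```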

We design a specialized inverted list index to represent the set of cluster prototypes and to efficiently match PCt selection queries against them. An inverted list is constructed for each annotated label appearing in a cluster prototype. Figure 3b shows the inverted lists of the annotated labels appearing in the example of Figure 3a. The space requirement of the index is O(L × P × D), where L is the number of distinct annotated labels in the dataset (L can be greater than the number of distinct labels due to recursion), P is the number of distinct paths, and D is the maximum path length among all documents.

[Figure 3: Path cluster prototype matching. (a) The cluster prototype of PCI 4 in the running example—a level-based structure with entry at level 0, patient at level 1, study/patient/relatives at level 2, and mother at level 3—together with a PCt selection query whose study component cannot be matched. (b) The inverted lists of the annotated labels (patient, 1) and (mother, 1), mapping PCIs to the sets of levels at which each label occurs in the corresponding prototype.]

Clearly, this analytical bound is "loose" and, in practice, the space consumption is expected to be much smaller. Moreover, the vast majority of XML datasets present fairly moderate values for L, P, and D, e.g., less than 200 path classes and a maximum depth of 8 [15]. Therefore, for such datasets, the dictionary as well as the inverted lists can be kept memory-resident.

Furthermore, we can ensure truly interactive responses even under concurrent requests by employing well-known IR optimizations such as the max-score strategy [34]. The order of evaluation of the inverted lists follows the order of the PCt selection query components. This means that the upper bound for the weight of matching labels decreases monotonically as the query is evaluated (recall that we use the minimum as aggregation function in Equation 1), and therefore it is easy to compute the highest score that a cluster prototype containing all remaining components of the PCt selection query can achieve. Thus, we can promptly stop the query evaluation and return the current top-k candidates as soon as there is no other candidate in the intermediate result whose highest possible score is higher than the score of the kth candidate. In practice, only the inverted lists associated with the first query label components will need to be evaluated. Note that we are interested only in the top-k results; their complete score is unimportant.

Besides matching PCt selection queries for cluster set decomposition, our inverted list representation of cluster prototypes is also important for (PCI-equipped) path summary maintenance in the presence of incremental path class updates. New path classes are matched against the index to automatically select the most similar path cluster. Therefore, the need for complete re-clustering is avoided after a new document is stored, or the re-clustering can be postponed and done off-line. A pre-defined threshold θ′ is used to define a minimum value of closeness between a new path class and the cluster prototypes. If no cluster prototype is returned with a similarity to the new path class of at least θ′, a new cluster is created and the path class is assigned to it. Typically, θ′ is set to the same value as the cutting threshold θ that was used to generate the initial set of path clusters (see Section 4.2).

7. EXPERIMENTS

After having presented our cluster-based approach to similarity joins, we are ready to show our experimental results. Our goals are: a) to evaluate the effectiveness of our PCI-based similarity functions for tree-structured data and compare them with other state-of-the-art token-based similarity functions; b) to investigate different weighting schemes for structural tokens; c) to evaluate different approaches for the combination of structural and textual similarities; and d) to measure the performance of our similarity join framework on large datasets.

7.1 Datasets and Setup

For our empirical study, we used three real-world XML datasets: Nasa⁵, containing astronomical data, and two protein sequence databases, SwissProt⁶ and PIR-PSD⁷ (PSD for short). For all of them, we obtained sets of XML documents by deleting the root node of each XML dataset. The resulting documents are structurally heterogeneous. Nasa is irregular, has large trees and deep nesting levels, and is more document-centric (larger textual content per subtree). SwissProt and PSD are both more regular than Nasa, but the trees of SwissProt are larger and more shallow than those of PSD; in addition, SwissProt has the largest number of distinct tags. Detailed statistics describing the datasets are given in Table 1.

For all experiments, we produced "fuzzy" copies of these datasets by performing random transformations on content and structure. Injected errors on text nodes consist of word swappings and character-level modifications (insertions, deletions, and substitutions). In all experiments, we applied 1–5 such modifications for each dirty copy. These textual modifications simulate typical data quality issues, such as data entry errors. The structural modifications consist of node insertions and deletions, node inversions, and relabeling of nodes. Insertion and deletion operations follow the semantics of the tree edit distance [33], while node inversions switch the position between a node and its parent. Relabeling only changes the node's name (with a DTD-valid substitute). Additionally, we also applied modifications at the subtree level (deletions of entire subtrees) and at the path level (path deletions). We define the error extent as the percentage of nodes of a subtree that were affected by the set of structural modifications. We consider as affected the node that received the modification (e.g., a rename) as well as all its descendants. With these modifications, we aimed at simulating the kinds of structural heterogeneity which commonly arise in the context of semi-structured data management, such as heterogeneity owing to optional elements, schema evolution, and data integration. Finally, we classify the fuzzy copies generated from each dataset according to the error extent used, i.e., we have low (10%), moderate (30%), and dirty (50%) error datasets.

We conducted all experiments and performance measurements using our prototype XML database system called XTC (XML Transactional Coordinator) [16]. All tests were run on an Intel Xeon Quad Core 3350 2.66-GHz machine (about 2.5-GB main memory, Java Sun JDK 1.6.0) as the XDBMS server, where we configured XTC with a DB buffer of 250 8-KB pages.

⁵ http://www.cs.washington.edu/research/xmldatasets/
⁶ http://us.expasy.org/sprot/
⁷ http://pir.georgetown.edu/

Table 1: Dataset statistics

dataset   | #subtrees | avg #nodes per subtree | #distinct tags/paths | avg/max path length | structure/content ratio | avg/max node string size | avg/max tree string size
----------|-----------|------------------------|----------------------|---------------------|-------------------------|--------------------------|-------------------------
Nasa      | 2435      | 371                    | 69/73                | 6.76/7              | 1.43                    | 30.45/12668              | 4646.68/153588
SwissProt | 50000     | 187                    | 98/191               | 3.82/4              | 1.22                    | 10.06/325                | 845.72/24029
PSD       | 262525    | 149                    | 66/72                | 5.99/6              | 1.31                    | 17.13/15535              | 1109.74/19074


7.2 Evaluation Metrics

To evaluate the accuracy of similarity functions, we used our join algorithms as selection queries, i.e., as the special case where one of the join partners has only one entry. We proceeded as follows. Each dataset was generated by first randomly selecting 500 subtrees from the original dataset and then generating 4500 duplicates from them (i.e., 9 fuzzy copies per subtree), all together totaling 5000 subtrees. As the query workload, we randomly selected 100 subtrees from the generated dataset. For each queried input subtree T, the trees TR in the result returned by our similarity join are ranked according to their calculated similarity to T. In this experiment, we did not use any threshold parameter and, therefore, the returned ranking is complete. Finally, during data generation, we kept track of all duplicates generated from each subtree; those duplicates form a partition and carry the same identifier, called duplicate ID. For each tree T, the set of corresponding relevant trees consists of those having the same duplicate ID. Here, we report the non-interpolated Average Precision (AP), which is given by:

    AP = (1 / #relevant trees) × Σ_{r=1}^{n} [P(r) × rel(r)]    (8)

where r is the rank and n the number of subtrees returned. P(r) is the number of relevant subtrees ranked before r, divided by the total number of subtrees ranked before r, and rel(r) is 1 if the subtree at rank r is relevant and 0 otherwise. This measure emphasizes the situation where more relevant document fragments are returned earlier. The mean of the AP over the query workload is reported (abbreviated to MAP in our experimental charts). We also experimented with several other evaluation metrics, such as the 11-point interpolated average precision and the F1 measure, and obtained consistent results.
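Equation 8 translates directly into code (a sketch; `ranking` is the list of returned subtree IDs ordered by similarity, and `relevant` the set of IDs carrying the query's duplicate ID):

```python
def average_precision(ranking, relevant):
    # Non-interpolated AP: precision accumulated at each rank holding a
    # relevant subtree, averaged over the number of relevant subtrees.
    hits, score = 0, 0.0
    for r, tree_id in enumerate(ranking, start=1):
        if tree_id in relevant:
            hits += 1
            score += hits / r  # P(r) * rel(r)
    return score / len(relevant)
```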

For performance experiments, we evaluated the algorithms according to their overall execution time (recorded as the average over 5 runs) and according to the token set size produced, which has a direct impact on the execution time.

7.3 Structural Similarity Results

In this first experiment, our objective was two-fold: 1) evaluating the effectiveness of our structural similarity measure in identifying fuzzy duplicates and comparing it to competing approaches; and 2) comparing different weighting strategies for structural tokens. To evaluate the contribution of the structural similarity measure in isolation, we ignored all text nodes in this experiment, i.e., we made PCt = ∅.

We compared our PCI-based similarity function (PCI-S) against two other token-based approaches: windowed pq-grams (WPQ) and path shingles (PS). For the two competing approaches, we used the parameters as recommended by the authors, i.e., p = q = 2 and w = 3 for WPQ and a window of size 1 for PS. In this and all subsequent experiments, we used a decay rate of 0.1 for LWS and a threshold of 0.4 as cutting point for the dendrogram; we observed the best results with this configuration. Finally, we also considered an additional similarity function (PATH), which simply used the weighted Jaccard formulation (WJS)—as defined in Section 3.4—between the sets of root-to-leaf paths of two trees. Thus, we could observe the specific effect of our path similarity approach on the result quality.

Further, we present results for four different weighting schemes: unweighted (UNWEI), inverse document frequency (IDF), term frequency (TF), and tf-idf (TF-IDF). For UNWEI and IDF, we used WJS. On the other hand, TF and TF-IDF are based on local statistics (term frequency); as a result, token weights can vary among different sets. Using minimum as aggregation function ensures that similarity results are in the interval [0, 1]. However, we experienced unstable results with this formulation. Hence, we refrained from using Jaccard and, instead, employed the well-known cosine similarity with the TF and TF-IDF weighting schemes [30]. We also tested the cosine version of IDF; however, we consistently observed superior results with Jaccard.
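
For illustration, the sketch below contrasts the two set-overlap formulations used here, operating on token-to-weight maps. The sum-of-minima over sum-of-maxima form of weighted Jaccard is an assumption standing in for the WJS definition of Section 3.4, which is not reproduced in this section; the function names are hypothetical.

```python
import math

def weighted_jaccard(w1, w2):
    """Weighted Jaccard between two token->weight maps (assumed WJS form:
    sum of per-token minima over sum of per-token maxima).
    For UNWEI, pass maps with all weights set to 1.0."""
    tokens = set(w1) | set(w2)
    num = sum(min(w1.get(t, 0.0), w2.get(t, 0.0)) for t in tokens)
    den = sum(max(w1.get(t, 0.0), w2.get(t, 0.0)) for t in tokens)
    return num / den if den else 0.0

def cosine(w1, w2):
    """Cosine similarity, used with the TF and TF-IDF schemes [30]."""
    dot = sum(w * w2.get(t, 0.0) for t, w in w1.items())
    n1 = math.sqrt(sum(w * w for w in w1.values()))
    n2 = math.sqrt(sum(w * w for w in w2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```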

Figure 4 shows the results. In all settings, PCI-S is the most effective. Moreover, PCI-S is the most resistant when the error extent increases, showing only slight accuracy degradation as the error extent grows from 10% to 50% (low to dirty dataset). On the other hand, the accuracy of the other three similarity measures, WPQ, PS, and PATH, drops significantly as the error extent increases; it practically decreases at the same rate. WPQ performs better than PS and PATH, but it is close to PCI-S only on low-error datasets.

To better understand the reasons for the superiority of PCI-S over the other approaches, we conducted additional experiments on datasets generated by only one kind of structural modification. We observed that the results favoring PCI-S are, in general, more prominent on datasets generated by node-level modifications, i.e., node insertion, rename, inversion, and deletion. This observation is not a surprise, because the other measures cannot capture modifications that change the root-to-leaf path of an element.

All similarity functions show better results on Nasa. This is expected because Nasa has the most irregular structure, therefore providing a good structural “signature” to identify a subtree. Accordingly, the worst results are observed for the PSD dataset, which has the most regular structure and in which many subtrees have a very similar shape. Interestingly, the accuracy gap between PCI-S and the other measures is larger in this restricted feature domain.

The better results of PCI-S are even more impressive when we observe the average token set size delivered by each representation in Table 2: for all datasets, PCI-S produces token sets about 5 times shorter than WPQ.

[Figure 4 omitted: bar charts of MAP per weighting scheme (UNWEI, IDF, TF, TF-IDF) for PCI-S, WPQ, PS, and PATH. Panels: (a) Nasa, low-error; (b) Nasa, moderate-error; (c) Nasa, dirty-error; (d) SwissProt, low-error; (e) SwissProt, moderate-error; (f) SwissProt, dirty-error; (g) PSD, low-error; (h) PSD, moderate-error; (i) PSD, dirty-error. MAP axes range 0.0–1.0 for Nasa and SwissProt and 0–0.6 for PSD.]

Figure 4: MAP values for different structural similarity functions on different datasets

Table 2: Average structural token set size

        Nasa   SwissProt   PSD
PCI-S   143    79          60
PS      360    181         145
WPQ     693    295         262

To be resistant to subtree permutations, WPQ has to produce an increased number of pq-grams [2], while for PCI-S the number of tokens is always bounded by the number of paths. Moreover, the use of Path Cluster Identifiers provides a very compact representation for free, without the need of any codification method such as a fingerprint hash function.
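
To illustrate why the token count is bounded by the path count, consider the following sketch: each root-to-leaf path contributes exactly one token, namely the identifier of the path cluster it belongs to. Here, path_to_cluster stands for the mapping from paths to Path Cluster Identifiers maintained by the structural summary; the function name is hypothetical.

```python
def pci_token_set(root_to_leaf_paths, path_to_cluster):
    """Structural token bag for PCI-S: one PCI token per root-to-leaf path,
    so the token set size never exceeds the number of paths in the tree."""
    return [path_to_cluster[p] for p in root_to_leaf_paths]
```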

We now analyze the results of the comparison of weighting schemes. The main observation is that UNWEI and IDF perform consistently better than TF and TF-IDF. This result might be due to the use of Jaccard for UNWEI and IDF. In general, the unweighted and IDF-weighted versions of Jaccard show very similar results for all similarity functions, with a slight advantage for the former. It seems that the distribution of structural tokens does not provide any significant advantage for IDF-weighting schemes. We also note that TF significantly outperforms TF-IDF in all token set representations. Similar results have been observed in the context of clustering web pages [27]. The common conjecture is that TF-IDF over-emphasizes the rarest terms; in our case, we believe that mismatches on such over-weighted (structural) tokens penalize the similarity result too much.

7.4 Accuracy Results for Text and Structure Combination

Here, we compare approaches to combining textual and structural similarity. We evaluate the methods presented in Section 5.3: score-level combination (SLC) and token-level combination (TLC). For SLC, we report the results using the IDF weighting scheme with Jaccard as similarity function (as initially defined). We did not observe any accuracy gain by using unweighted structural tokens. Also, normalizing the score results provided no accuracy enhancement, so the results reported were obtained without any normalization method. In addition, we also compare the epq-grams (EPQ) approach [28], an extension of pq-grams to capture textual information. Note that pq-grams and, in turn, EPQ can only handle ordered XML trees (in contrast to windowed pq-grams or our PCI-based methods). However, as we did not apply node-swapping operations when generating the datasets, our comparison here is fair. The EPQ approach employs so-called path predicates, a different mechanism to support selection of specific document parts used for textual representation. Path predicates support partial matching (e.g., //a), but not approximate matching as PCt selection queries do; see [28] for details. Finally, to better understand the effects of combining structural and textual similarity, we also report the results of evaluating PCI-S and PCI-T in isolation. A sketch of the two combination strategies follows the parameter settings below.

The following PCt selection queries (path predicates) were used for the PCI-based methods (epq-grams): /dataset/name (Nasa), /Entry/Ref/Author (SwissProt), and /ProteinEntry/sequence (PSD). We only report the results for moderate-error datasets due to space constraints. We used q = 3 for the PCI-based functions and p = 2 and q = 3 for EPQ. In the experimental charts, we indicate for SLC the structural weight (λs) we used to achieve the best results; for example, for SwissProt, we obtained better performance with a structural (textual) weight of 0.7 (0.3).

Figure 5 shows the results. TLC has the best results, in general, and exhibits good performance on all datasets.

[Figure 5 omitted: bar charts of MAP for PCI-S, PCI-T, TLC, SLC, and EPQ; SLC shown with λs = 0.5 on Nasa and PSD and λs = 0.7 on SwissProt. Panels: (a) Nasa (MAP 0.9–1.0); (b) SwissProt (MAP 0.6–1.0); (c) PSD (MAP 0.5–1.0).]

Figure 5: Complete accuracy results on moderate-error datasets

Table 3: Average token set size for TLC and EPQ

       SwissProt   PSD
TLC    198         432
EPQ    417         903

Moreover, TLC shows better accuracy than PCI-T and PCI-S in isolation in all settings. Even when PCI-T (on SwissProt) or PCI-S (on PSD) provided poor results, the overall accuracy of TLC never worsened. This result is similar to the findings of [26], which observed that good document representations tend to be robust when combined with other, poorly performing representations. SLC obtained good results except for SwissProt. For this dataset, PCI-T also showed very poor results. A closer examination revealed that the clustering method grouped some unrelated paths in the same cluster. In this situation, TLC was able to leverage the results of PCI-S and did not suffer any significant degradation w.r.t. accuracy. Finally, EPQ provided good results on PSD and Nasa; it was even able to slightly outperform TLC on Nasa. Similarly to the PCI-based methods on SwissProt, the path-predicate mechanism was not able to correctly locate its text nodes.

7.5 Performance Results

This last experiment quantifies the runtime performance of TLC, SLC, and EPQ for varying query thresholds. We used fuzzy copies of the SwissProt and PSD datasets, each copy generated with an error extent of 30% (moderate-error) and containing 100k subtrees. We implemented the w-mpjoin algorithm, a (self) set similarity join algorithm presented in [29]. For SLC, we first evaluated PCI-T and afterwards PCI-S; therefore, this query was evaluated in conjunctive mode, where only the subtrees returned by PCI-T were considered by PCI-S. The input of w-mpjoin is a set collection sorted in increasing order of the set size; the elements of each set are sorted in decreasing order of the document frequency. We sorted the input in advance, so the input sorting time is not contained in the results.

The results are shown in Figure 6. The runtime performance of both TLC and SLC is practically identical. EPQ is slower than TLC and SLC by a factor of up to 2.5. This result reflects the average token set size produced by each approach, which is given in Table 3: EPQ delivers about double the token set size of the PCI-based approaches on both datasets. All algorithms are about 6 times faster on the SwissProt dataset. The reason is that the token sets generated from SwissProt are smaller and, more importantly, have a wider set-size distribution, which is exploited by the w-mpjoin algorithm to optimize join processing [29].

[Figure 6 omitted: Runtime experiments. Execution time (seconds) vs. similarity threshold (0.7–0.95) for TLC, SLC, and EPQ on (a) SwissProt and (b) PSD.]

7.6 Experimental Summary

PCI-S is clearly the structural XML representation of choice: it provides superior accuracy results, its performance degrades only moderately as the dataset level of “dirtiness” increases, and it is able to correlate reasonably well trees with restricted structural information. Moreover, PCI-S delivers much smaller token sets—bounded by the number of paths in a tree, the tokens lend themselves to very compact representations, and they can be generated for free with support of a PCI-equipped structural summary.

We found the Jaccard similarity to be superior to cosine. Looking at the evaluation of structural similarity in isolation, we did not observe any significant gain from using weighting schemes.

Regarding the combination of structural and textual similarity, TLC is the winner. It is able to leverage both kinds of similarity very well, providing stable results even when the textual or the structural similarity performs poorly. Moreover, TLC runs faster in our similarity join framework.

8. CONCLUSION

In this paper, we presented an XML-based similarity join framework, which is amenable to integration into native XML database systems. We exploited structural summaries and clustering concepts to represent XML documents as token sets that capture common textual and structural deviations occurring in real-world XML datasets. Remarkably, our approach delivered substantially more compact token sets than competing approaches, while providing more accurate results. In this context, we explored different ways to weigh and combine evidence from textual and structural XML representations. Finally, we addressed user interaction, when the similarity framework is configured for a specific domain, and updatability of clustering information, when new documents enter the datasets under consideration, by devising a cluster prototype structure represented as small memory-resident inverted lists.

9. REFERENCES

[1] C. C. Aggarwal, N. Ta, J. Wang, J. Feng, and M. J. Zaki. XProj: a framework for projected structural clustering of XML documents. In Proc. KDD Conf., pages 46–55, 2007.

[2] N. Augsten, M. H. Böhlen, C. E. Dyreson, and J. Gamper. Approximate joins for data-centric XML. In Proc. ICDE Conf., pages 814–823, 2008.

[3] N. Augsten, M. H. Böhlen, and J. Gamper. Approximate matching of hierarchical data using pq-grams. In Proc. VLDB Conf., pages 301–312, 2005.

[4] R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In Proc. WWW Conf., pages 131–140, 2007.

[5] M. Bilenko and R. J. Mooney. On evaluation and training-set construction for duplicate detection. In Proc. KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation, pages 7–12, 2003.

[6] D. Buttler. A short survey of document structure similarity algorithms. In Proc. Intl. Conf. on Internet Computing, pages 3–9, 2004.

[7] D. Carmel, Y. S. Maarek, M. Mandelbrod, Y. Mass, and A. Soffer. Searching XML documents via XML fragments. In Proc. SIGIR Conf., pages 151–158, 2003.

[8] A. Chandel, O. Hassanzadeh, N. Koudas, M. Sadoghi, and D. Srivastava. Benchmarking declarative approximate selection predicates. In Proc. SIGMOD Conf., pages 353–364, 2007.

[9] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In Proc. ICDE Conf., page 5, 2006.

[10] W. B. Croft. Combining approaches to information retrieval. Advances in Information Retrieval, 7:1–36, 2000.

[11] L. Denoyer and P. Gallinari. Overview of the INEX 2008 XML mining track. In S. Geva, J. Kamps, and A. Trotman, editors, Proc. INEX 2008, LNCS, 2009.

[12] S. Flesca, G. Manco, E. Masciari, L. Pontieri, and A. Pugliese. Fast detection of XML structural similarity. TKDE, 17(2):160–175, 2005.

[13] S. Guha, H. V. Jagadish, N. Koudas, D. Srivastava, and T. Yu. Integrating XML data sources using approximate joins. TODS, 31(1):161–207, 2006.

[14] M. Hadjieleftheriou, A. Chandel, N. Koudas, and D. Srivastava. Fast indexes and algorithms for set similarity selection queries. In Proc. ICDE Conf., pages 267–276, 2008.

[15] T. Härder, C. Mathis, and K. Schmidt. Comparison of complete and elementless native storage of XML documents. In Proc. IDEAS Conf., pages 102–113, 2007.

[16] M. P. Haustein and T. Härder. An efficient infrastructure for native transactional XML processing. DKE, 61(3):500–523, 2007.

[17] S. Helmer. Measuring the structural similarity of semistructured documents using entropy. In Proc. VLDB Conf., pages 1022–1032, 2007.

[18] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.

[19] S. Joshi, N. Agrawal, R. Krishnapuram, and S. Negi. A bag of paths model for measuring structural similarity in web documents. In Proc. KDD Conf., pages 577–582, 2003.

[20] J. Kamps, M. Marx, M. de Rijke, and B. Sigurbjörnsson. Articulating information needs in XML query languages. TOIS, 24(4):407–436, 2006.

[21] N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. In Proc. SIGMOD Conf., pages 802–803, 2006.

[22] M.-L. Lee, L. H. Yang, W. Hsu, and X. Yang. XClust: clustering XML schemas for effective integration. In Proc. CIKM Conf., pages 292–299, 2002.

[23] Z. H. Liu and R. Murthy. A decade of XML data management: an industrial experience report from Oracle. In Proc. ICDE Conf., pages 1351–1362, 2009.

[24] T. Milo and D. Suciu. Index structures for path expressions. In Proc. ICDT Conf., pages 277–295, 1999.

[25] A. Nierman and H. V. Jagadish. Evaluating structural similarity in XML documents. In Proc. WebDB Workshop, pages 61–66, 2002.

[26] P. Ogilvie and J. P. Callan. Combining document representations for known-item search. In Proc. SIGIR Conf., pages 143–150, 2003.

[27] D. Ramage, P. Heymann, C. D. Manning, and H. Garcia-Molina. Clustering the tagged web. In Proc. WSDM Conf., pages 54–63, 2009.

[28] L. A. Ribeiro and T. Härder. Evaluating performance and quality of XML-based similarity joins. In Proc. ADBIS Conf., pages 246–261, 2008.

[29] L. A. Ribeiro and T. Härder. Efficient set similarity joins using min-prefixes. In Proc. ADBIS Conf. (to appear), 2009.

[30] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.

[31] S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In Proc. SIGMOD Conf., pages 743–754, 2004.

[32] F. Sebastiani. Machine learning in automated text categorization. CSUR, 34(1):1–47, 2002.

[33] K.-C. Tai. The tree-to-tree correction problem. JACM, 26(3):422–433, 1979.

[34] H. R. Turtle and J. Flood. Query evaluation: strategies and optimizations. Information Processing & Management, 31(6):831–850, 1995.

[35] M. Weis and F. Naumann. DogmatiX tracks down duplicates in XML. In Proc. SIGMOD Conf., pages 431–442, 2005.

[36] W. Winkler. Overview of record linkage and current research directions. Technical report, Statistical Research Division, U.S. Bureau of the Census, 2006.

[37] K. Zhang, R. Statman, and D. Shasha. On the editing distance between unordered labeled trees. Information Processing Letters (IPL), 42(3):133–139, 1992.