international workshop on web and databasesstorage engine postgresql rank optrank storage engine...

Proceedings of the Ninth WebDB Workshop

International Workshop on Web and Databases

Chicago, Illinois June 30, 2006

Edited by: Alin Deutsch and Wenfei Fan

Foreword

The ninth edition of the International Workshop on the Web and Databases (WebDB) continues the tradition of promoting novel research on next-generation information systems, encouraging interaction between researchers and practitioners working on problems at the intersection of data management and the Web. The focus of this edition of the workshop is on fundamental issues in Web services and applications. Specifically, the WebDB Program Committee selected for publication 11 papers among 48 submissions covering the following topics of interest:

• Models and infrastructure for the management of Web services • Discovery, synthesis and composition of Web services and applications • Web search and distributed information retrieval • Web mining, exploration, and visualization • Web privacy and security • Schema matching and mapping • Ontology matching • Data integration • Integration of text into XML and relational databases • XML query processing and data management • Peer-to-peer search networks • Data stream management systems

We hope that this collection of papers will be useful to the reader interested in data management on the Web.

Alin Deutsch and Wenfei Fan

July 2006

WebDB 2006 Organisation

Workshop Co-Chairs Alin Deutsch Univ of California, San Diego, USA Wenfei Fan Bell Labs, USA and Univ of Edinburgh, UK Program Committee Serge Abiteboul INRIA and LRI-Univ Paris 11, France Denilson Barbosa Univ of Calgary, Canada Michael Benedikt Bell Labs, USA Phil Bohannon Bells Labs, USA Susan Davidson Univ of Pennsylvania, USA Floris Geerts Univ of Edinburgh, UK Georg Gottlob Technical Univ of Vienna, Austria Vagelis Hristidis Florida International Univ, USA Zack Ives Univ of Pennsylvania, USA H. V. Jagadish Univ of Michigan, Ann Arbor, USA Anastasios Kementsietsidis Univ of Edinburgh, UK Christoph Koch Univ of Saarland, Germany Hank Korth Lehigh University, USA Maurizio Lenzerini Univ "La Sapienza", Italy Ioana Manolescu INRIA, France Gerome Miklau Univ of Massachussetts Amherst, USA Frank Neven Limburg Univ., Belgium Raymond Ng Univ of British Columbia, Canada Michalis Petropoulos SUNY Buffalo, USA Dan Suciu Univ of Washington, USA Wang-Chiew Tan Univ of California, Santa Cruz, USA Victor Vianu Univ of California, San Diego, USA Limsoon Wong National Univ of Singapore Proceedings Chair Dayou Zhou Univ of California, San Diego, USA Web Chair Yannis Katsis Univ of California, San Diego, USA Workshop Web Site http://db.ucsd.edu/webdb2006

http://db.ucsd.edu/webdb2006

Table of Contents Invited Talks…………………………………………………………………………………1 Rick Hull (Bell Labs) and Renee Miller (University of Toronto) Structural Text Search and Comparison Using Automatically Extracted Schema……..2 Michael Gubanov (University of Washington), Philip A. Bernstein (Microsoft Research) Exploiting Community Behavior for Enhanced Link Analysis and Web Search.……....8 Julia Luxenburger, Gerhard Weikum (Max Planck Institute of Informatics) Vision-based Web Data Records Extraction………………………………………….…..14 Liu Wei, Meng Xiaofeng, Meng Weiyi (Renmin University of China) Answering Structured Queries on Unstructured Data……………………………….….20 Jing Liu, Xin (Luna) Dong (University of Washington), Alon Halevy (Google Inc.) Twig Patterns: From XML Trees to Graphs…………………………………….…….…26 Benny Kimelfeld, Yehoshua Sagiv (Hebrew University) The Meaning of Erasing in RDF under the Katsuno-Mendelzon Approach………..….30 Claudio Gutierrez, Carlos Hurtado, Alejandro Vaisman (Universidad de Chile) Amoeba Join: Overcoming Structural Fluctuations in XML Data………………….….38 Taro Saito, Shinichi Morishita (University of Tokyo) Replication-Aware Query Processing in Large-Scale Distributed Information Systems………………………………………………………………………………….…..44 Jie Xu, Alexandros Labrinidis (University of Pittsburgh) Automatic Tuning of File Descriptors in P2P File-Sharing Systems…………………...50 Dongmei Jia, Wai Gen Yee, Ophir Frieder (Illinois Institute of Technology) KISS: A Simple Prefix Search Scheme in P2P Networks…………………….…………56 Yuh-Jzer Joung, Li-Wei Yang (National Taiwan University) Global Document Frequency Estimation in Peer-to-Peer Web Search………………..62 Matthias Bender, Sebastian Michel (Max Planck Institute of Informatics), Peter Triantafillou (RACTI and University of Patras), Gerhard Weikum(Max Planck Institute of Informatics)

WebDB 2006 Invited Speakers

New Challenges in Schema Mapping

Renee Miller

Renée J. Miller received BS degrees in Mathematics and in Cognitive Science from the Massachusetts Institute of Technology. She received her MS and PhD degrees in Computer Science from the University of Wisconsin in Madison, WI. She received the 1997 Presidential Early Career Award for Scientists and Engineers (PECASE), the highest honor bestowed by the United States government on outstanding scientists and engineers beginning their careers. She received the National Science Foundation Early Career Award (formerly, the Presidential Young Investigator Award) for her work on heterogeneous databases and was named an Ameritech Faculty Fellow. Her research interests are in the efficient, effective use of large volumes of complex, heterogeneous data. This interest spans heterogeneous databases, knowledge curation and data sharing. She is a Professor and the Bell University Lab Chair of Information Systems at the University of Toronto.

Towards a Unified Model for Web Service Composition

Rick Hull

Rick Hull is the Director of the Network Data and Services Research Department (formerly named "Database Systems Research Department") at Bell Laboratories, the research and development division of Lucent Technologies. He received B.A. degrees in Mathematics and Philosophy from the University of California, Santa Barbara, and M.S. and Ph.D. degrees in Mathematics from the University of California, Berkeley. He served on the faculty of Computer Science at the University of Southern California from 1980 until 1994. From 1993 to 1996 he was a visitor in the Department of Computer Science at the University of Colorado, Boulder. Also, he has been a frequent visitor to the Verso Group at INRIA, near Paris, France. Prior to joining Bell Labs, his research was supported in part by grants from NSF, ARPA, AT&T, and U S WEST. Hull works in areas related to the convergence of data and services, including research on e-services, workflow, policy management, personalization, data integration, telecom applications, languages, and theory. He is co-author of the book "Foundations of Databases" (Addison-Wesley, 1995), and is (co-)author of over 100 refereed journal and conferences articles.

http://www.cs.toronto.edu/~miller/pecase.html

http://www.nsf.gov/cgi-bin/showaward?award=9702974

http://www.nsf.gov/cgi-bin/showaward?award=9702974

http://db.bell-labs.com/index.html

http://www.bell-labs.com/

http://www.lucent.com/

http://www.usc.edu/dept/cs/

http://www.usc.edu/

http://www.cs.colorado.edu/

http://www.colorado.edu/

http://www-rocq.inria.fr/verso

http://www-rocq.inria.fr/verso

http://www-rocq.inria.fr/

Structural text search and comparison using automaticallyextracted schema

Michael GubanovUniversity of Washington

[email protected]

Philip A. BernsteinMicrosoft Researchg

[email protected]

ABSTRACTAn enormous amount of unstructured information is presenton the web, in product manuals, e-mails, text documents,and other information sources. However, there is not enoughsupport to automatically infer sufficient structure from thesedata sources to be able to pose queries comparable in powerto SQL.

We present a prototype of new text database manage-ment system capable to automatically infer schema fromtext using natural language processing. It leverages ex-tracted schema by supporting powerful structural search andfuzzy join operator between extracted entities.

1. INTRODUCTIONDespite vast amount of unstructured data on the web,

keyword-search [7] is often the only way to find needed infor-mation. PageRank - Google’s algorithm to rank web pagesand display the best ranked pages first to the user currentlydepends on more than 500 million variables and 2 billion(!)terms. In addition, PageRank also analyzes the full contentof a page and factors in fonts, subdivisions, the precise loca-tion of each word, and the content of neighboring web pages[1]. To summarize, it is, probably, the world’s most complexalgorithm and it is getting even more complicated every day,because Google is working to improve it!

By contrast, System R [16] was the first relational data-base management system prototype that introduced a re-lational algebra engine for storing and querying structureddata. Structured Query Language(SQL) - a powerful lan-guage was born from relational algebra with the purpose toquery structured data represented as entities with attributesand relationships between them or a database schema. SQLmade possible to focus user query to a specific structurewithin the database schema and retrieve quickly only theneeded information thus leveraging the structure and get-ting focused and precise answers to the query. Of course, ifneeded, it is possible to do keyword-search over databaseslargely ignoring available structure (e.g. [4]).

Inferring structure (or schema) is a key problem in anysolution trying to support a richer query language than key-word search over unstructured data in order to provide morefocused and precise results. Our main contributions are thefollowing novel algorithms that comprise a new text data-base management system (TDBMS).

Copyright is held by the author/owner. Ninth International Workshop on theWeb and Databases (WebDB 2006), June 30, 2006, Chicago, Illinois..

It is implemented on top of a relational engine.

• a fully automatic algorithm to extract schema fromtext and perform structural search using the schema

• algorithms for fuzzy join between entities in the ex-tracted schema and ranking join results

• a concept matching algorithm to detect similar con-cepts expressed by different words in text

We apply our results to perform structural search andautomatic software comparison by extracting schema fromfreely available product manuals. We convert the manualsto plain text before running the algorithms (plain text filesare 3-4 Mb each).

InnoDB

InnoDB offers all four transaction isolation levels de-scribed by the SQL standard

InnoDB provides full ACID compliance

InnoDB supports multiple granularity locking whichallows coexistence of record locks and locks on entiretables

...

Table 1: Structural search on InnoDB/supports

Structural search is focused on the extracted structure andtherefore returns more precise results. For instance, havingextracted the schema from the MySQL manual we can issuea selection query on entity InnoDB1 and its attribute support(and its synonyms and grammatical forms). This returns 26sentences, three of which are shown in Table 1. By contrast,keyword search on the same data using keywords InnoDB+ (provide or support or offer) returns near 50 sentences, ≈35% of which were not matching the focus of a structuredquery ( e.g. “You can omit these command lines if youto not require InnoDB or BDB support”). Thus, structuralsearch is more focused and therefore performs more preciselythan keyword-search at the expense of coverage (in this case,missed 4 useful sentences (e.g. “Support for XA transactionsis available for the InnoDB storage engine”).

Next, we applied our fuzzy join algorithm to compare in-dexing support in PostgreSQL and MySQL database servers.Comparing software is a widely known complex problem. Itis especially important for large enterprises that commonlyhave to choose between expensive products. Usually, the

1one of the MySQL storage engines

Storage engine PostgreSQL rank optrank

Storage engine allowable index types are MyISAM: B-TREE, InnoDB: B-TREE, Memory: HASH B-TREE

PostgreSQL provides several index types: B-tree R-tree Hash GiST

4.2 1 615

MEMORY storage engine implements both HASH andB-TREE indexes

The PostgreSQL query planner will considerusing an R-tree index whenever an indexedcolumn is involved in a comparison using oneof these operators: ¿ À ¿| À| ...

1 120

1 120

All storage engines support at least 16 indexes pertable and a total index length of at least 256 bytes

PostgreSQL supports partial indexes with ar-bitrary predicates, so long as only columns ofthe table being indexed are involved

0.1 120

Table 2: Compare indexing support of MySQL and PostgreSQL by joining Storage engine and PostgreSQLconcepts

problem is solved by hiring specialists in the art or by sub-contracting to an external company to do the analysis man-ually. Our TDBMS can be used to alleviate this problemby partially automating this task. Table 2 shows top dis-tinct results of comparing MySQL and PostgreSQL databaseservers by supported index types (ranking is described inSection 4). This is done by joining the entity Storage enginein the text database generated from the MySQL manualwith the entity PostgreSQL from the database generatedfrom the PostgreSQL manual. Clearly, this is not an ex-haustive comparison of two database servers. However, itdoes demonstrate what can be done automatically, withouthuman intervention to help resolve this complex problem.

As a general algorithm operating on concepts extractedfrom text fuzzy join can be used to automatically compareany concepts from any text. For example, it also can beapplied to compare skills in resumes, product features inmanuals, or concepts in research papers.

Finally, we applied the concept matching algorithm tomatch similar concepts expressed by different words in twodatabase servers manuals. For instance, the following con-cepts were detected to be similar even though they do nothave any textual similarity: command and statement, youand user, section and chapter, storage engine and Post-greSQL.

Similarly to fuzzy join, concept matching is a general algo-rithm operating on extracted concepts and it can be appliedto any text. More generally, it can be used as a documentsimilarity semantic metric.

The rest of the paper is structured as follows. Section2 describes automatic schema extraction algorithm and ex-perimental results. Structural search and intuition behindthe algorithm is described in Section 3. Automatic com-parison, fuzzy join and ranking of join-results are in Section4. Our concept matching algorithm and experimental re-sults are described in Section 5 in more detail. Section 6describes query language for the new TDBMS. We reviewrelated work in Section 7 and conclude in Section 8.

2. AUTOMATIC SCHEMA EXTRACTIONWe use natural language processing to parse the MySQL

and PostgreSQL manuals and incrementally construct twoseparate schemas using a state-of-the art English sentencesparser.

The grammar in Figure 1 shows the parsing algorithm.Each sentence S is split into an arbitrarily long sequenceof N noun phrases and V verb phrases. For instance, the

S → NVS N εV → ε

Figure 1: Sentence parsing grammar

sentence “MySQL supports indexes widely used to improveperformance” will be split into the sequence“MySQL[N1],supports[V1], indexes[N2], used to improve[V2],performance[N3]”. After that, we load all the parsed sen-tences into the table T with columns (N1, V1, N2, V2, ...) ina relational databases. Each text file is processed separatelyand is loaded in a separate table.

To start schema extraction, we notice that the first col-umn in T usually contains the subject of a parsed sentence(MySQL in the example above). Clearly, it is not alwaystrue, e.g. for questions or complex sentences, but it is stillmore the rule than the exception. We leverage this to retrivethe main concepts (sentences’ subjects) of the documentsloaded into T by taking the most frequent values from T.N1.

A better metric, called concept weight counts only distinctsentences for a given subject. This is a stricter metric, be-cause it ignores all sentences that have the same predicateand object for a given subject and therefore promotes theconcept only for participating as a subject in substantiallydifferent sentences. Table 3 illustrates the main conceptsextracted from two database servers’ user manuals, sortedby concept weight. Notice, that all top concepts are ex-actly what one would expect for a database server manual.Moreover, the lists were generated independently from twodifferent text databases (each populated from a manual, ex-cluding noise words), but the top concepts are very similar(see Table 3).

We further construct the concept structure by extractingthe most frequent actions it can perform and defining themto be its attributes. They can be extracted from T by tak-ing the most frequent values of T.V1 for a given concept inT.N1. For instance, the action allows occurred 7 times asa predicate in the sentences where MySQL was the subject.Similarly to concept weight discussed above a better metricis to count the number of distinct objects that appear inthe sentences with a given subject and predicate. This isa better metric (define it to be attribute weight) than pred-icate frequency, because it ignores the sentences that havethe same predicate and object for a given subject. Table 4illustrates the extracted structure of two concepts MySQLand You sorted by attribute weight.

Concept Weightyou 350we 105function 50PostgreSQL 42option 34query 24table 22command 21file 19server 19user 19application 18value 17system 17database 16index 15frontend 14view 14column 14row 13... ...

Concept Weightyou 990we 142MySQL 110server 58table 57statement 50option 42value 40Innodb 38file 33function 31variable 25column 23section 22query 20client 19slave 18user 18mysqld 18privilege 18row 17... ...

Table 3: MySQL and PostgreSQL concepts

The set of structured concepts inferred from parsed textusing the algorithms above is the schema for our text data-base. Notice, that no data is stored in the inferred schema.Parsed sentences are in table T (N1, V1, N2, V2, ...) in the re-lational database.

Also, notice that this is a very general algorithm, since itrelies only on the natural language sentence structure anddoes not depend on any specific terms, words or patterns[6]. It is also not restricted to any specific area of interestand therefore works for text on any topic.

3. STRUCTURAL SEARCHBelow we describe how we can do structural search by

leveraging inferred schema.Consider the problem to find in MySQL manual (auto-

matically) what InnoDB (a MySQL storage engine) sup-ports. One of the approaches would be to do keyword searchthrough the manual on ’InnoDB’ + (provide or support oroffer). This returns near 50 sentences, ≈ 35% of which donot match the focus of structured query (e.g.“You can omitthese command lines if you to not require InnoDB or BDBsupport”).

On the other hand, we can use the schema extracted fromMySQL manual and issue a selection query on entity Inn-oDB and its attribute support. It will be automaticallymapped by our engine into a select statement on the under-lying relational table T (N1, V1, N2, V2, ...) and retrieve thesentences that contain InnoDB as a subject and provide orsupport or offer or their synonyms and derived forms as apredicate. This returns 26 sentences, three of which areshown in Table 1. We missed only 4 useful sentences (e.g.‘Support for XA transactions is available for the InnoDBstorage engine’), but filtered out ≈ 20 that do not match thefocus of structured query. Thus, structural search performsmore focused and therefore more precise at the expense ofcoverage.

You Weighthave 34can use 23must own 14need 14must have 10can create 8can do 8get 6want 6write 6will need 4... ...

MySQL Weightuses 40supports 21is 10has 7converts 7does not support 6can use 6allows 6creates 5provides 4handles 4... ...

Table 4: Concept structure

4. AUTOMATIC COMPARISONConsider another problem of comparing two concepts from

text. For example, let us compare Storage engine from theMySQL manual and PostgreSQL from the PostgreSQL man-ual. Similarly to the previous section, as a first approach,consider doing keyword search by Storage engine over theMySQL manual and by PostgreSQL over the PostgreSQLmanual and matching all the resulting sentences. Thereshould be good matches, but it is very hard to find themin a large result set even with high quality sentence match-ing and further ranking. This is because it is important formatching where in the sentence the matching terms occur.

The first important (and probably obvious) observationis that by the nature of language sentences are about sim-ilar concepts if their subjects are similar or mean similarconcepts (see the next section on how to detect similar con-cepts expressed using different words). The next importantobservation is that if two sentences, in addition to havingsimilar subject, have the same or similar predicate in com-mon, they should match even better, because they are abouta similar concept that does a similar action. So, how can weleverage this inferred structure to filter them out and matchbetter?

To do much more precisely than just using keywords, con-sider selecting the sentences only with the subject containingstorage engine for the MySQL manual and PostgreSQL forthe PostgreSQL manual and predicate containing verbs is,support, implement, has, can or their synonyms and derivedgrammatical forms (e.g. are, provide(s), allow(s), ...). Thissignificantly reduces the number of retrieved sentences andmakes both result sets strictly focused on these two conceptsperforming specified actions.

Next, we describe two algorithms for fuzzy join betweenthese two sentence sets and a ranking function to sort joinresults.

4.1 Fuzzy join and ranking functionTo match most similar sentences, consider tokenizing the

sentences’ tails2 for both sets into words and then extract-ing stems from these words. This results in a set of stemsfor each sentence in both sentence sets. After that, we jointhe sentences from both sets pairwise by matching the ex-tracted stems for each sentence and sorting the join resultby computed join rank. Only the best match for each sen-

2a sentence without subject, predicate and other verbs -T.N2 + T.N3 + T.N4 + ...

tence is included in join result. Table 2 illustrates joiningStorage engine from MySQL manual with PostgreSQL fromPostgreSQL manual on is, support, implement, has, can andtheir derived grammatical forms. To further narrow the fo-cus of comparison, we require the sentences to contain thekeyword index. This guarantees the concepts are comparedonly on indices.

To rank join results we use the following ranking function:

joinrank(s1, s2) =

nXi=1

kti ·1

weight(ti)

where s1, s2 are the joined sentences, n is the number oftokens in one of the sentences tails (assume w.l.o.g. s2), ti

is the token, kti is the number of matches that generatestthi stem, weight(ti) is tth

i concept weight computed duringautomatic schema extraction.

This formula conveys the idea that the matches of morespecific concepts are more valuable than those of generalones (therefore weight(ti) is in the denominator). In addi-tion, it encourages multiple matches of the same term(orstem) by multiplying the inverse weight by the number ofsuccessful matches. For example, the top row in Table 2 hasthe following rank: 3 · 1/3 + 3 · 1/1 + 1 · 1/5 = 4.2. SinceB-tree from s2 tail matched b-tree from s1 three times andthe concept b-tree has concept weight 3, therefore the firstsum item is 3 · 1/3. The stem of R-tree tree, matched threetimes and R-tree has weight 1. Hash matched once and hasweight 5, which sums up to 4.2.

The complexity of this join algorithm is O(n2) where n isthe number of words in text. We can significantly improveperformance by using full-text indexing on sentence tails byslightly sacrificing precision. Consider the same algorithm,which instead of counting the number of matches kti doesfull-text index search on each token stem. This will raise thecomplexity to O(log(m)) · smax · s ≤ O(log(n)) · smax · s ≤O(log(n) · n

smin) ∝ O(nlog(n)) where m is the number of

distinct indexed terms (whose upper bounded is the num-ber of words n), smax is the constant specifying the numberof words in the longest sentence, s is the number of sentencesto join (which has upper-bound n

smin), n is the number of

words in text, and smin is the number of words in the short-est sentence.

The second algorithm performs much faster at the expenseof slightly decreased accuracy, because full-text index lookupis unable to detect the number of matches. We computejoinrank for the second algorithm by using the same formulaand setting kti ≡ 1, ∀i (optrank column in Table 2).

5. CONCEPT MATCHINGTo detect similar concepts in text that are expressed us-

ing different words we present a concept matching algorithmthat works on the extracted schema. For instance, it can de-tect that the concept PostgreSQL in the PostgreSQL manualis similar to MySQL and InnoDB3 from the MySQL manualeven though there is no textual similarity between them.

The basic intuition behind the algorithm is that two con-cepts are similar if they do similar actions on similar objects.In terms of inferred schema (see Table 4) this implies that:

• the more attributes two concepts have in common (e.g.PostgreSQL and MySQL both have supports attribute)

3MyISAM, InnoDB, Memory are MySQL storage engines

Concept1 Concept2 Simsection chapter 4.62MySQL PostgreSQL 2.75you user 1.75server PostgreSQL 1.66InnoDB PostgreSQL 1.54statement command 0.29MaxDB PostgreSQL 0.21MySQL query 0.14... ...

Table 5: General concept matching

Concept1 Concept2 SimStorage engine PostgreSQL 0.12MyISAM PostgreSQL 0.12index PostgreSQL 0.036table PostgreSQL 0.036column PostgreSQL 0.036user PostgreSQL 0MySQL PostgreSQL 0... ...

Table 6: Focused concept matching

• the more similar objects in the original sentences cor-respond to the common attributes (e.g. PostgreSQLsupports indexes ...; MySQL supports indexes ...)

then the more similar the concepts are.In addition, there is a difference between detecting gener-

ally similar concepts and detecting concepts that are simi-lar in some specific way (focused concept matching). Con-sider comparing indexes in PostgreSQL and MySQL by us-ing fuzzy join between concepts PostgreSQL and MySQL.The result will be a table similar to Table 2, however itwill contain many fewer rows. This is because supportedindex types in MySQL server depend on the storage en-gine3, whereas PostgreSQL does not support multiple stor-age engines. Therefore the majority of sentences about in-dex types have a specific storage engine as a subject insteadof MySQL. Therefore, the concept PostgreSQL has more incommon with the concept Storage engine than with MySQLfor purposes of comparing indexes, whereas in general it ismore similar to MySQL.

Briefly, the algorithm works by iterating over the con-cept list (Table 3) and accumulating (for each concept) alldistinct predicates that appear in the sentences containinga concept in the subject. After that it computes pairwiseconcept similarity using the following similarity function:

sim(c1, c2) = m ·mX

i=1

1

w(ai)+ n2 ·

mXi=1

log wmaxw(ai)Pn(i)

j=1 w(oij)

where m is the number of distinct attributes ai two con-cepts c1, c2 have in common, n is the number of commondistinct objects, and n(i) is the number of distinct commonobjects for ith common attribute.

The first summand is a sum of inverse weights of con-cepts’ common attributes multiplied by the number of dis-tinct common attributes. Having a rare attribute in com-mon is more important than having a frequent one. This iswhy inverse weights are summed.

If in addition to the attribute (such as supports in theexample above), the concepts have a common object (e.g.indexes), it is taken into account by second summand, whichis a sum of inverse weights of distinct common objects. Ac-cording to the experiments, it is considered to be quadrat-ically more important than having common attributes andis reflected by multiplying the second sum by n2.

The absolute value how much a common object contributesto concept similarity also depends on the attribute to whichthe common object corresponds. This is the intuition behindwhy all sum members are weighted by log(wmax/w(ai)).How important for concept similarity an attribute is de-termined by the ratio of the absolute maximum attributeweight wmax and the attribute weight w(ai). log is usedto reduce potentially large absolute values for rare objects.Several similar concepts are shown in Table 5 as an exampleof applying this metric.

Focused concept matching works similarly, except it ac-cumulates for each concept all the distinct predicates thatappear in the sentences containing a concept in the subjectand a specific focus keyword in the object. The results ofconcept matching focused on indexes (focus keyword is in-dex) are in Table 6.

Our prototype caches concept similarity and suggests sim-ilar concepts to the user for each join query. For example,for a query PostgreSQL fuzzy join MySQL on index alongwith the join result it will output as a suggestion top simi-lar concepts for both PostgreSQL and MySQL resulted fromfocused matching (i.e. storage engine, MyISAM).

6. QUERY LANGUAGEIn this section we describe the query language of the new

text database management system (TDBMS) prototype. Itsschema consists of entities with attributes. Table 4 illus-trates two sample entities You and MySQL and their at-tributes.

Two types of queries are currently supported - selectionand fuzzy join queries. Both of them return a set of sentencesas a result set. The result set for fuzzy join queries is sortedby join rank (described in Section 4.1) in descending order.Everything that is in square brackets is optional.

• e1[, .., en]/[a1, .., am] [where K]

• e1[, .., en]/a1[, .., am]fuzzy join e1[, ..,ek]/a1[, .., al] [on K]

The first one is a selection query on entities e1, .., en andtheir attributes a1, .., am or its synonyms and grammaticalforms. As an example consider the query InnoDB/supportsand its result set in Table 1. A selection query on several en-tities and/or attributes returns the union of results of selec-tion queries from all the entities and attributes involved. Forinstance, the query MyISAM, InnoDB/requires, allows willreturn all the sentences containing MyISAM or InnoDB inthe subject and requires or allows (and its synonyms andgrammatical forms) in the predicate. If where K clause isspecified the result set is filtered by requiring all the sentencetails4 to contain the set of keywords K.

The second one is a fuzzy join query. First, it will selecttwo sentence sets according to the semantics of the selection

4a sentence without subject, predicate and other verbs -T.N2 + T.N3 + T.N4 + ...

queries. If K - a set of keywords is specified, all the sen-tences in both intermediate result sets will be restricted tocontain K in their tails (equivalent to specifying where Kclause for both participating selection queries). Finally, itwill execute fuzzy join algorithm between two intermediateresult sets and sort the output by join rank in descendingorder. As an example, consider the queryStorage engine/supports, is, implements fuzzy joinPostgreSQL/supports, is on index and its result set in Ta-ble 2.

7. RELATED WORKIn [6] Brin introduced an algorithm to extract tuples from

the Web that are similar to a small“training set” of pairs(e.g. ¿ author, title À). The basic idea was to match agiven set of tuples against web pages to generate a generalpattern that can be further bootstraped to retrieve moreresults. In [3] Agichtein extended this idea by using named-entity tagging (e.g. location, organization), weighting andconfidence metrics to compose better patterns. Downeyet al. in [11] suggested using learned patterns as extrac-tors and discriminators to improve both coverage and ac-curacy of information extraction. Banko et al. in [5] de-scribed a question-answering system reformulating the spe-cific questions into likely substrings of declarative answersto the question and submitting them to Google. E.g. for“Where is the Louvre Museum located” the reformulationswill be“+the Louvre Museum +is located”,“+the LouvreMuseum +is +in”, “+the Louvre Museum +is near”, etc.Ask.com is, probably, the most widely known commericialquestion-answering system that also works by reformulatingspecific questions and matching the resulting phrases againstthe Web.

Our approach is substantially different as it is neither re-stricted to a specific pattern format, nor aims to extract aspecific relation or answer a specific question.

Etzioni et al. in [13] uses a more general approach to ex-tract hyponym relation from Web pages by using domainindependent patterns (cf. Hearst [14]). Crescenzi in [9] triesto generalize wrapper generation by matching dynamicallygenerated HTML pages of data-intensive web sites (e.g. on-line book stores) and approximating the underlying data-base schema. Mindnet [17] is, probably, the most generalsystem to automatically construct an approximation of asemantic network [10] that is a graph representing seman-tic relationship between words. [12] is a question-answeringsystem capable to construct a meta-repository of semanticobjects either automatically by extracting triples of form¿ N, V, N À from text or manually through the user in-terface. It is able to match user requests to its semanticobjects and output relevant results. In [8] Cafarella et al.build a large-scale (from 90-million Web-page corpus) ex-traction graph from triples similar to [12] and show experi-mental results of keyword-based spreading-activation searchwith depth 1 over this graph.

Our approach is substantially different from [17], [12],and [8], because instead of constructing a semantic net-work, meta-repository or extraction-graph, we first extractthe main concepts from text and infer schema that is builtaround them reflecting their structure. Therefore, we areable to support structural search, join as well as conceptmatching that are not available in either [17], [12] or in [8].

[15] is an exhaustive survey of typical web data extraction

approaches and tools classified into six groups: Languagefor wrapper development, HTML-aware tools, NLP-basedtools, Wrapper induction tools, modelling-based tools, andontology-based tools. Finally, [2] is a search-engine thatleverages NLP to resolve part-of-speech-, phrasal-, and con-textual ambiguity and provide better search experience.

8. CONCLUSIONIn this paper we presented a new text database manage-

ment system (TDBMS) based on novel algorithms to au-tomatically extract schema from text, perform fuzzy joinbetween extracted entities, and detect similar semantic con-cepts expressed using different words (Table 5, Table 6). Weapplied it to perform powerful structural search (Table 1)and automatic software comparison (Table 2). Our experi-mental results justify that structural search is more focusedand therefore performs more precisely than keyword searchat the slight expense of coverage. Demonstrated results ofautomatic software comparison can be used by end-users,software consultants, and large enterprises that commonlyhave to choose between software products.

Finally, all the presented algorithms are very general be-cause they operate on concepts extracted from text. There-fore, fuzzy join can be used to automatically compare con-cepts from any text, e.g. to compare skills in resumes, prod-uct features in manuals, or concepts in research papers. Sim-ilarly, concept matching can be used as a document similaritysemantic metric.

9. REFERENCES[1] http://www.google.com/corporate/tech.html.

[2] The infocious web search engine: Improving websearching through linguistic analysis. Infocious Inc.,2005.

[3] E. Agichtein and L. Gravano. Snowball: Extractingrelations from large plain-text collections. In ACMDL, 2000.

[4] S. Agrawal, S. Chaudhuri, and G. Das. Dbxplorer: Asystem for keyword-based search over relationaldatabases. In ICDE, 2002.

[5] M. Banko, E. Brill, S. Dumais, and J. Lin. Askmsr:Question answering using the worldwide web. InEMNLP, 2002.

[6] S. Brin. Extracting patterns and relations from theworld wide web. In EDBT, 1998.

[7] S. Brin and L. Page. The anatomy of a large-scalehypertextual Web search engine. Computer Networksand ISDN Systems, 30(1–7):107–117, 1998.

[8] M. Cafarella, M. Banko, and O. Etzioni. Relationalweb search. Technical Report UW-CSE-06-04-02,2006.

[9] V. Crescenzi, G. Mecca, and P. Merialdo. Roadrunner:Towards automatic data extraction from large websites. In VLDB, 2001.

[10] F. Crestani. Application of spreading activationtechniques in informationretrieval. Artif. Intell. Rev.,11(6):453–482, 1997.

[11] D. Downey, O. Etzioni, S. Soderland, and D. Weld.Learning text patterns for web information extractionand assessment. In AAAI, 2004.

[12] M. Elder. Preparing a data source for a natural

language query. United States Patent Application20050043940, 2004.

[13] O. Etzioni, M. Cafarella, D. Downey, S. Kok,A. Popescu, T. Shaked, S. Soderland, D. Weld, andA. Yates. Web-scale information extraction inknowitall. In WWW, 2004.

[14] M. A. Hearst. Automatic acquisition of hyponymsfrom large text corpora. Technical Report S2K-92-09,1992.

[15] A. Laender, B. Ribeiro-Neto, A. Silva, and J. Teixeira.A brief survey of web data extraction tools. InSIGMOD Record, 2002.

[16] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin,R. A. Lorie, and T. G. Price. Access path selection ina relational database management system. InSIGMOD Record, 1979.

[17] L. Vanderwende, G. Kacmarcik, H. Suzuki, andA. Menezes. Mindnet: An automatically-createdlexical resource. In HLT/EMNLP, 2005.

Exploiting Community Behavior for Enhanced LinkAnalysis and Web Search

Julia LuxenburgerMax-Planck Institute of Informatics

Stuhlsatzenhausweg 85Saarbrücken, Germany

[email protected]

Gerhard WeikumMax-Planck Institute of Informatics

Stuhlsatzenhausweg 85Saarbrücken, Germany

[email protected]

ABSTRACTMethods for Web link analysis and authority ranking suchas PageRank are based on the assumption that a user en-dorses a Web page when creating a hyperlink to this page.There is a wealth of additional user-behavior informationthat could be considered for improving authority analysis,for example, the history of queries that a user communityposed to a search engine over an extended time period, orobservations about which query-result pages were clicked onand which ones were not clicked on after a user saw the sum-mary snippets of the top-10 results.This paper enhances link analysis methods by incorporatingadditional user assessments based on query logs and clickstreams, including negative feedback when a query-resultpage does not satisfy the user demand or is even perceivedas spam. Our methods use various novel forms of advancedMarkov models whose states correspond to users and queriesin addition to Web pages and whose links also reflect therelationships derived from query-result clicks, query refine-ments, and explicit ratings. Preliminary experiments arepresented as a proof of concept.

Categories and Subject DescriptorsH.3.3 [Information Storage and Retrieval]: InformationSearch and Retrieval—relevance feedback, retrieval models

Keywordsnegative feedback, link analysis, web search, query logs

1. INTRODUCTIONImproving the ranking of web search results by means of

link analysis and derived authority scores has become a defacto standard, with the PageRank [8] algorithm being themost prominent approach. However, the increasing amountof web spam and the continuous growth of low-quality websites are major impediments to the viability of authorityranking in a world of exploding information and demand-ing users. On the other hand, the users’ assessments of webpages should not be limited to the implicit endorsements bylinks. Rather, users can contribute in the form of explicitfeedback, by marking search results as relevant, implicitly

Copyright is held by the author/owner. Ninth International Workshop onthe Web and Databases (WebDB 2006), June 30, 2006, Chicago, Illinois.

by clicking on search results, visiting certain pages (clickstreams), by blogs, wikis, and so forth. Moreover initiativesare arising towards a tagged web in which hyperlinks are nolonger purely based on navigational purposes but augmentedby semantic meaning, in its simplest form by ”like” and ”dis-like” statements [7]. This calls for novel forms of extendedauthority analysis to harness the newly arising ways of as-sessments, especially expressions of disliking a page, which,to our knowledge, have not been addressed in the context ofauthority analysis.

PageRank completely ignores the different intentions thatlead a web page author to create a hyperlink which maybe purely navigational, or of recommending or disapprovingflavor. The PageRank algorithm mimics a random surferwho starts on some page, then browses the web by followingoutgoing hyperlinks uniformly at random with probabilityǫ, or re-starts by a random jump with probability 1−ǫ (withuniformly selected jump target). This is formally modeledas a Markov chain, the unique equilibrium probability distri-bution of which yields stationary visiting probabilities, thatconsitute the vector of PageRank scores ~p. Mathematically,PageRank is cast into the equation

~p = ǫ · ~r + (1 − ǫ) · AT~p

where ~r denotes the random jump vector with∑

i ~ri = 1,and A is the row-normalized adjacency matrix defined bythe hyperlink structure of the web that already includes thetreatment of dangling nodes.

Various approaches exist for how to exploit implicit feed-back from query logs for web search. [12] employs queryclustering for the identification of frequently asked ques-tions. This method is, however, restricted to the very querycontext, and not able to take advantage of the gatheredknowledge for an improvement of search result quality ofpreviously unseen queries. [1] learns term correlations be-tween terms occuring in clicked documents and terms consti-tuting the corresponding queries for improved query expan-sion. [10] uses implicit feedback information of the currentsearch session for better estimating query language modelsinside the KL-divergence retrieval model. [5] exploits query-log data to learn retrieval functions using a Support VectorMachine (SVM) approach. [2, 3, 7] point out the semanticdifficulty of distrust propagation, but at the same time showthe potential of considering negative endorsements. In thecontext of recommender systems, [3] aims at the predictionof pairwise trust of one node into another, however, theydo not tackle the problem of absolute trust measures we ad-

-1

-1+1

+1

0

0

0

0

0

+1

Figure 1: Data model

dress. [2] proposes facilitating PageRank-style distrust prop-agation by first computing PageRank on the trust relationsand then subtracting the PageRank of sources of distruststatements from the ranks of their targets.

The approaches we propose build on our earlier work [6],based on a Markov-chain model with queries as additionalnodes, additional edges that capture query refinements andresult clicks, and corresponding transition probabilities. Thisprior work did, however, consider only positive feedback, in-ferred from a user clicking on a query result. The modelcould not express negative feedback from not clicking on aresult although a lower-ranked result was clicked on. Themethods of the current paper, on the other hand, supporta much richer model that can handle also the case of non-clicked result pages, and moreover, can capture and exploitmore general forms of negative assessment such as assigningtrust levels to Web pages (e.g., marking a Web page as spam,low-quality, out-of-date, or untrusted [3]). For example,within a PageRank-style link analysis, if many users expressdistrust in a particular page then the authority (PageRankscore mass) that this page receives from its in-link neighborsshould be reduced. A key difficulty in exploiting both pos-itive and negative assessment is that negative bias cannotbe easily expressed in terms of probabilities, as probabilitiesare always non-negative and L1-normalized. We pursue sev-eral approaches that extend standard Markov models, oneof which is based on a Markov reward model [11] where theassessment part is uncoupled from the random walk in theextended Web graph.

The rest of the paper is organized as follows. Section 2 in-troduces three different ways of integrating user assessmentsinto Markovian authority propagation models. Preliminaryexperiments on two datasets are presented in Section 3.

2. BEHAVIOR-SENSITIVE AUTHORITY

2.1 Data modelAs depicted in Figure 1, the graph model we consider is

general enough to allow for typed nodes representing differ-ent entities, as well as tagged links carrying rating informa-tion. The displayed example graph indicates web pages bysquared nodes, and queries by round ones. Directed linksconnecting them express categorical judgements , i.e., wedistinguish the three ratings positive (+1), neutral (0), andnegative (-1) - our models, however, can be easily extendedto allow for more fine-grained quantifications. Thus we con-sider three link types depending on the rating associatedwith them. Let E denote the set of all links, E+ the set oflinks carrying an positive assessment, E0 the set of neutral,and E− the set of negative links. Furthermore S is the set ofall nodes with the subsets S+ and S− denoting the sourcesof positive and negative links respectively.

2.2 QRankQRank has been introduced in [6] to exploit implicit pos-

itive feedback obtained from query logs. QRank distin-guishes between two node types, queries and web pages,and represents query-result clicks as well as query refine-ments by directed links from queries to pages and betweenqueries respectively. To cast QRank into our general datamodel, we consider a variant of QRank that ignores virtuallinks based on textual similarity between documents andqueries, repectively, and performs transitions uniformly atrandom. Random jumps are biased towards the sources ofpositive feedback whereby the bias strength is regulated bythe parameter β.

The QRank model however faces some limitations in thatit cannot model negative feedback. Assume a user marks asearch result as irrelevant for a certain query. Given thatsome other user gave positive feedback on the very same re-lation, i.e., the QRank graph already contains a link fromquery A to document B, we can model the presence of neg-ative feedback by reducing the transition probability fromA to B with respect to all other links leaving A. In the casethat there is no such link yet, we lack means to model neg-ative feedback inside QRank. In the following we present anumber of approaches that integrate negative endorsements.

2.3 QLoop∗

The first algorithm, we propose, is not based on PageRankitself, but on a slight variant, the self-loop algorithm, whichdiffers from PageRank only by the introduction of self-loopseach node performs with probability δ, i.e,

~s = ǫ · ~r + δ · ~s + (1 − ǫ − δ) · AT~s

For the difference between the induced stationary visitingprobabilities, ~πp and ~πs, we can derive the following upper-bound in terms of the L1-norm

|| ~πs − ~πp||1 ≤2 · δ

ǫ

The change in ranking order is however more important forweb search than changes in terms of absolute ranking scores.Just by reasoning on the defining equation of the self-loopalgorithm which turns into PageRank as δ → 0, we findthat authority scores under both algorithms share some basecontribution which stems from random jumps, and differ inhow much authority is propagated via incoming links. Thusself-loops reduce the influence of predecessors in favor ofsome selfishness of always keeping a fraction of own author-ity. As a consequence low-indegree nodes experience a slightboost in score with self-loops being added, while auhorita-tive pages under PageRank undergo small perturbation dueto the reduced authority that is propagated to them andrecursive changes in authority propagation. This intuitionis experimentally underpinned by comparing scores of cor-responding nodes under the two algorithms (Figure 2 plotsnodes and their logarithmic-scaled scores in descending or-der of PageRank scores).

Analogously, we may consider a self-loop augmented vari-ant of QRank, coined QLoop, forming the basis of a holisticapproach to integrate both positive and negative endorse-ments into link analysis. To infer a notion of community-level authority, we consider a hybrid method, QLoop∗, thatmodels positive ratings the way QLoop does, and translatesnegative ratings into node-specific loop and jump probabili-ties. To make successors of a punished node not benefit from

-9

-8

-7

-6

-5

-4

0 100 200 300 400 500 600 700 800

PageRankSelf-loop δ=0.15

Self-loop δ=0.3

Figure 2: PageRank vs Self-loop algorithm

changes in the self-loop probability δ, we re-distribute theremaining probability mass by increasing the respective ran-dom jump probability. The amount by which we decreasethe self-loop probability of a negatively judged node dependson the authority scores of its predecessors which are esti-mated by computing QRank in a pre-processing step. Thatway we facilitate an intertwining of assessment and author-ity propagation. To back our intuition that a decrease of theself-loop probability δ∗ of a selected node i∗ indeed results ina decreased score, we reason on the defining equation of theself-loop algorithm. Making the contributions of incominglinks and dangling pages encoded in the A matrix explicitand assuming i∗ is non-dangling, we have

~si∗ = ǫ · ~ri∗ + δ∗ · ~si∗ + (1− ǫ− δ) ·

∑

(j,i∗)∈E

~sj

oj

+∑

j∈DP

~sj

|S|

where oj denotes the outdegree of j and DP is the set ofdangling pages. Thus under the assumption that no othernode undergoes changes in ǫ and δ, δ∗ < δ implies a reducedscore for i∗.

Definition 1 (QLoop∗). Let ~πq denote the stationary

visiting probabilities under QRank, and A the adjacency ma-

trix over E+ ∪ E0 including the handling of dangling nodes.

Then QLoop∗ scores, denoted by ~qloop∗, are defined as fol-

lows

~qloop∗ = ǫ · ~r + δ · W T~qloop∗ + (1 − ǫ − δ) · AT

~qloop∗

with random jumps being biased according to

~ri =

β

|S+∪S−|, if i ∈ S+ ∪ S−

1−β

|S−(S+∪S−)|, otherwise

and self-loops adjusted according to

Case 1: ∃k ∈ S− : (k, i) ∈ E−

1. with normalization

wij =

1 −∑

k|(k,i)∈E−

~πq(k)

|l∈S|(k,l)∈E−|, if i = j

1|S|−1

·∑

k|(k,i)∈E−

~πq(k)

|l∈S|(k,l)∈E−|, if i 6= j

2. without normalization

wij =

1 −∑

k|(k,i)∈E−

~πq(k), if i = j

1|S|−1

·∑

k|(k,i)∈E−

~πq(k), if i 6= j

Case 2: ∄k ∈ S− : (k, i) ∈ E−

wij =

1 , if i = j

0 , if i 6= j

Theorem 1. QLoop∗ defines an ergodic Markov chain.

Proof. Due to space limitations we just outline the nec-essary steps. We can show that W is a stochastic matrixwhich implies that QLoop∗ defines a Markov chain. Irre-ducibility is ensured by random jumps, and aperiodicity is aconsequence of the self-loops. From Markov chain theory, weknow that a finite, irreducible and aperiodic Markov chain,is also ergodic. Thus QLoop∗ converges.

2.4 Behavior-sensitive JumpsIn resemblance to personalized PageRank [4], we propose

to integrate additional assessments into the process of au-thority propagation by the aggregation of endorsements anddisapprovals into a biased random jump vector. Thus nodesreceiving positive ratings are more often starting states of anew path the random surfer pursues than nodes judged tobe of poor quality. Let R be a matrix of rewards such that

rij =

−1, if (i, j) ∈ E−

0, if (i, j) ∈ E0

1, if (i, j) ∈ E+

Depending on how we choose to aggregate recommendationsand disfavors, we distinguish between three incarnations ofbehavior-sensitive random jump vectors, denoted by ~rBS inthe following.

Uniform: ~rBS(j) =∑

i rij

Normalized: ~rBS(j) =∑

i

rij∑

j |rij |

Weighted: ~rBS(j) =∑

i rij · ~π(i)

This first aggregation of ratings is followed by a normal-ization step, the addition of the one vector, and a final re-normalization step yielding the final jump vectors. In theweighted feedback aggregation scenario, stationary visitingprobabilities under PageRank (E0 defines the link structure)with β-biased random jumps to nodes in S− ∪ S+ serveas authority scores ~π. The following theorem gives an up-per bound in terms of L1-norm on the difference betweenthe steady-state probability distributions of PageRank andbehavior-sensitive random jumps.

Theorem 2. Let ~πBS denote the unique equilibrium prob-

ability distribution under behavior-sensitive random jumps.

Then ||~πBS − ~πp||1 ≤ ||~rBS − ~r||1.

Proof.

~πBS − ~πp = ǫ · (~rBS − ~r) + (1 − ǫ) · AT (~πBS − ~πp)

⇔ ~πBS − ~πp = (I − (1 − ǫ) · AT )−1 · ǫ · (~rBS − ~r)

Thus

||~πBS − ~πp||1 ≤ ǫ · ||(I − (1 − ǫ) · AT )−1||1 · ||~rBS − ~r||1

≤ ǫ ·1

ǫ· ||~rBS − ~r||1

≤ ||~rBS − ~r||1

2.5 Markov Reward ModelInspired by the use of Markov reward models in the field of

performance and dependability analysis, we propose to aug-ment the graph model representing the hyperlink structureof the web with an additional reward structure. Therebyeach page is associated with a reward accumulator vari-able - collectively denoted by the vector ~g, which is up-dated each time the page is visited depending on the tran-sition’s reward. This reward depends on the transition’ssource and target and is derived from the query-log andclick-stream information as well as explicit page assessments.With R = (rij) denoting a reward matrix as defined in Sec-tion 2.4, each transition along (i, j) ∈ E results in an updateof the vector ~g according to ~gn(j) = ~gn−1(j)+ rij . Then thelong-run average reward each node accumulates,

~g∞(i) = limn→∞

1

n· ~gn(i)

gives an assessment-based measure of its quality. Therebythe contribution of each rating is implicitly weighted by theauthority of its source given by how often it is visited duringa random walk. In the following we present a theorem thatallows us to compute the long-run average reward of eachnode efficiently.

Theorem 3. Let A = (aij) denote a transition probabil-

ity matrix defining a Markov chain, and ~π be the correspond-

ing stationary visiting probability distribution. Then

limn→∞

1

n· ~gn(i) = lim

n→∞

1

n·

n∑

k=1

rvkvk+1=

∑

(j,i)∈E

rji · aji · ~πj

Due to space limitations we omit the proof.

QRank computed on the total set of edges E serves as ourbaseline for the derivation of transition probabilities (aij)and stationary visiting probabilities ~π. We compute thefinal ranking scores ~πg, coined QReward, as a linear com-bination of the re-normalized long-run average reward withthe underlying authority scores as follows.

~πg = α · ~g∞ + (1 − α) · ~π

QDiscounterIn addition, we consider a slight variant of QReward, coinedQDiscounter, that is lazy in computing the long-run aver-age rewards and simply omits the multiplication with thetransition probabilities, i.e.,

limn→∞

1

n· ~gn(i) ≈

∑

(j,i)∈E

rji · ~πj

The underlying visiting probabilities ~π are computed usingQRank on E0 and β-biasing the nodes in S− ∪ S+.

Table 1 summarizes the various ranking methods we con-sider, and indicates for each method the link structure andrandom jumps it builds on, as well as the parameter it re-quires.

Algorithm Links Parameter Random Jump

PageRank E0 ǫ Uniform

QRank E0∪ E

+ ǫ, β Biased (S+)

QLoop∗

E0∪ E

+ ǫ, β, δ Biased (S+∪ S

−)

BS Jumps E0 ǫ, β Biased

QReward E ǫ, β, δ, α Uniform

QDiscounter E0 ǫ, β, δ, α Uniform

Table 1: Overview of Authority Ranking Methods

3. PRELIMINARY EXPERIMENTS

3.1 Data collectionAs datasets with positive and negative endorsements are

difficult to obtain outside the commercial search engine com-panies, we created our own data collections based on twodatasets with very different properties: an excerpt of theweb pages of the Wikipedia Encyclopedia, and a linkagegraph constituted by product data of Amazon.com.

WikipediaStarting from overview pages about geography, history, film,and music we crawled 72482 documents on a downloadeddump of Wikipedia to build a thematically concentrateddataset and indexed it by our own prototype search engine.For query session generation, we asked 18 volunteers, stu-dents with diverse backgrounds (law, psychology, intercul-tural communication, etc) to search our data collection. Weprovided some creativity help in the form of Trivial Pursuitquestions and asked the volunteers to concentrate on thecategories geography, history, and entertainment. But theywere still allowed to freely choose their queries or follow somepersonal interests to simulate real web search. Parsing thegenerated browser history files, we obtained 542 queries, 760query result clicks (implicit positive feedback), 290 queryrefinements and 1987 implicit negative feedback links. Weinterpreted each non-clicked document appearing above aclicked one as negative feedback, driven by the justificationthat the user saw the summary snippets of these pages andintentionally skipped them. For evaluation, we chose 14queries (see Table 2) at random from a set of queries thathad been posed by users during query session generationand for which textual-similarity based retrieval yielded re-sult sets of size at least 50. 10 out of the 14 queries werealso associated with negative assessments.

AmazonWith the help of the Amazon E-Commerce web service weconstructed a graph similar in structure to the enhanced webgraph we obtained from the Wikipedia data. We distinguishtwo node types, items and customers, with the latter corre-sponding conceptually to the previous notion of queries. Weestablish a link from item A to item B whenever B is said tobe similar to A, i.e., customers who bought A also bought B.Furthermore a customer reviewing a particular item is rep-resented by a link which is associated with a positive rewardfor a rating greater than three stars, and a negative rewardfor ratings of less than three stars. Ratings of exactly three

Query PageR

ank

QR

ank

β=

1.0

QLoop∗

β=

1.0

δ=

0.3

QR

ew

ard

β=

1.0

α=

0.2

QD

iscoun.

α=

0.8

Uni.

Jum

pǫ

=0.5

birthplace mozart 0.51 0.65 0.68 0.64 0.64 0.53brazil cities 0.37 0.43 0.4 0.48 0.39 0.39political system ofchina

0.35 0.57 0.65 0.53 0.42 0.34

free elections ger-man democratic re-public

0.13 0.18 0.2 0.16 0.33 0.14

Egypt pyramids 0.2 0.55 0.56 0.55 0.65 0.25Napoleon exile 0.82 0.7 0.72 0.79 0.49 0.85Harrison Ford movie 0.16 0.41 0.41 0.32 0.35 0.19French wine 0.83 0.39 0.37 0.39 0.33 1John Paul II 0.25 0.5 0.5 0.5 0.5 0.25official language Sin-gapore

0.27 0.29 0.29 0.33 0.5 0.33

last play by Shake-speare

0.8 0.81 0.8 0.8 0.79 0.81

Nelson Mandelaprison

0.67 0.71 0.78 0.71 0.69 0.66

firefighter New York 0.25 0.23 0.25 0.18 0.45 0.26constitutionalsupreme court

0.73 0.73 0.73 0.8 0.81 0.63

Table 2: MAPs of evaluation queries on Wikipedia

stars result in neutral links as well. In that manner we con-structed a graph of 247688 items, 607663 customers, 1258487neutral, 912775 positive and 138813 negative reward links.

3.2 MethodoloyQuery result rankings are derived as follows. For each

query, we construct a seed set consisting of the top-50 queryresults solely based on textual similarity. For the Wikipediadataset these are retrieved according to Okapi BM25 [9],whereas Amazon builds on textual similarity scores on theeditorial reviews of products, with scores computed by theOracle Text product (which we used as a backend in ourimplementation). The query results are then re-ordered ac-cording to our pre-computed ranking schemes, and in caseof ties we fall back to the text-based scoring. As qualityassessments are usually sparse, we vary the graphs on whichrankings are to be computed to strengthen the influence ofquality judgements by means of back-links. These are neu-tral reversed links of rating-carrying links in E− ∪E+. Thatway we improve the reachability of nodes in S+ ∪S− whichare with our current datasets often solely reachable via ran-dom jumps.

3.3 ResultsFor evaluation of search result quality, we computed the

top-15 result rankings on Wikipedia, presented 8 volunteersan unordered list of URLs occuring in at least one resultranking for the given query and asked them to mark the rel-evant ones (possibly after consulting the linked result page).We had each query evaluated by 3 different users and tooktheir majority vote as the final relevance assessment. Thatway the obtained relevance assessments are consistent overall evaluated rankings. To account for the ranks at whichrelevant documents occur in a ranking, we chose to com-pute the mean average precision (MAP) of each query thatis sensitive to re-orderings in the result set, and defined as

15∑

r=1

precision@r ∗ rel(r)

#relevant docs

PageRank QRank β = 1.0 QRank β = 0.5

MAP 0.4511 0.5099 0.5136Deviation 0.09 0.08 0.08

QLoop∗

β 1.0 0.5δ 0.15 0.3 0.4 0.15 0.3 0.4

MAP 0.52 0.53 0.53 0.52 0.52 0.52Deviation 0.01 0.08 0.06 0.03 0.03 0.01

Normalized QLoop∗

β 1.0 0.5δ 0.15 0.3 0.4 0.15 0.3 0.4

MAP 0.52 0.53 0.53 0.52 0.52 0.52Deviation 0.08 0.03 0.04 0.08 0.08 0.04

BS-Jump

weighted uniform normalized

β 1.0 1.0 0.5 0.5 -ǫ 0.25 0.5 0.25 0.5 0.25 0.5 0.25 0.5

MAP 0.46 0.47 0.46 0.47 0.46 0.47 0.46 0.47Deviation 0.06 0.04 0.06 0.06 0.03 0.09 0.03 0.06

Markov reward model

α 0.2 0.4 0.5 0.6 0.8

QReward β = 0.5

MAP 0.49 0.48 0.48 0.47 0.51Deviation 0.002 0.02 0.09 0.09 0.002

QReward β = 1.0

MAP 0.51 0.49 0.49 0.48 0.52Deviation 0.1 0.0001 0.05 0.03 0.04

QDiscounter β = 0.5

MAP 0.49 0.49 0.52 0.52 0.52Deviation 0.05 0.002 0.005 0.07 0.007

QDiscounter without back-links β = 0.5

MAP 0.53 0.53 0.54 0.55 0.56Deviation 0.0004 0.08 0.005 0.03 0.04

Table 3: Average MAP/std. deviation on Wikipedia

where r denotes the rank, and rel(r) indicates whether thedocument at rank r is a relevant one.

Table 2 depicts the resulting MAP values for each queryevaluated on the Wikipedia dataset and some representa-tive ranking schemes. The averaged MAP values as wellas the standard deviation across queries for each consideredmethod are depicted in Table 3. QLoop∗ achieves improve-ments over both PageRank and QRank regardless of the val-ues we chose for β and δ. The MAP values of the normalizedvariant of QLoop∗ coincide with those of the non-normalizedversion, indicating that normalization plays a minor role forranking. The family of behavior-sensitive random jumpsoutperforms PageRank, but does not reach the performanceof QRank. Coding ratings inside the random jump vectorseems to have little effect, even under extreme parametersettings. The Markov reward model and its approximationare the most promising approaches with QDiscounter yield-ing significant gains in MAP compared to all other methods.QDiscounter achieved MAP values around 55 percent acrossthe spectrum of choices for α, compared to 51 percent forQRank and merely 45 percent for standard PageRank. In-terestingly, QDiscounter did not benefit from the introduc-tion of back-links, but showed better results with the normalgraph structure. This remains to be further investigated ondifferent datasets.

Table 4 shows top-5 result rankings (titles of Wikipediapages) for the query ”Political system of China”. To bet-ter understand the effects observed, Table 3.3 lists the doc-uments in the top-50 result set based on textual similar-

PageRank QRank β = 1

China ChinaPeople’s Republic of China One country - two systems

List of countries People’s Republic of ChinaCounty Communist state

Chinese language List of countries

QLoop∗β = 1, δ = 0.3 QReward β = 1, α = 0.2

One country - two systems One country - two systemsChina China

People’s Republic of China PrisonCommunist state List of countriesList of countries Communist state

QDiscounter

α = 0.4 α = 0.8

China PrisonPrison Communist state

Communist state Party disciplineParty discipline One country - two systems

One country, two systems China

BS-Jump uniform

ǫ = 0.25 ǫ = 0.5

China ChinaPeople’s Republic of China County

County People’s Republic of ChinaList of countries Hong KongChinese language Chinese language

Table 4: Top-5 for ”Political System of China”

PositiveOne country - two systems,

Prison, Communist state, Party discipline,

Negative

People’s Republic of China, China, Vice President,Chinese language, Mandarin linguistics,

Clash of Civilizations, Galileo positioning system

Table 5: Pos./neg. rewarded docs of ”Pol. Sys. of China”

ity that received positive or negative long-run average re-wards due to the implicit feedback obtained from query-logs.When comparing these two tables we see the different ex-tents to which the proposed methods combine endorsementswith standard link analysis.

Table 6 shows top-5 rankings of books computed on theAmazon dataset for the Google Zeitgeist query ”mountainbike”. In contrast to the Wikipedia dataset where query-logdata was sparse, Amazon offers a larger amount of ratingdata, and thus a more balanced ratio of customer to itemnodes. Again we observe a varying strength with whichratings are incorporated ranging from the behavior-sensitivejumps which are closest to PageRank, over QRank andQLoop∗ to the Markov reward model approaches which showthe most significant changes. The way our behavior-sensitiveapproaches favor specialized books on mountain bikes over”traditional” authorities on the subject of traveling, like the”Fodor’s” series, shows their effectiveness.

4. CONCLUSIONWe presented three novel algorithmic frameworks to in-

corporate additional user assessments into web link analysis,and underpinned their potential by preliminary experimentson two datasets. Currently, we are acquiring larger datasetswith a broad spectrum of user assessments to further inves-tigate the proposed algorithms.

5. REFERENCES[1] H. Cui, J.-R. Wen, J.-Y. Nie, W.-Y. Ma. Query

expansion by mining user logs. TKDE, 2003.

[2] R. Guha. Open rating systems. TR, Stanford, 2003.

PageRank

Fodor’s Prague and BudapestBobke II

Mountain Biking ColoradoZinn and the Art of Mountain Bike MaintenanceBicycling Magazine’s Complete Guide to Bicycle

Maintenance and Repair for Road and Mountain Bikes

QRank β = 1

Zinn and the Art of Mountain Bike MaintenanceBobke II

Bicycling Magazine’s Complete Book of Road Cycling SkillsBicycling Magazine’s Complete Guide to Bicycle

Maintenance and Repair for Road and Mountain BikesFodor’s Prague and Budapest

QLoop∗β = 1, δ = 0.3



Maintenance and Repair for Road and Mountain BikesFodor’s Japan

QReward β = 1, α = 0.8



Maintenance and Repair for Road and Mountain BikesExploring the Black Hills and Badlands:

A Guide for Hikers, Cross-Country Skiers, & Mountain Bikers

QDiscounter α = 0.4

Mountain Bike! Southern Utah: A Guide to the Classic TrailsFodor’s Prague and Budapest

Bobke IIZinn and the Art of Mountain Bike Maintenance

Mountain Biking Colorado

BS-Jump uniform ǫ = 0.25

Fodor’s Prague and BudapestBobke II

Zinn and the Art of Mountain Bike MaintenanceBicycling Magazine’s Complete Guide to Bicycle

Maintenance and Repair for Road and Mountain BikesMountain Biking Colorado

Table 6: Top-5 rankings for ”mountain bike” on Amazon

[3] R. Guha, R. Kumar, P. Raghavan, A. Tomkins.Propagation of trust and distrust. WWW, 2004.

[4] T. H.Haveliwala. Topic-sensitive pagerank: Acontext-sensitive ranking algorithm for web search.TKDE, 2003.

[5] T. Joachims. Optimizing search engines usingclickthrough data. KDD, 2002.

[6] J. Luxenburger, G. Weikum. Query-log based authorityanalysis for web information search. WISE, 2004.

[7] P. Massa, C. Hayes. Page-rerank: using trusted links tore-rank authority. TR, ITC - IRST, 2005.

[8] L. Page, S. Brin, R. Motwani, T. Winograd. Thepagerank citation ranking: Bringing order to the web.TR, Stanford, 1998.

[9] S. E. Robertson, S. Walker. Some simple effectiveapproximations to the 2-poisson model for probabilisticweighted retrieval. SIGIR, 1994.

[10] X. Shen, B. Tan, C. Zhai. Context-sensitiveinformation retrieval using implicit feedback. SIGIR,2005.

[11] H. C. Tijms. A First Course in Stochastic Models.John Wiley & Sons, 2003.

[12] J.-R. Wen, J.-Y. Nie, H.-J. Zhang. Query clusteringusing user logs. TIS, 2002.

Vision-based Web Data Records Extraction

Wei Liu, Xiaofeng MengSchool of Information

Renmin University of ChinaBeijing, 100872, China

gue2, [email protected]

Weiyi MengDept. of Computer Science

SUNY at BinghamtonBinghamton, NY 13902

[email protected]

ABSTRACTThis paper studies the problem of extracting data records onthe response pages returned from web databases or searchengines. Existing solutions to this problem are based pri-marily on analyzing the HTML DOM trees and tags of theresponse pages. While these solutions can achieve good re-sults, they are too heavily dependent on the specifics ofHTML and they may have to be changed should the re-sponse pages are written in a totally different markup lan-guage. In this paper, we propose a novel and language in-dependent technique to solve the data extraction problem.Our proposed solution performs the extraction using onlythe visual information of the response pages when they arerendered on web browsers. We analyze several types of vi-sual features in this paper. We also propose a new mea-sure revision to evaluate the extraction performance. Thismeasure reflects perfect extraction ratio among all responsepages. Our experimental results indicate that this vision-based approach can achieve very high extraction accuracy.

KeywordsWeb DB, response page, data record

1. INTRODUCTIONThe World Wide Web has close to one million searchable

information sources according to a recent survey[1]. Thesesearchable information sources include both search enginesand Web databases. By posting queries to the search inter-faces of these information sources, useful information fromthem can be retrieved. Often the retrieved information(query results) is wrapped on response pages returned bythese systems in the form of data records, each of whichcorresponds to an entity such as a document or a book.Data records are usually displayed visually neatly on Webbrowsers to ease the consumption of human users. In Figure1, a number of book records are listed on a response pagefrom Amazon.com.

Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee.Copyright 200X ACM X-XXXXX-XX-X/XX/XX ... $5.00.

Figure 1: A response page from Amazon

However, to make the retrieved data records machine pro-cessable, which is needed in many applications such as deepweb crawling and metasearching, they need to be extractedfrom the response pages. In this paper, we study the prob-lem of automatically extracting the data records from theresponse pages of web-based search systems.

The problem of web data extraction has received a lotof attention in recent years[2][5][6][7][8]. The existing solu-tions are mainly based on analyzing the HTML source filesof the response pages. Although they can achieve reason-ably high accuracies in the reported experimental results,the current studies of this problem have several limitations.First, HTML-based approaches suffer from the followingproblems: (1) HTML itself is still evolving and when newversions or new tags appear, the previous solutions will haveto be amended repeatedly to adapt to new specifications andnew tags. (2) Most previous solutions only considered theHTML files that do not include scripts such as JavaScriptand CSS. As more and more web pages use more complexJavaScript and CSS to influence the structure of web pages,the applicability of the existing solutions will become lower.(3) If HTML is replaced by a new language in the future,then previous solutions will have to be revised greatly oreven abandoned, and other approaches must be proposed toaccommodate the new language. Second, traditional perfor-mance measures, precision and recall, do not fully reflect thequality of the extraction. Third, most performance studiesused small data sets, which is inadequate in assuring theimpartiality of the experimental results.

There are already some works [9][12] that analyze the lay-out structure of web pages. They try to effectively repre-sent and understand the presentation structure of web pages,which are physical structure independent. But the researchon vision-based web data extraction is still at its infancy. Itis well known that web pages are used to publish informa-tion for humans to browse, and not designed for computersto extract information automatically. Based on such con-sideration, in this paper we propose a novel approach to

extract data records automatically based on the visual rep-resentation of web pages. Like [8][7], our approach also aimsat the response pages that have multiple data records. Ourapproach employs a three-step strategy to achieve this ob-jective. First, given a response page, transform it into aVisual Block tree based on its visual representation; second,discover the region (data region) which contains all the datarecords in the Visual Block tree; third, extract data recordsfrom the data region.

This paper has the following contributions:1. We believe this is the first work that utilizes only the

visual content features on the response page as displayed ona browser to extract data records automatically.

2. A new performance measure, revision, is proposedto evaluate the approaches for web data extraction. Themeasure revision is the percentage of the web sites whoserecords cannot be perfectly extracted (i.e., at least one of theprecision and recall is not 100%). For these sites, manualrevision of the extraction rules is needed.

3. A data set of 1,000 web databases and search engines isused in our experiment study. This is by far the largest dataset used in similar studies (previous works seldom used 200sites). Our experimental results indicate that our approachis very effective.

2. RELATED WORKSUntil now, many approaches have been reported in the

literature for extracting information from Web pages. Re-cently, many automatic approaches [5][6][7][8] have been pro-posed instead of manual approaches [2] and semi-automaticapproaches [3] [4]. For example, [6] find patterns or gram-mars from multiple pages in HTML DOM trees containingsimilar data records, and they require an initial set of pagescontaining similar data records. In [5], a string matchingmethod is proposed, which is based on the observation thatall the data records are placed in a specific region and thisis reflected in the tag tree by the fact that they share thesame path in DOM tree. The method DEPTA[7] used treealignment instead of tag strings, which exploits nested treestructures to perform more accurate data extraction, so itcan be considered as an improvement of MDR[8]. The onlyworks that we are aware of that utilize some visual informa-tion to extract data records are [13][14]. However, in theseapproaches, tag structures are still the primary informationutilized while visual information plays a small role. For ex-ample, in [13], when the visual information is not used, therecall and precision decrease by only 5%. In contrast, inthis paper, our approach performs data record extractioncompletely based on visual information.

Although the works discussed above applied different tech-niques and theories, they have a common characteristic:they are all implemented based on HTML DOM trees andtags by parsing the HTML documents. In Section 1, wediscussed the latent and inevitable limitations of them.

Since web pages are used to publish information for hu-mans to browse and read, the desired information we wantextracted must be visible, so the visual features of web pagescan be very helpful for web information extraction. Cur-rently, some works are proposed to process web pages basedon their visual representation. For example, a web pagesegmentation algorithm VIPs is proposed in [9] which sim-ulates how a user understands web layout structure basedon his/her visual perception. Our approach is implemented

based on VIPs. [10] is proposed to implement link analysisbased on the layout and visual information of web pages.Until now, the layout and visual information is not effec-tively utilized to extract structural web information, and itis only considered as a heuristic accessorial means.

3. INTERESTING VISUAL OBSERVATIONSFOR RESPONSE PAGES

Web pages are used to publish information on the Web.To make the information on web pages easier to understand,web page designers often associate different types of informa-tion with distinct visual characteristics (such as font, color,layout, etc.). As a result, visual features are important foridentifying special information on Web pages.

Response pages are special web pages that contain datarecords retrieved from Web information sources, and thedata records contained in them also have some interestingdistinct visual features according to our observation. Belowwe describe the main visual features our approach uses.

Position Features (PF): These features indicate the lo-cation of the data region on a response page.

• PF1: Data regions are always centered horizontally.

• PF2: The size of the data region is usually large rela-tive to the area size of the whole page.

Though web pages are designed by different people, thesedesigners all have the common consideration in placing thedata region: the data records are the contents in focus on re-sponse pages, and they are always centered and conspicuouson web pages to catch the user’s attention. By investigatinga large number of response pages, we found two interestingfacts. First, data regions are always located in the middlesection horizontally on response pages. Second, the size ofa data region is usually large when there are enough datarecords in the data region. The actual size of a data regionmay change greatly for different systems because it is notonly influenced by the number of data records retrieved butalso by what information is included in each data record,which is application dependent. Therefore, our approachdoes not use the actual size, instead it uses the ratio of thesize of the data region to the size of whole response page.

Layout Features (LF): These features indicate how thedata records in the data region are typically arranged.

• LF1: The data records are usually aligned flush left inthe data region.

• LF2: All data records are adjoining.

• LF3: Adjoining data records do not overlap, and thespace between any two adjoining records is the same.

The designers of web pages always arrange the data recordsin some format in order to make them visually regular. Theregularity can be presented by one of the two layout models.

In Model 1, The data records are arrayed in a single col-umn evenly, though they may be different in width andheight. LF1 implies that the data records have the samedistance to the left boundary of the data region. In Model2, data records are arranged in multiple columns, and thedata records in the same column have the same distanceto the left boundary of the data region. In addition, datarecords do not overlap, which means that the regions of dif-ferent data records can be separated. Based on our observa-tion, the response pages of all search engines follow Model 1while the response pages of web databases may follow either

of the two models. Model 2 is a little bit more complicatedthan Model 1 in layout, and it can be processed with someextension to the techniques used to process Model 1. Inthis paper, we focus on dealing with Model 1 due to thelimitation of paper length.

We should note that feature LF1 is not always true assome data records on certain response pages of some sites(noticeably Google) may be indented. But the indented datarecords and the un-indented ones have very similar visualfeatures. In this case, all data records that satisfy Model 1are identified first, and then the indented data records areextracted utilizing the knowledge obtained from un-indenteddata records that have already been identified.

Appearance Features (AF): These features capture thevisual features within data records.

• AF1: Data records are very similar in their appear-ances, and the similarity includes the sizes of the im-ages they contain and the fonts they use.

• AF2: Data contents of the same type in different datarecords have similar presentations in three aspects:size of image, font of plain text and font of link text(The font of text is determined by font-size, font-color,font-weight and font-style).

Data records usually contain three types of data contents,i.e., images, plain texts (the texts without hyperlinks) andlink texts (the texts with hyperlinks). Table 1 shows theinformation on the three aspects of data records in Figure1, and we can find that the four data records are very closeon the three aspects.

Our data record extraction solution is developed mainlybased on the above three types of visual features. Fea-ture PF is used to locate the region containing all the datarecords on a response page; feature LF and feature AF arecombined together to extract the data records accurately.

Content Feature (CF): These features hint the regular-ity of the contents in data records.

• CF1: All data records have mandatory contents andsome may have optional contents.

• CF2: The presentation of contents in a data recordfollows a fixed order.

The data records are the entities in real world, and theyconsist of data units with different semantic concepts. Thedata units can be classified into two kinds: mandatory andoptional. Mandatory units are those that must appear ineach data record. For example, if every book data recordmust have a title, then titles are mandatory data units.In contrast, optional units may be missing in some datarecords. For example, discounted price for products in e-commerce web sites is likely an optional unit because someproducts may not have discount price.

4. WEB DATA RECORD EXTRACTION

Figure 2: The content structure (a) and its VisualBlock tree (b)

Based on the visual features introduced in the previoussection, we propose a vision-based approach to extract datarecords from response pages. Our approach consists of threemain steps. First, use the VIPs [9] algorithm to constructthe Visual Block tree for each response page. Second, lo-cate the data region in the Visual Block tree based on thePF features. Third, extract the data records from the dataregion based on the LF and AF features.

4.1 Building Visual Block treeThe Vision-based Page Segmentation (VIPs) algorithm

aims to extract the content structure of a web page basedon its visual presentation. Such content structure is a treestructure, and each node in the tree corresponds to a rectan-gular region on a web page. The leaf blocks are the blocksthat cannot be segmented further, and they represent theminimum semantic units, such as continuous texts or im-ages. There is a containment relationship between a parentnode and a child node, i.e., the rectangle corresponding to achild node is contained in the rectangle corresponding to theparent node. We call this tree structure Visual Block treein this paper. In our implementation we adopt the VIPS al-gorithm to build a Visual Block tree for each response page.Figure 2(a) shows the content structure of the response pageshown in Figure 1 and Figure 2(b) gives its correspondingVisual Block tree. Actually, Visual Block tree is more com-plicated than what Figure 2 shows (there are often hundredseven thousands of blocks in a Visual Block tree).

For each block in the Visual Block tree, its position (theposition on response page) and its size (width and height)are logged. The leaf blocks can be classified into three kinds:image block, plain text block and link text block, whichrepresent three kinds of information in data records respec-tively. If a leaf block is a plain text block or a link textblock, the font information is attached to it.

4.2 Data region discoveryPF1 and PF2 indicate that the data records are the pri-

mary content on the response pages and the data regionis centrally located on these pages. The data region corre-sponds to a block in the Visual Block tree (in this paper weonly consider response pages that have only a single dataregion). We locate the data region by finding the blockthat satisfies the two PF features. Each feature can beconsidered a rule or a requirement. The first rule can beapplied directly, while the second rule can be represented

Figure 3: A general case of data region

by (areab/arearesponsepage) ≥ Tdataregion, where areab isthe area of block b, arearesponsepage is the area of the re-sponse page, and Tdataregion is the threshold used to judgewhether b is sufficiently large relative to arearesponsepage.The threshold is trained from sample response pages col-lected from different real web sites. For the blocks thatsatisfy both rules, we select the block at the lowest level inthe Visual Block tree.

4.3 Data records extraction from data regionIn order to extract data records from the data region ac-

curately, two facts must be considered. First, there may beblocks that do not belong to any data record, such as the sta-tistical information (about 2,038 matching results for java)and annotation about data records (1 2 3 4 5 [Next]). Theseblocks are called noise blocks in this paper. According toLF2, noise blocks cannot appear between data records andthey can only appear at the top or the bottom of the dataregion. Second, one data record may correspond to one ormore blocks in the Visual Block tree, and the total numberof blocks one data record contains is not fixed. For example,in Figure 1, “Buy new” price exists in all four data records,while “Used & new” price only exists in the first three datarecords. Figure 3 shows an example of a data region thathas the above problems: Block B1 (statistical information)and B9 (annotation) are noise blocks; there are three datarecords (B2 and B3 form data record 1; B4, B5 and B6form data record 2; B7 and B8 form data record 3), and thedashed boxes are the boundaries of data records.

This step is to discover the boundary of data records basedon the LF and AF features. That is, we attempt to de-termine which blocks belong to the same data record. Weachieve this with the following three sub-steps: Sub-step1:Filter out some noise blocks; Sub-step2: Cluster the remain-ing blocks by computing their appearance similarity; Sub-step3: Discover data record boundary by regrouping blocks.

4.3.1 Noise blocks filteringBecause noise blocks are always at the top or bottom, we

check the blocks located at the two positions according toLF1. If a block is not aligned flush left, it will be removedfrom the data region as a noise block. In this sub-step, wecannot assure all noise blocks are removed. For example, inFigure 3, block B9 can be removed in this sub-step, whileblock B1 cannot be removed.

4.3.2 Blocks clustering

The remaining blocks in the data region are clusteredbased on their appearance similarity. Since there are threekinds of information in data records, i.e., images, plain textand link text, the appearance similarity of blocks is com-puted from the three aspects. For images, we care aboutthe size; for plain text and link text, we care about theshared fonts. Intuitively, if two blocks are more similar onimage size, font, they should be more similar in appearance.The appearance similarity formula between two blocks B1and B2 is given below:

sim(B1, B2) = wi × simIMG(B1, B2)

+wpt × simPT (B1, B2) + wlt × simLT (B1, B2)

where simIMG(B1, B2) is the similarity based on imagesize, simPT(B1, B2) is the similarity on plain text font, andsimLT(B1, B2) is the similarity on link text font. And wi, wpt

and wlt are the weights of these similarities, respectively.Table 2 gives the formulas to compute the component simi-larities and the weights in different cases. The weight of onetype of contents is proportional to their total size relative tothe total size of the two blocks.

A simple one-pass clustering algorithm is applied. Thebasic idea of this algorithm is as follows. First, starting froman arbitrary order of all the input blocks, take the first blockfrom the list and use it to form a cluster. Next, for each ofthe remaining blocks, say B, compute its similarity with eachexisting cluster. Let C be the cluster that has the maximumsimilarity with A. If sim(B, C)> Tas for some threshold Tas,which is to be trained by sample pages (generally, Tas isset to 0.8), then add B to C; otherwise, form a new clusterbased on B. Function sim(B, C) is defined to be the averageof the similarities between B and all blocks in C computedusing the Formula above.

As an example, by applying this method to the blocks inFigure 1, the blocks containing the titles of the data recordsare clustered together after clustering, so are the prices ofdata records and other contents.

4.3.3 Blocks regroupingIn 4.3.2, the blocks in the data region are grouped into

several clusters. However, these clusters do not correspondto data records. On the contrary, the blocks in the samecluster likely all come from different data records. Accordingto AF2, the blocks in the same cluster have the same typeof contents of the data records.

The blocks in the data region are regrouped, and theblocks belonging to the same data record form a group. Thisregrouping process has the following three phases:

Phase 1. For each cluster Ci, obtain its minimum-boundingbox Ri, which is the smallest rectangle on the response pagethat can enclose all the blocks in Ci. We get the same num-ber of boxes as the clusters. Reorder the blocks in Ci fromtop to bottom according to their positions in web browser.Thus, Bi,j is above Bi,j+1 on web browser.

Phase 2. Suppose Cmax is the cluster with the maximumnumber of blocks. If there are multiple such clusters, selectthe one whose box is positioned higher than the others on theweb browser (here “higher position” is based on the highestpoint in each block). Let the number of blocks in Cmax ben. Each block in Cmax forms an initial group. So there aren initial groups (G1, G2, , Gn) with each group Gk havingonly one block Bmax,k.

Phase 3. For each cluster Ci, if Ri overlaps with Rmax on

the web browser, process all the blocks in Ci. If Ri is lower(higher) than Rmax, then for each block Bi,j in Ci, find thenearest block Bmax,k in Cmax that is higher (lower) thanBi,j and put Bi,j into Gk. When all clusters are processed,each group is a data record.

The basic idea of the process is as follows. Accordingto LF2 and LF3, no noise block can appear between datarecords, and its corresponding box will not overlap with oth-ers. So the boxes that overlap with others enclose all theblocks that belong to data records. In sub-step2 (section4.3.2), the blocks containing the data contents of the sametype will be in the same cluster (e.g., for book records, theblocks containing titles will be clustered together). Accord-ing to CF1, if a cluster has the maximum number of blocks,then the blocks in this cluster are the mandatory contents indata records, and the number of blocks in it is the numberof data records. If there is more than one such cluster, weselect one as Cmax (generally, the one whose box is higherthan the others on the web browser is selected). We selectthe blocks in Cmax as the seeds of the data records, andeach block forms an initial group. In each initial group Gk,there is only one block Bmax,k. Then we try to put theblocks in other clusters into the right groups. That meansif a block Bi,j (in Ci, Ci is not Cmax) and a block Bmax,k

(in Cmax) are in the same data record, then Bi,j should beput into the group Bmax,k belongs to. In another word, theblocks in the same data record are also in the same group.According to LF3, no two adjoining data records overlap.So for Bmax,k in Cmax, the blocks that belong to the samedata record with Bmax,k must be below Bmax,k−1 and aboveBmax,k+1. For each Ci, if Ri is lower (higher) than Rmax,then the block on top of Ri is lower (higher) than the blockon top of Rmax. According to CF2, this determines Bi,j

is lower (higher) than Bmax,k if they belong to the samedata record. So we can conclude that, if Bmax,k is the near-est block higher (lower) than Bi,j , then Bi,j is put into thegroup Bmax,k belongs to.

5. EXPERIMENTSWe have built an operational prototype system based on

our method, and we evaluate it in this section. This proto-type system is implemented with C# on a Pentium 4 2GHPC. For response pages with no more than 20 data records,the whole process takes no more than 3 seconds.

5.1 Data setMost previous works on web data extraction conducted

experimental evaluations on relatively small data sets, andas a result, the experimental results are often not very reli-able. Sometimes, the same system/approach yields very dif-ferent experimental results depending on the data sets used(e.g., see the experimental comparisons reported in [8][13]about three approaches). In general, there are two reasonsthat may lead to this situation: first, the size of the dataset used is too small, and second, the data set used is notsufficiently representative of the general situation.

In this paper, we use a much larger data set than thoseused in other similar studies to avoid the problems men-tioned above. Our data set is collected from the Complete-planet web site (www.completeplanet.com). Complete-planetis currently the largest depository for deep web, which hascollected the search entries of more than 70,000 web databasesand search engines. These search systems are organized un-der 43 topics covering all the main domains in real world.We select 1,000 web sites from these topics (the top 10 to 30web sites in each topic). During our selection, duplicates un-der different topics are not used. In addition, web sites thatare powered by well-known search engines such as Googleare not used. This is to maximize the diversity among theselected web sites. For each web site selected, we get atleast five response pages by submitting different queries toreduce randomness. Only response pages containing at leasttwo data records are used. In summary, our data set is muchlarger and more diverse than any data set used in relatedworks. We plan to make the data set publicly available inthe near future.

5.2 Performance measuresTwo measures, precision and recall, are widely used to

measure the performance of data record extraction algo-rithms in published literatures. Precision is the percentageof correctly extracted records among all extracted recordsand recall is the percentage of correctly extracted recordsamong all records that exist on response pages. In our exper-iments, a data record is correctly extracted only if anythingin it is not missed and anything not in it is not included.

Besides precision and recall, there is an important mea-sure neglected by other researchers. It is the number of websites with perfect precision and recall, i.e., both precisionand recall are 100% at the same time. This measure has a

great meaning for web data extraction in real applications.We give a simple example to explain this. Suppose thereare three approaches (A1, A2 and A3) which can extractdata records from response pages, and they use the samedata set (5 web sites, 10 data records in each web site). A1extracts 9 records for each site and they are all correct. Sothe average precision and recall of A1 are 100% and 90%,respectively. A2 extracts 11 records for each site and 10are correct. So the average precision and recall of A2 are90.9% and 100%, respectively. A3 extracts 10 records for4 of the 5 sites and they are all correct. For the 5th site,A3 extracts no records. So the average precision and recallof A3 are both 80%. Based on average precision and recall,A1 and A2 are better than A3. But in real applications A3may be the best choice. The reason is that in order to makeprecision and recall 100%, A1 and A2 have to be manuallytuned/adjusted for each web site, while A3 only needs to bemanually tuned for one web site. In other words, A3 needsthe minimum manual intervention.

In this paper we propose a new measure called revision .Its definition is given below.

revision =WSt −WSc

WSt

where WSc is the total number of web sites whose preci-sion and recall are both 100%, and WSt is total number ofweb sites processed. This measure represents the degree ofmanual intervention required.

5.3 Experimental resultsWe evaluate our prototype system ViDRE and compare it

with MDR. We choose MDR based on two considerations:first, it can be downloaded from web site and can run lo-cally; second, it is very similar to ViDRE (a single page ata time; data extracted at record level). MDR has a simi-larity threshold, which is set at default value (60%) in ourtest, based on the suggestion of the authors of MDR. OurViDRE also has a similarity threshold, which is set at 0.8.We show the experimental results in Table 3.

From Table 3, we can draw two conclusions. First, theperformance of ViDRE is very good. That means vision-based approach can also reach a high accuracy (precisionand recall). Second, ViDRE is much better than MDR onrevision. MDR has to be revised for nearly half of the websites tested, while ViDRE only need to be revised for lessthan one eighth of these sites.

6. CONCLUSION AND FUTURE WORKIn this paper, we presented a fully automated technique

to extract search result data records from response pages

dynamically generated by search engines or Web DBs. Ourtechnique utilizes only the visual content features on theresponse page, which is HTML language or any other lan-guage independent. This differentiates our technique fromother competing techniques for similar applications. Ourexperimental results on a large data set indicate that ourtechnique can achieve high extraction accuracy.

In the future, we plan to address several issues and im-prove our vision-based approach further. First, if there isonly one data record on a response page, our approach willfail. We intend to tackle this problem by comparing multi-ple response pages from one web site. Second, data recordextraction is slow when the number of data records is large(say more than 50). We plan to look into the issue of improv-ing the efficiency of our approach. Third, we plan to collecta set of response pages from real web sites which are notdesigned with HTML, and show our vision-based approachis really language independent.

7. ACKNOWLEDGMENTSThis research was partially supported by the grants from

the NSFC under grant number 60573091, 60273018, ChinaNational Basic Research and Development Program’s Se-mantic Grid Project (No. 2003CB317000), the Key Projectof Ministry of Education of China under Grant No.03044,Program for New Century Excellent Talents in University(NCET), and US NSF grants IIS-0414981 and CNS-0454298.

8. REFERENCES[1] K. Chang, B. He, C. Li, M. Patel, and Z. Zhang. Structured

Databases on the Web: Observations and Implications. InSIGMOD Record, 33(3), pages 61-70, 2004.

[2] G. O. Arocena, A. O. Mendelzon. WebOQL: RestructuringDocuments, Databases, and Webs. In ICDE, pages 24-33,1998.

[3] X. Meng, H. Lu, H. Wang. SG-WRAP: A Schema-GuidedWrapper Generation. In ICDE, pages 331-332, 2002.

[4] R. Baumgartner, S. Flesca, G. Gottlob. Visual WebInformation Extraction with Lixto. In VLDB , pages119-128, 2001.

[5] C. Chang, S. Lui. IEPAD: Information extraction based onpattern discovery. In WWW, pages 681-688, 2001.

[6] V. Crescenzi, G. Mecca, P. Merialdo. Roadrunner: Towardsautomatic data extraction from large web sites. In VLDB,pages 109-118, 2001.

[7] Y. Zhai, B. Liu. Web data extraction based on partial treealignment. In WWW, pages 76-85, 2005.

[8] B. Liu, R. L. Grossman, Yanhong Zhai. Mining data recordsin Web pages. In KDD, pages 601-606, 2003.

[9] D. Cai, S. Yu, J. Wen, W. Ma. Extracting Content Structurefor Web Pages Based on Visual Representation. In APWeb,pages 406-417, 2003.

[10] D. Cai, X. He, J. Wen, W. Ma. Block-level link analysis. InSIGIR, pages 440-447, 2004.

[11] D. Cai, X. He, Z. Li, W. Ma, J. Wen. Hierarchical clusteringof WWW image search results using visual, textual and linkinformation. In ACM Multimedia, pages 952-959, 2004.

[12] X. Gu, J. Chen, W. Ma, G. Chen. Visual Based ContentUnderstanding towards Web Adaptation. In AH, pages164-173, 2002.

[13] H. Zhao, W. Meng, Z. Wu, V. Raghavan, C. T. Yu. Fullyautomatic wrapper generation for search engines. In WWW,pages 66-75, 2005.

[14] K. Simon, G. Lausen. ViPER: Augmenting AutomaticInformation Extraction with Visual Perceptions. In CIKM,pages 381-388, 2005.

Answering Structured Queries on Unstructured Data

Jing LiuUniversity of Washington

Seattle, WA 98195

[email protected]

Xin DongUniversity of Washington

Seattle, WA 98195

[email protected]

Alon HalevyGoogle Inc.

Mountain View, CA 94022

[email protected]

ABSTRACTThere is growing number of applications that require access toboth structured and unstructured data. Such collections of datahave been referred to as dataspaces, and Dataspace Support Plat-forms (DSSPs) were proposed to offer several services over datas-paces, including search and query, source discovery and catego-rization, indexing and some forms of recovery. One of the key ser-vices of a DSSP is to provide seamless querying on the structuredand unstructured data. Querying each kind of data in isolationhas been the main subject of study for the fields of databasesand information retrieval. Recently the database community hasstudied the problem of answering keyword queries on structureddata such as relational data or XML data. The only combinationthat has not been fully explored is answering structured querieson unstructured data.

This paper explores an approach in which we carefully con-struct a keyword query from a given structured query, and submitthe query to the underlying engine (e.g., a web-search engine) forquerying unstructured data. We take the first step towards ex-tracting keywords from structured queries even without domainknowledge and propose several directions we can explore to im-prove keyword extraction when domain knowledge exists. Theexperimental results show that our algorithm works fairly wellfor a large number of datasets from various domains.

1. INTRODUCTIONSignificant interest has arisen recently in combining tech-

niques from data management and information retrieval [1,5]. This is due to the growing number of applications thatrequire access to both structured and unstructured data.Examples of such applications include data management inenterprises and government agencies, management of per-sonal information on the desktop, and management of dig-ital libraries and scientific data. Such collections of datahave been referred to as dataspaces [8, 10], and DataspaceSupport Platforms (DSSPs) were proposed to offer severalservices over dataspaces, including search and query, sourcediscovery and categorization, indexing and data recovery.

One of the key services of a DSSP is to provide seamlessquerying on the structured and unstructured data. Query-ing each kind of data in isolation has been the main subject

Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee.WebDB ’06 Chicago, Illinois USACopyright 200X ACM X-XXXXX-XX-X/XX/XX ...$5.00.

of study for the fields of databases and information retrieval.Recently the database community has studied the problemof answering keyword queries on structured data such asrelational data or XML data [11, 2, 4, 23, 12].

The only combination that has not been fully explored isanswering structured queries on unstructured data. Informat-ion-extraction techniques attempt to extract structure fromunstructured data such that structured queries can be ap-plied. However, such techniques rely on the existence ofsome underlying structure, so are limited especially in het-erogeneous environments.

This paper explores an approach in which we carefullyconstruct a keyword query from a given structured query,and submit the query to the underlying engine (e.g., a web-search engine) for querying unstructured data. We beginby describing the many ways our strategy can contribute insupporting query answering in a dataspace.

1.1 MotivationBroadly, our techniques apply in any context in which a

user is querying a structured data source, whereas there arealso unstructured sources that may be related. The usermay want the structured query to be expanded to includethe unstructured sources that have relevant information.

Our work is done in the context of the Semex PersonalInformation Management (PIM) System [6]. The goal ofSemex is to offer easy access to all information on one’sdesktop, with possible extension to mobile devices, importeddatabases, and the Web. The various types of data on one’sdesktop, such as emails and contacts, Latex and Bibtex files,PDF files, Word documents and Powerpoint presentations,and cached webpages, form the major data sources man-aged by Semex. On one hand, Semex extracts instancesand associations from these files by analyzing the data for-mats, and creates a database. For example, from Latex andBibtex files, it extracts Paper, Person, Conference, Journal

instances and authoredBy, publishedIn associations. On theother hand, these files contain rich text and Semex considersthem also as unstructured data.

Semex supports keyword search by returning the instanceswhose attributes contain the given keywords and the doc-uments that contain the keywords. In addition, Semex al-lows sophisticated users to compose structured queries todescribe more complex information needs. In particular,Semex provides a graphical user interface to help users com-pose triple queries, which in spirit are the same as SPARQLqueries [21] and describe the desired instances using a setof triples. Below is an example triple query asking for thepapers that cite Halevy’s Semex papers (note that users en-

Keyword Extraction

Query-graph Construction

SQL Queries XML Queries Triple Queries

Query Graph

Keyword Set

Figure 1: Our approach to keyword extraction. In the

first step we construct a query graph for the structured

input query, and in the second step we choose the node

labels and edge labels from the graph that best summa-

rize the query.

ter the queries through the user interface and never see thesyntax below):

SELECT $tFROM $pa1 as paper, $pa2 as paper, $pe as personWHERE $pa1 cite $pa2, $pa2 title ‘‘Semex’’

$pa2 author $pe, $pe name ‘‘Halevy’’$pa1 title $t

Ideally, the query engine should answer this type of queriesnot only on the database, but also on the unstructured datarepository. For example, it should be able to retrieve thePDF or Word documents that cite Halevy’s Semex papers,and retrieve the emails that mention such citations. Morebroadly, it should be able to search the Web and find suchpapers that do not reside in the personal data. The resultscan be given as file descriptions of the documents or links ofthe webpages, so the user can explore them further. This isthe first place keyword extraction is useful: we can extracta keyword set from the triple query, and perform keywordsearch on the unstructured data repository and on the web.

Continuing with this example, suppose the user has im-ported several databases from external data sources, in-cluding the XML data from the Citeseer repository, andthe technical-report data from the department relationaldatabase. When the user poses this triple query, she alsoexpects results to be retrieved from the imported data. Notethat although the imported data are structured, they do notshare the same schema with the database created by Semex.In addition, the mappings between the schemas are not givenand typically cannot be easily established. One possible so-lution is to transform the query into keyword search by key-word extraction. Then, by applying existing techniques onanswering keyword queries on structured data, Semex canretrieve relevant information even without schema mapping.

1.2 Our ContributionsIn this paper, we study how to extract keywords from

a structured query, such that searching the keywords onan unstructured data repository obtains the most relevantanswers. The goal is to obtain reasonably precise answerseven without domain knowledge, and improve the precisionif knowledge of the schema and the structured data is avail-able.

As depicted in Figure 1, the key element in our solution isto construct a query graph that captures the essence of thestructured query, such as the object instances mentioned inthe query, the attributes of these instances, and the associ-ations between these instances. With this query graph, wecan ignore syntactic aspects of the query, and distinguish

the query elements that convey different concepts. The key-word set is selected from the node and edge labels of thegraph.

Our algorithm selects attribute values and schema ele-ments that appear in the query (they also appear as nodelabels and edge labels in the query graph so are referredto as labels), and uses them as keywords to the search en-gine. When selecting the labels, we wish to include onlynecessary ones, so keyword search returns exactly the queryresults and excludes irrelevant documents. We base our se-lection on the informativeness and representativeness of alabel: the former measures the amount of information pro-vided by the label, and the latter is the complement of thedistraction that can be introduced by the label. Given aquery, we use its query graph to model the effect of a se-lected label on the informativeness of the rest of the labels.By applying a greedy algorithm, we select the labels withthe highest informativeness and representativeness.

In particular, our contributions are the following:

1. We propose a novel strategy to answering structuredqueries on unstructured data. We extract keywordsfrom structured queries and then perform keyword searchusing information retrieval.

2. We take a first step towards extracting keywords fromstructured queries even without domain knowledge,and propose several directions we can explore to im-prove keyword extraction when domain knowledge ex-ists. The experimental results show that our algorithmworks fairly well for a large number of datasets fromvarious domains.

This paper is organized as follows. Section 2 discussesrelated work. Section 3 defines the problem. Section 4 de-scribes our algorithm in selecting keywords from a givenquery. Finally, Section 5 presents experimental results andSection 6 concludes.

2. RELATED WORKThe Database community has recently considered how to

answer keyword queries on RDB data [11, 2, 4] and on XMLdata [23, 12]. In this paper, we consider the reverse direc-tion, answering structured queries on unstructured data.

There are two bodies of research related to our work: theinformation-extraction approach and the query-transformationapproach. Most information-extraction work [9, 18, 19, 20,15, 7, 3] uses supervised learning, which is hard to scale todata in a large number of domains and apply to the casewhere the query schema is unknown beforehand.

To the best of our knowledge, there is only one work,SCORE [17], considering transforming structured queriesinto keyword search. SCORE extracts keywords from queryresults on structured data and uses them to submit keywordqueries that retrieve supplementary information. Our ap-proach extracts keywords from the query itself. It is genericin that we aim to provide reasonable results even without thepresence of structured data and domain knowledge; however,the technique used in SCORE can serve as a supplement toour approach.

3. PROBLEM DEFINITIONWe define the keyword extraction problem as follows. Given

a structured query (in SQL, XQuery, etc.), we extract a set

of keywords from the query. These keywords are used toconstruct a keyword query that returns information poten-tially relevant to the structured query. A keyword searchon a large volume of unstructured data often returns manyresults; thus, we measure the quality of the answers usingtop-k precision — the percentage of relevant results in thetop-k results. We consider queries that do not contain dis-junctions, comparison predicates (e.g., 6=, <) or aggregation.Such queries are common in applications such as PIM anddigital libraries.

The following example shows some of the challenges weface.

Example 1. Consider a simple SQL query that asks forDataspaces papers published in 2005.

SELECT titleFROM paperWHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’

We have many options in keyword extraction and the fol-lowing list gives a few:

1. Use the whole query: “select title from paper where titleLIKE ‘dataspaces’ and year = ‘2005’ ”.

2. Use the terms in the query excluding syntactic sugar(e.g., select, from, where): “paper title +dataspacesyear +2005”. (Most search engines adopt the keyword-search syntax that requires the keyword following a “+”sign to occur in the returned documents or webpages.)

3. Use only the text values: “+dataspaces +2005”.4. Use a subset of terms in the query: “+dataspaces +2005

paper title”.5. Use another subset of terms in the query: “+datas-

paces +2005 paper”.

A human would most probably choose the last keyword set,which best summarizes the objects we are looking for. Indeed,at the time of the experiment, Google, Yahoo, and MSN allobtained the best results on the last keyword set (Google ob-tained 0.6 top-10 precision), and the first hits all mentionedthe dataspaces paper authored by Franklin et al. in 2005 (twoof the search engines returned exactly the paper as the tophit). In contrast, for the first two keyword sets, none of thethree search engines returned relevant results. For the thirdand fourth keyword sets, although some of the top-10 resultswere relevant, the top-10 precisions were quite low (Googleobtained 0.2 top-10 precisions on both keyword sets). 2

To explain the above results, we consider the possible ef-fects of a keyword. On the one hand, it may narrow downthe search space by requiring the returned documents to con-tain the keyword. On the other hand, it may distract thesearch engine by bringing in irrelevant documents that bychance contain the keyword. Ideally, we should choose key-words that are informative and representative: the keywordsshould significantly narrow down the search space withoutdistracting the search engine. We describe our keyword-extraction approach in the next two sections.

4. CONSTRUCTING KEYWORD QUERIESTo create a keyword query from a structured query Q, we

first construct Q’s query graph, and then extract keywordsfrom it. Intuitively, the query graph captures the essence ofthe query and already removes irrelevant syntactic symbols.In this section, we first define the query graph and thendescribe keyword extraction.

?paper

“2005” “Dataspaces” ?

y e a r t i t l e

t i t l e

Figure 2: The query graph for the query in Example 1.

The graph contains one instance node — paper, two value

nodes — “Dataspaces” and “2005”, and one question

node. The nodes are connected by three attribute edges.

4.1 Query GraphIn constructing the graph, we model the data as a set

of object instances and associations between the instances.Each instance belongs to a particular class and correspondsto a real-world object. An instance is described by a set ofattributes, the values of which are ground values. An as-sociation is a relationship between two instances. We canview a query as a subgraph pattern describing the queriedinstances with their attributes and directly or indirectly as-sociated instances.

Definition 4.1. (Query Graph) A query graph GQ =(V, E) for query Q is an undirected graph describing the in-stances and associations mentioned in Q.

• For each instance mentioned in Q, there is an instancenode in V , labelled with the name of the instance class.

• For each association mentioned in Q, there is an asso-ciation edge in E labelled with the association name.The association edge connects the two instance nodesinvolved in the association.

• For each ground value in Q, there is a value node inV labelled with the value, and an attribute edge in Elabelled with the attribute name. The attribute edgeconnects the value node and the node of the owner in-stance.

• For each queried attribute in Q, there is a questionnode in V labelled with “?”, and an attribute edge inE labelled with the attribute name. The attribute edgeconnects the question node and the node of the queriedinstance. 2

As an example, Figure 2 shows the query graph for Ex-ample 1.

It is straightforward to construct query graphs for triplequeries. The process is more tricky for SQL queries; forexample, a table in a relational database can either recorda set of object instances or a set of associations. In the fullversion of the paper [13], we describe how to construct aquery graph for SQL queries and XML queries.

4.2 Extracting KeywordsWe wish to include only necessary keywords rather than

adding all relevant ones. This principle is based on two ob-servations. First, a keyword often introduces distraction,so unnecessary keywords often lower the search quality byreturning irrelevant documents. Second, real-world docu-ments are often crisp in describing instances. For exam-ple, rather than saying “a paper authored by a person withname Halevy”, we often say “Halevy’s paper”. Involving“authored by”, “person” and “name” in the keyword setdoes not add much more information. We base our label

?paper (1, 0.6)

“2005” (1,0.2)

“Dataspaces” (1,0.8)

?

y e a r ( 0 . 8 , 0 . 2 )

t i t l e

( 1

, 0 . 2

) t i t l e

( 0 . 8 , 0 . 2 )

(a)

?paper (1, 0.6)

“2005” (1,0.2)


?

y e a r t i t l e

t i t l e

(b)

0.8

0.4

0.2/2=0.1 0.2/2=0.1

?paper (1, 0.6)

“2005” (1,0.2)


?

y e a r t i t l e

t i t l e

(c)

0.6/2 =0.3

0.6/2=0.3 0.6/2=0.3

Figure 3: The query graph with i-scores and r-scores for the query in Example 1. (a) The initial (i-score, r-score) pairs.

(b) The information flow representing the effect of the “Dataspaces” label. (b) The information flow representing the

effect of the “Paper” label.

selection on judging the informativeness and representative-ness of labels. We first introduce measures for these twocharacteristics, and then describe our algorithm.

4.2.1 Informativeness and representativenessIntuitively, informativeness measures the amount of in-

formation provided by a label term. For example, attributevalues are more informative than structure terms. Repre-sentativeness roughly corresponds to the probability thatsearching the given term returns documents or webpages inthe queried domain. For example, the term “paper” is morerepresentative than the term “title” for the publication do-main. We use i-score to measure informativeness and r-scoreto measure representativeness. Given a node label or edgelabel l, we denote its i-score as il, and r-score as rl. Both iland rl range from 0 to 1. Note that the representativenessof label l is the complement of l’s distractiveness, denotedas dl, so dl = 1− rl. Figure 3(a) shows the initial (i-score,r-score) pair for each label (we will discuss shortly how weinitialize the scores).

We observe that the informativeness of a label also de-pends on the already selected keywords. For example, con-sider searching a paper instance. The term “paper” is infor-mative if we know nothing else about the paper, but its infor-mativeness decreases if we know the paper is about “datas-paces”, and further decreases if we also know the paper is by“Halevy”. In other words, in a query graph, once we selecta label into the keyword set, the informativeness of otherlabels is reduced.

We model the effect of a selected label s on the i-scores ofother labels as an information flow, which has the followingthree characteristics:

• At the source node (or edge), the flow has volume rs.The reason is that the effect of s is limited to the searchresults that are related to the queried domain, and thispercentage is rs (by definition).

• The information flow first goes to the neighbor edgesor nodes (not including the one from which the flowcomes). If s is a label of an instance node, the flowvalue is divided among the neighbor edges. Specifi-cally, let n be the number of different labels of theneighbor edges, the flow volume on each edge is rs/n.The division observes the intuition that the more dis-tinct edges, the more information each edge label pro-vides even in presence of the s label, and thus the lesseffect s has on these labels.

• After a flow reaches a label, its volume decreases byhalf. It then continues flowing to the nodes (or edges)at the next hop and is divided again, until reachingvalue nodes or question nodes. In other words, s’seffect dwindles exponentially in the number of hops.Note that the flow is only affected by the r-score of

Figure 4: Extracting keywords from query graph in Fig-

ure 3(a). (a) The i-scores of the labels after selecting the

labels “Dataspaces” and “2005”. (b) The i-scores of the

labels after selecting the label “Paper”.

the source node, but not the r-scores of other nodesthat it reaches.

When we add a new label to the keyword set, we computethe effect of the label on the rest of the labels and updatetheir i-scores. Once a keyword set is fixed, the i-scores ofthe rest of the labels are fixed, independent of the order weselect the keywords. The detailed algorithm for computingi-scores is given in the full version of the paper [13].

Example 2. Consider the query graph in Figure 3(a).Figure 3(b) shows the effect of the value label “Dataspaces”on the i-scores of the rest of the nodes, and Figure 3(c) showsthe effect of the instance label “Paper”. Note that we divide0.6 by 2 rather than by 3, because the three edges are labelledby only two distinct labels. 2

4.2.2 Selecting labelsWhen we select node or edge labels, we wish to choose

those whose provided information is larger than the possibledistraction; that is, i > d = 1 − r, so i + r > 1. We selectlabels in a greedy fashion: in each step we choose the labelwith the highest i+r, and terminate when there are no morelabels with i+r > 1. Specifically, we proceed in three steps.

1. We choose all labels of value nodes. After adding eachlabel to the keyword set, we update the i-scores of therest of the nodes.

2. If there are labels satisfying i + r > 1, we choose theone with the largest i + r. We add the label to thekeyword set and update the i-scores of the rest of thenodes.

3. We iterate step 2 until no more labels can be added.

Example 3. Consider the query graph in Figure 3(a).We select labels in two steps. In the first step, we selectthe labels of all value nodes, “Dataspaces” and “2005”. Theupdated i-scores are shown in Figure 4(a). We then select la-bel Paper, and the updated i-scores are shown in Figure 4(b).After this step no more labels satisfy the condition i + r > 1so the algorithm terminates. The result keyword set is thus“Dataspaces 2005 paper”. 2

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Movie Geography Company Bibliography DBLP Car Profile

Pre

cisi

on

Top-2 Top-10 DK-Top10

Figure 5: Experimental results on six different domains.

The chart shows the top-2 and top-10 precision when we

did not apply domain knowledge, and the top-10 pre-

cision (DK-Top10) when we did apply domain knowl-

edge. Our algorithm worked fairly well in various do-

mains without domain knowledge, and the top-10 pre-

cision significantly improved when we applied domain

knowledge.

4.2.3 Initializing i-scores and r-scoresWe now discuss how to initialize the i-scores and r-scores.

When we have no domain knowledge, we assign default val-ues for different types of labels. We observe the web datafor the representativeness of different types of nodes, andassign r-scores accordingly. For i-scores, we consider valuesand the class name of the queried instance as more informa-tive and set the i-scores to 1, and consider other labels lessinformative. We will discuss the default score setting in ourexperiments in Section 5.

There are several ways to obtain more meaningful r-scoresin presence of domain knowledge, and here we suggest afew. The first method is to do keyword search on the labels.Specifically, for a label l, we search l using the unstructureddataset on which we will perform keyword search. We man-ually examine the top-k (e.g., k = 10) results and count howmany are related to the queried domain. The percentage λis considered as the r-score for the l label. Another approachis to do Naive-Bayes learning on a corpus of schemas andstructured data in the spirit of [14], but we omit the detailsfor lack of space. Note that although this training phaseis expensive, it is a one-time process and can significantlyimprove search performance.

5. EXPERIMENTAL RESULTSThis section describes a set of experiments that begin to

validate our keyword-extraction algorithm. Our goal is toshow that our algorithm performs well even without domainknowledge, and that search quality improves when domainknowledge exists.

5.1 Experiment SetupWe selected six different domains from the UW XML

repository [22] and the Niagara XML repository [16], in-cluding movie, geography, company profiles, bibliography,DBLP, and car profiles. The schemas for these domainsvary in complexity, such as the number of elements and at-tributes, and the number of children of each element.

When we selected queries, we varied two parameters in theselected queries: #values and length. The former is the num-ber of attribute values in the query, indicating the amountof value information given by the query. The latter is the

0 1 2

QUERYGRAPH VALUE VALUEQUERY VALUETABLE ALL

Top-10 precision (Length=0)

0

0.2

0.4

0.6

0.8

1

0 1

#Value

Pre

cis

ion


0

0.2

0.4

0.6

0.8

1

0 1 2

#Value

Pre

cis

ion


0

0.2

0.4

0.6

0.8

1

0 1 2

#Value

Pre

cis

ion


0

0.2

0.4

0.6

0.8

1

0 1 2

#Value

Pre

cis

ion

(a) (b)

(c) (d)

Figure 6: Top-10 precision for queries with length 0 in

(a) the movie domain and (b) the geography domain,

and with length 1 in (c) the movie domain, and (d) the

geography domain. QueryGraph beat other solutions in

most cases and the top-10 precision increased with the

growing number of attribute values. (In (a) and (b) the

ValueTable line and the QueryGraph line overlap, as the

two methods extract the same keywords.)

longest path from a queried instance (the instance whose at-tributes are queried) to other instances in the query graph,corresponding to the complexity of the structure informa-tion presented in the query. Finally, we randomly selectedtext values from the XML data for our queries. After gen-erating the keyword set from the input queries, we used theGoogle Web API to search the web.

We measured the quality of our extracted keywords bytop-k precision, which computes the percentage of the topk hits that provide information relevant to the query. Weanalyzed the results using top-2 and top-10 precision.

Finally, we set the default values for i-scores and r-scoresas follows (we used the same setting for all domains).

• i-scores: 1 for value labels and labels of queried in-stances, and 0.8 for other labels.

• r-scores: 0.8 for text-value labels and labels of asso-ciations between instances of the same type, 0.6 forinstance labels, 0.4 for association labels, 0.2 for at-tribute labels, and 0 for number-value labels.

5.2 Experimental ResultsWe validated our algorithm on six domains, and the re-

sults are shown in Figure 5. We observe that our algorithmperformed well in all domains. With our default settings fori-scores and r-scores, the top-2 and top-10 precision in differ-ent domains were similar. The average top-2 precision was0.68 and the average top-10 precision was 0.59. When weapplied domain knowledge, the top-10 precision increased39% on average.

5.2.1 Contributions of the Query GraphWe now compare QueryGraph with several other ap-

proaches that select terms directly from the query.

• All: Include all terms except syntactic symbols.

• Value: Include only ground values.

Top-10 precision (#Value=1)

0

0.2

0.4

0.6

0.8

1

1 2 3

Length

Pre

cis

ion


0

0.2

0.4

0.6

0.8

1

1 2 3

Length

Pre

cis

ion


0

0.2

0.4

0.6

0.8

1

1 2 3

Length

Pre

cis

ion


0

0.2

0.4

0.6

0.8

1

1 2 3

Length

Pre

cis

ion

0 1 2QUERYGRAPH VALUE VALUEQUERY VALUETABLE ALL

(a) (b)

(c) (d)

Figure 7: Top-10 precision of queries with one attribute

value in (a) the movie domain and (b) the geography

domain, and with two attribute values in (c) the movie

domain and (d) the geography domain. QueryGraph beat

other solutions in most cases and the top-10 precision

went down as the query length increased.

• ValueQuery: Include ground values and all table andattribute names in the SELECT-clause.

• ValueTable: Include ground values and all table namesin the FROM-clause.

We report the results on two domains: movie and geogra-phy. We observed similar trends on other domains.

Varying the number of values: We first consider theimpact of #values on keyword extraction. We consideredqueries with length 0 or 1, and varied #values from 0 to 2when it applies. Figure 6 shows the top-10 precision.

We observed the following. First, in most cases Query-

Graph obtained higher precision than the other approaches.It shows that including appropriate structure terms obtainedmuch better results than searching only the text values. Sec-ond, when the number of attribute values increases, most ap-proaches obtained better search results, but All performedeven worse because it includes distractive keywords.

Varying query length: We now examine the effect of thestructure complexity on search performance. We consideredqueries with 1 or 2 attribute values, and varied the lengthfrom 1 to 3. Figure 7 shows the results. We observed thatour algorithm again beat other methods in most cases. Asthe length of the query grew, the top-10 precision dropped.This is not a surprise as complex query structure complicatesthe meaning of the query.

6. CONCLUSIONS AND FUTURE WORKWe described an approach for extracting keyword queries

from structured queries. The extracted keyword queries canbe posed over a collection of unstructured data in order toobtain additional data that may be relevant to the struc-tured query. The ability to widen queries in this way isan important capability in querying dataspaces, that in-clude heterogeneous collections of structured and unstruc-tured data.

Although our experimental results already show that ouralgorithm obtains good results in various domains, there are

multiple directions for future work. First, we can refine ourextracted keyword set by considering the schema or maybeeven a corpus of schemas. For example, we can replace anextracted keyword with a more domain-specific keyword inthe schema; we can also add keywords selected from the cor-pus to further narrow down the search space. Second, we canuse existing structured data, as proposed in SCORE [17],to supplement the selected keyword set. Third, we can per-form some linguistic analysis of the words in the structuredquery to determine whether they are likely to be useful inkeyword queries. Finally, we would like to develop methodsfor ranking answers that are obtained from structured andunstructured data sources.

7. REFERENCES[1] S. Abiteboul and et al. The Lowell database research self

assessment. CACM, 48(5), 2005.

[2] S. Agrawal, S. Chaudhuri, and G. Das. DBXplorer: Asystem for keyword-based search over relational databases.In ICDE, 2002.

[3] R. Baumgartner, S. Flesca, and G. Gottlob. Visual webinformation extraction with lixto. In VLDB, 2001.

[4] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, andS. Sudarshan. Keyword searching and browsing indatabases using BANKS. In ICDE, 2002.

[5] S. Chaudhuri, R. Ramakrishnan, and G. Weikum.Integrating DB and IR technologies: What is the sound ofone hand clapping? In CIDR, 2005.

[6] X. Dong and A. Halevy. A platform for personalinformation management and integration. In CIDR, 2005.

[7] O. Etzioni, M. Cafarella, and D. Downey. Web-scaleinformation extraction in KnowItAll (preliminary results).In Proc. of the Int. WWW Conf., 2004.

[8] M. Franklin, A. Halevy, and D. Maier. From databases todataspaces: A new abstraction for informationmanagement. Sigmod Record, 34(4):27–33, 2005.

[9] D. Freitag and A. McCallum. Information extraction withHMMs and shrinkage. In AAAI-99 Workshop on MachineLearning for Information Extraction, 1999.

[10] A. Halevy, M. Franklin, and D. Maier. Principles ofdataspace systems. In PODS, 2006.

[11] V. Hristidis and Y. Papakonstantinou. Discover: Keywordsearch in relational databases. In VLDB, 2002.

[12] V. Hristidis, Y. Papakonstantinou, and A. Balmin.Keyword proximity search on XML graphs. In ICDE, 2003.

[13] J. Liu, X. Dong, and A. Halevy. Answering structuredqueries on unstructured data. Technical Report 2006-06-03,University of Washington, 2006.

[14] J. Madhavan, P. A. Bernstein, A. Doan, and A. Y. Halevy.Corpus-based schema matching. In ICDE, 2005.

[15] A. McCallum. Efficiently inducing features or conditionalrandom fields. In UAI, 2003.

[16] Niagara XML repository. http://www.cs.wisc.edu/niagara/data.html, 2004.

[17] P. Roy, M. Mohania, B. Bamba, and S. Raman. Towardsautomatic association of relevant unstructured content withstructured query results. In CIKM, 2005.

[18] M. Skounakis, M. Craven, and S. Ray. Hierarchical hiddenmarkov models for information extraction. In IJCAI, 2003.

[19] S. Soderland. Learning information extraction rules forsemi-structured and free text. Machine Learning,34(1-3):233–272, 1999.

[20] S. Soderland, D. Fisher, J. Aseltine, and W. Lehnert.Crystal: Inducing a conceptual dictionary. In IJCAI, 1995.

[21] SPARQL. http://www.w3.org/TR/rdf-sparql-query/, 2003.

[22] UW XML data repository. http://www.cs.washington.edu/research/xmldatasets/, 2002.

[23] Y. Xu and Y. Papakonstantinou. Efficient keyword searchfor smallest LCAs in XML databases. In Sigmod, 2005.

Twig Patterns: From XML Trees to Graphs∗

[Extended Abstract]

Benny Kimelfeld and Yehoshua SagivThe Selim and Rachel Benin School of Engineering and Computer Science

The Hebrew University of Jerusalem, Edmond J. Safra CampusJerusalem 91904, Israel

bennyk,[email protected]

ABSTRACTExisting approaches for querying XML (e.g., XPath and twig pat-terns) assume that the data form a tree. Often, however, XML doc-uments have a graph structure, due to ID references. The commonway of adapting known techniques to XML graphs is straightfor-ward, but may result in a huge number of results, where only asmall portion of them has valuable information. We propose twomechanisms. Filtering is used for eliminating semantically weakanswers. Ranking is used for presenting the remaining answers inthe order of decreasing semantic significance. We show how tointegrate these features in a language for querying XML graphs.Query evaluation is tractable in the following sense. For a widerange of ranking functions, it is possible to generate answers inranked order with polynomial delay, under query-and-data com-plexity. This result holds even if projection is used. Furthermore,it holds for any tractable ranking function for which the top-rankedanswer can be found efficiently (assuming that equalities and in-equalities involving IDs of XML nodes are permitted in queries).

1. INTRODUCTIONTwig patterns are simple tree-structured queries for XML that

include three basic language elements, namely, arbitrarynode con-ditions, parent-child edgesandancestor-descendant edges. Thesefeatures make it possible to pose queries with only a limited knowl-edge of the XML hierarchy, the names of elements and the pre-cise data stored under each element. Furthermore, fuzzy conditions(e.g., “about(axis,value)” [23]) can be used and, so, twig patternsare applicable to information-retrieval (IR) as well as database set-tings. In summary, twig patterns provide an appealing tradeoff ofexpressiveness vs. simplicity and flexibility.

Twig patterns, however, suffer from some severe drawbacks. Forone, XML documents (e.g., DBLP1 and Mondial2) are often graphsand not trees, due to ID references. Consequently, there could bemany different paths between any given pair of nodes, leading toa potentially huge number of matches for descendant edges whenapplying the conventional approach [24] of generalizing twig pat-terns to XML graphs. Our experience shows that some simple andnatural twig queries over DBLP have an unexpected huge num-ber (tens of thousands) of answers, if ID references are taken into

∗This research was supported by The Israel Science Foundation(Grants 96/01 and 893/05).1http://dblp.uni-trier.de/xml2http://www.dbis.informatik.uni-goettingen.de/Mondial

Copyright is held by the author/owner.Ninth International Workshop on the Web and Databases (WebDB 2006),June 30, 2006, Chicago, Illinois.

account. Most of these answers are derived from rather weak se-mantic relationships among XML elements. For example, nodeswith high out-degree usually represent weak semantic connectionsand, yet, they are included in many paths. As another example,long paths are commonly viewed (e.g., [4, 13, 11, 1, 15, 19]) as anindication of less meaningful relationships, but XML graphs havemany such paths. Thus, querying XML graphs using twig patternsis often ineffective.

In this paper, we investigate essential properties for facilitatingeffective querying of XML graphs. In particular, we present a lan-guage that incorporates filtering and ranking mechanisms while re-taining the simplicity and efficiency of twigs. In our framework,nodes and edges of XML graphs have weights, which is not new.But our treatment of weights is novel in two aspects. First, whena user formulates a query, she can override and fine-tune some ofthese weights. This is an essential feature, since different usershave diverging views regarding the strength of some semantic con-nections. Second, weights are used not just for ranking, but also forfiltering; that is, the user can decide from the outset that she is notinterested in paths above a certain length.

The most important feature of our language is the ability to gen-erate the top-k answers quickly, according to a wide range of rank-ing functions. In principle, the following two properties are nec-essary for efficiently finding the top-k answers. First, the rankingfunction should be efficiently computable. Second, it should bepossible to generate the first (i.e., top-ranked) answer efficiently.We show that in our language, these two properties are also suffi-cient for generating answers in ranked order with polynomial delay(in the size of query and the XML graph) between consecutive an-swers. We identify a large family of ranking functions that have theabove properties. These functions satisfy a monotonicity conditionthat makes it possible to compute the top answer bottom-up.

It is shown that our complexity results hold even if projectionsare used and duplicates are eliminated. Note that simply applyingprojection in the last step is not enough for deriving this result,since intermediate results could be exponentially larger than thefinal one. Moreover, this result holds even if the ranking functiondepends on data that is projected out and, hence, not included inthe final result.

Earlier work on ranked evaluation over XML [22, 14, 6, 3, 2,20] considered only trees and, as a result, did not address the mainissues of this paper. The approaches of [22, 14] are based on thethreshold algorithm[10], which is designed for joining objects thathave shared keys and, hence, cannot guarantee polynomial delaywhen evaluating tree-structured queries. The work of [3, 2, 20]considered ranking functions that measure the amount ofstructuralrelaxationsof patterns and, so, are different from ours. Evaluatingtwig pattern over general graphs was discussed in [24], but they did

D. WilliamsCQs

ComputingOptimizingXPath

A. Levy

Equivalence

J. Cohen B. Smith1995CQs

19922002 1997 D. Williams Rewriting

of CQs

3.author 12.year

15.cites

8.inproceedings

19.author

18.article

21.year17.year6.year

7.cites

11.title 14.author

13.inproceedings2.article

16.title

1.bib

20.title

9.author5.title4.author

10.cites

Figure 1: An XML graph G

not address the issue of dealing with many answers that have vary-ing degrees of semantic strength. In particular, they neither con-sidered ranking nor provided a formal upper bound on the runningtime of their algorithms. A language that extends XPath to generalXML graphs was discussed in [9]. That language enables the userto bound the length of a path by specifying how manystepsthepath can follow, where a step is an XPath axis. However, that lan-guage does not handle weights or ranking functions. Furthermore,the complexity of the language is not stated. The work of [21] isfundamental to evaluating regular path expression over graph data-bases. It should be noted that defining edge conditions by meansof regular expressions can be easily incorporated in our query lan-guage, while preserving all of our complexity results. In a recentpaper [17], we considered ordered evaluation of acyclic conjunc-tive queries. Some of the techniques presented there are needed forproving the complexity results of this paper.

2. XML GRAPHS AND TWIG PATTERNSAn XML graph is directed and rooted. The nodes have labels

and possibly values. There are two types of edges:elementedgesandreferenceedges. The element edges form a spanning tree thathas the same root as the XML graph itself. We use the terminologyof XML nodesandXML edgesto distinguish them from nodes andedges of trees that represent queries.

E 2.1. Figure 1 shows an XML graph G that representsbibliographic data. Values are written in italic and reference edgesare depicted by dotted lines. This graph comprises publications ofvarious types that are described by labels, e.g., article, inproceed-ings, etc. The label of the root is bib. Citations are represented byreference edges. Each node is identified by a unique integer.

We now describe twig patterns and their conventional evaluationover XML graphs. Atwig is a directed treeT with child edgesand descendantedges. Each noden of T has an attached unarypredicate, denoted bycond(n), that is defined over XML nodes.For example,cond(n) can specify that the label of the XML nodeis article, or that the string “query optimization” should appear inthe value attached to the XML node. It can also be any booleancombination of the two. In general,cond(n) can be any conditionthat is checkable in polynomial time.

Following [24], a match of a twig in an XML graph is defined inthe usual way, except that element edges and reference edges aretreated equally. Formally, amatchof a twig T in an XML graphGis a mappingM from the nodes ofT to the nodes ofG, such that:(1) For each noden of T, the XML nodeM(n) satisfiescond(n); (2)For each child edgee = (n1, n2) of T, the XML graphG has either

an element or a reference edge fromM(n1) to M(n2); and (3)Foreach descendant edgee = (n1,n2) of T, the XML graphG has adirected path (comprising edges of any type) fromM(n1) to M(n2).The path must have at least one edge, but could start and end in thesame XML node, due to cycles.3 We useM(T,G) to denote the setof all matches from the twigT to the XML graphG.

E 2.2. Consider the bibliographic data of Figure 1. In-tuitively, an evidence for apotential connectionbetween two pub-lications is the existence of a directed path that starts in one pub-lication and leads to the second publication through a number ofcitations. Accordingly, the twig T1 of Figure 2(a) is a query aboutpotential connections between the publications written by B. Smithon equivalence of queries and other work that has been done since1997. Note that a child edge is represented by a single line whereasa descendant edge is depicted by a double line. The direction ofedges is from top to bottom.

All the predicates of T1 have at least one conjunct that refers tothe label of the corresponding XML node. This conjunct is rep-resented by either anexplicit label (e.g., article) or the wild-cardsymbol∗ that is always satisfied (i.e., the same astrue). Some pred-icates have a second conjunct (the∧ symbol is not used explicitly).

A match of T1 in G must map the node labeled with article tonode 18 of G, in order to satisfy the two predicates attached to thechildren of the former node. The predicate of the root of T1 hasone conjunct, namely∗, and it must have children labeled with titleand year, where the value of the year node is at least 1997. Thus,the root of T1 can be mapped to either node 2 or 8, since there is adirected path from each of them to node 18.

3. DBTWIG QUERIESWhen applying twigs to XML graphs (rather than trees), the user

might be overwhelmed with a large number of answers. More-over, the semantic strength varies widely between these answers.That is, some matches are meaningful while many others have lowor no semantic value. For example, consider again the twigT1

of Figure 2(a). A match of this twig in a large XML document(e.g., DBLP) can connect some publication to an article by Smiththrough a very long sequence of citations or through a book thathas a huge bibliography on a wide range of topics.

In this paper, we propose an approach that uses bothfiltering andranking. Filtering excludes matches that are not likely to be mean-ingful answers. Ranking is used to produce answers in the order

3In principle, we can allowdescendant-or-selfedges and also letthe user specify that a path must start and end in different nodes.

(a) The twigT1

contains(‘Equivalence’)

≥ 1997

equals(‘B. Smith’)

year

article

author

*

title

title

(b) The DBTwigT2

≥ 1997

contains(‘Equivalence’)

〈∗:0,cite:1,book:∞

〉2

equals(‘B. Smith’)

title

*

year

title

article

author

Figure 2: A twig and a DBTwig

of decreasing significance. The challenge is to develop effectivefiltering and ranking mechanisms that retain the simplicity and effi-ciency of twigs. These mechanisms should be built into the system,so that formulating queries will be an easy task. However, since thenotion of a “semantically significant” answer varies from one per-son to another, the query language should enable users to tweak thefiltering and ranking mechanisms.

3.1 DBTwig PatternsAs illustrated in Example 2.2, a simple yet effective way of filter-

ing out semantically weak matches is by specifying upper boundson the length of paths that correspond to descendant edges. Othertypes of conditions are also possible provided that they can bechecked in polynomial time. For example, a user can specify thatthe path corresponding to a given descendant edge should have atmost one node labeled with cites and no node labeled with book.Some seemingly simple conditions cannot be checked efficiently(e.g., it is NP-complete to determine whether two given nodes ofan XML graph are connected by a path that has no repeated labels).

To fully utilize conditions on lengths of paths, we enrich our datamodel by adding weights to the nodes and edges of XML graphs.These weights are also used by the ranking functions that are dis-cussed later. Various considerations can be used in order to deter-mine the weights. For example, the weight of a specific XML nodecan indicate the importance of that node. Similarly, the weight ofan XML edge could be derived from the strength of the semanticconnection represented by that edge. We will not get into furtherdetails, since this has already been investigated (e.g., [4]).

We denote the weights of an XML nodev and an XML edgee by w(v) andw(e), respectively; note that these weights are non-negative numbers. The weight of a path is the sum of the weightsof all the edges and all the interior nodes.

To simplify the presentation, we use only upper bounds on theweights of paths as filtering conditions. We also provide a mech-anism that enables users to fine-tune the weights of nodes that arepredefined in XML graphs, by means of specifying weight schemesin the queries that they pose. Each descendant edge may have itsown weight scheme that applies only to paths corresponding to thatedge. The formal definitions are as follows.

We introducedistance-bounding twigs(abbr.DBTwigs) that gen-eralize ordinary twigs with two additional features. First, eachedgee has adistance boundthat is denoted bydb(e) and is a non-negative number (the default value is∞). Second, each descendantedge may have aweight scheme, denoted byws(e), that assigns

nonnegative weights to XML nodes according to their labels. Theweight schemews(e) is a sequence of the form〈l1:w1, . . . , lk:wk〉,where eachl i is either a label or∗, and eachwi is a non-negativenumber.

Consider an XML graphG and a DBTwigT. The definitionof a matchM of T in G is modified as follows. In Part (2), ife = (n1,n2) is a child edge ofT, then there must be an edgefrom M(n1) to M(n2) that has a weight of no more thandb(e). InPart (3), ife = (n1,n2) is a descendant edge ofT, then there mustbe a path fromM(n1) to M(n2) that has a weight of no more thandb(e). If, in addition, the descendant edgeehas the weight schemews(e) = 〈l1:w1, . . . , lk:wk〉, then the weight of the correspondingpath is calculated after changing the weights of the interior nodesas follows. Suppose thatv is an XML node with the labell. Theweight ofv is replaced with the weightwi , wherel i is eitherl or ∗.If more than onel i matches the labell of v, then the last one inws(e)is used. If nol i matchesl, the weight ofv remains unchanged.

E 3.1. Consider the DBTwig T2 of Figure 2(b) and theXML graph G of Figure 1. Suppose that all the XML nodes haveweight 1 and all the XML edges have weight 0. Let e denote the onlydescendant edge of T2. The weight scheme of e assigns 0 to all thelabels of G, except for the labels cite and book that get the weights1 and∞, respectively. Since db(e) = 2, a path P of G matches e ifand only if it has at most two interior nodes labeled with cite andno interior node labeled with book. Thus, there is only one matchof T2 in G and it maps the root of T2 to node 8. The root cannotbe mapped to node 2 of G, because the path of G from node 2 tonode 18 contains three interior nodes that are labeled with cite.

In practice, users do have to formulate explicit DBTwigs. Theactual query language may consist of simpler (and less expressive)features that can be translated into DBTwigs. For example, the usermay specify, for a given descendant edge, a set of forbidden labels(i.e., labels that cannot appear in a matching path). She can alsospecify a set of labels and an upper bound on the total number ofoccurrences of labels from that set in a matching path.

In [9], filtering conditions are upper bounds on the number ofstepsneeded to connect two XML nodes, where a step is an XPathaxis. We believe that our approach of using weights and upperbounds makes it easier to express natural conditions for eliminatingsemantically weak matches.

3.2 Ranking of MatchesDBTwigs, in comparison to twigs, eliminate matches that do not

satisfy the distance bounds. This could still leave a large number ofanswers with varying degrees of semantic strength. Hence, the an-swers should be presented to users in ranked order. In this section,we discuss ranking functions.

Formally, aranking functionρ defines a numerical value, de-noted byρ(M,T,G), whereM is a match of a DBTwigT in anXML graphG. The valueρ(M,T,G) determines thequality of M.The matches should be presented to the user inranked order, thatis, if ρ(M1,T,G) > ρ(M2,T,G), thenM1 should appear beforeM2.

We now give several examples of ranking functions. Note thatsome of these functions may be unintuitive, but they are neededlater in order to demonstrate our results. Consider an XML graphG and a DBTwigT. Let e be a descendant edge ofT. We useGe

to denote the XML graph that is obtained fromG by replacing theweights of nodes according to the weight scheme ofe (if e doesnot have a weight scheme, thenGe = G). Given two nodesv1 andv2 of G, thee-distancefrom v1 to v2, denoted byDe

G(v1, v2), is theminimum weight among all paths ofGe from v1 to v2 (recall that a

path must have at least one edge and, hence, is a cycle ifv1 = v2).If e is a child edge ofT, thenDe

G(v1, v2) is just the weight of theXML edge fromv1 to v2, if this edge exists, and is∞ otherwise.

Note that, in principle, a descendant edge of a DBTwig may havetwo distinct weight schemes: one is used for filtering out matches,as described earlier, and the other is used for ranking.

The first ranking function that we define,ρΣd, is the sum of thee-distances, between the corresponding images underM, over alledgese of T. Formally, we useE(T) to denote the set of edges ofT and define

ρΣd(M,T,G) = −∑

(n,n′)∈E(T)

D(n,n′)G (M(n),M(n′)).

Note that the values returned byρΣd are negative, since a largerweight indicates a weaker semantic connections and, hence, theranking should be lower.

For the next ranking function,P(T) denotes the set of all pathsP of T, such thatP starts at the root ofT and ends at a leaf. Givena pathP ∈ P(T), we use (n,n′) ∈ P to denote that the edge (n,n′) isin P. The ranking functionρh returns the maximum weight amongall paths ofG that correspond, underM, to paths from the root ofTto some leaf. It returns a negative value that is defined as follows.

ρh(M,T,G) = − maxP∈P(T)

∑

(n,n′)∈P

D(n,n′)G (M(n),M(n′))

Given a matchM of a DBTwigT in an XML graphG, letN(M,T,G)denote the minimal numberN′, such thatG contains a subgraphG′

with N′ nodes andM is a match ofT in G′. Then we define

ρms(M,T,G) = −N(M,T,G).

The next ranking function,ρ#l(M,T,G), counts the number of dis-tinct labels that appear in the image ofM, i.e.,

ρ#l(M,T,G) = |l | n is a node ofT andl is the label ofM(n)|.

Finally, suppose that the nodes ofT contain vague conditions, e.g.,“about(‘XML’)” as used in NEXI [23]. We assume that there is aranking function f that, given a noden of T and a nodev of G,returns a numeric valuef (n, v) that measures the extent to whichvmatchesn. We useN(T) to denote the set of nodes ofT and define

ρfn(M,T,G) =∑

n∈N(T)

f (n,M(n)).

Note that the functionf is predefined by the system, but it could usesome specific information that is attached to noden (for example,the keywords that are specified incond(n)).

Ranking is not merely sorting according to the filtering condi-tions. A ranking function can be defined over the whole image of amatch; for example, the ranking functionρΣd is defined as the sumof thee-distances. A filtering condition, on the other hand, refersjust to a single edge (n1,n2) of a DBTwig (e.g., the distance betweenthe images of the nodesn1 andn2 is no more than 5). Thus, rank-ing functions are not just for sorting the answers. In fact, they arealso an advanced form of filtering, since they can be used for elim-inating answers that have a rank that is below a given threshold.The user may also specify that only the top-k answers should begenerated, e.g., by using mechanisms like SQL’sORDER BY clausecombined with theSTOP AFTER operator [7, 8].

3.3 ProjectionsIn many cases, the user is not interested in seeing the whole

match of a DBTwig in an XML graph, but rather only wants toget a subset of the nodes. Formally, given a DBTwigT, the usermay specify asequence of projected nodes, i.e., a sequencep =

〈n1, . . . ,nk〉 of nodes ofT. For a matchM of T in an XML graphG, we useπp(M) to denote the tuple〈M(n1), . . . ,M(nk)〉. GivenT,p andG, the goal is to generate the tuplesπp(M) for all matchesMof T in G.

One way of handling projections is by applying them as a post-processing phase, i.e., after generating the matches (in a ranked or-der). This approach, however, is inherently inefficient, since (expo-nentially) many matches may result in the same answer. Therefore,it is important to develop algorithms that can apply projections (andeliminate duplicate answers) as early as possible.

3.4 DBTwig QueriesFormally, aDBTwig queryis a tripleQ = 〈T, p, ρ〉, such thatT is

a DBTwig, p is a projection sequence andρ is a ranking function.Given a DBTwig queryQ = 〈T, p, ρ〉 and an XML graphG, there-sult of Q in G, denoted byQ(G), consists of all projectionsπp(M),whereM is a match ofT in G, i.e.,Q(G) = πp(M) | M ∈ M(T,G).To take the ranking function into account, the answers should begiven to the user in the order obtained from the following process.

1. ComputeM(T,G).

2. SortM(T,G) according toρ and letM1, . . . ,Mn be the re-sulting sequence.

3. Apply projection to obtain the sequenceπp(M1), . . . , πp(Mn).

4. Remove duplicates fromπp(M1), . . . , πp(Mn), i.e., delete everyπp(Mi), such thatπp(Mi) = πp(M j) for somej < i.

The above oder can be equivalently defined as follows. Lett1, . . . , tmbe a sequence that consists of all the tuples ofQ(G) without dupli-cates. First, we defineρi (1 ≤ i ≤ m) to be the maximal rank ofany match that producesti by projection. That is,ρi is the maximalnumberr, such that there exists a matchM ∈ M(T,G) that satisfiesπp(M) = ti andρ(M,T,G) = r. Now, we say that the sequencet1, . . . , tm is in ranked orderif for all 1 ≤ i ≤ j ≤ m, it holds thatρi ≥ ρ j . The goal is to generate the tuples ofQ(G) in ranked order.

4. COMPLEXITYA complexity analysis of query languages over XML graphs

must take into account the fact that the number of answers couldbe huge. Thus, an evaluation algorithm should be deemed effi-cient only if it computes the answers incrementally in ranked order,rather than merely generating the whole result before sorting it.

In this section, we consider the complexity of evaluating DBTwigqueries. Consider a queryQ and an XML graphG. Polynomial run-ning timeis not a suitable yardstick for measuring the efficiency ofalgorithms for evaluatingQ, since the number of tuples inQ(G)could be exponential in the size ofQ (even ifG is a tree). The re-sults of [25] imply thatQ(G) can be evaluated inpolynomial totaltime, i.e., the running time is bounded by a polynomial in the com-bined size of the input (i.e.,Q andG) and the output (i.e.,Q(G)).(We assume that the conditions attached to nodes can be computedin polynomial time in the size of the input.) Polynomial total time,however, is not good enough for the task of evaluating DBTwigqueries, for the following reasons. An algorithm that runs in poly-nomial total time (e.g., the one of [25]) requires generating all theanswers before we can be certain that the top-ranked answer (or thetop-k answers) have already been found, let alone before we canstart producing the answers in ranked order. Thus, the user gets thefirst few answers only after the whole result (which could be verylarge) is generated and sorted. Moreover, recall that the positionof a tuple in the ranked order is determined by the maximal rank

over all matches that yield this tuple when projection is applied.Consequently, in order to sortQ(G), we need an additional step ofcomputing for each answer the maximal rank among all matchesthat generate it. So, we should develop evaluation algorithms thatenumerate the answers in ranked order withpolynomial delay[12],that is, the time between two consecutive tuples is polynomial inthe size of the input.

Consider a ranking functionρ. Theρ−E problem is definedas follows. Given a DBTwig queryQ = 〈T, p, ρ〉 and an XMLgraphG, enumerate all the tuples ofQ(G) in ranked order. Anotherproblem of interest,ρ−T, is the restriction ofρ−E to the firsttuple, namely, given a DBTwigT and a an XML graphG, find atop-ranked match, i.e., a matchM ∈ M(T,G), such thatρ(M) ≥ρ(M′) for all M′ ∈ M(T,G).

To explain our complexity results, let us first consider rankingfunctions that do not have efficient evaluation algorithms.

P 4.1. The problemsρms−E andρ#l−E cannotbe solved with polynomial delay, unless P=NP.

The above proposition is a direct corollary of the fact that the prob-lemsρms−T andρ#l−T are NP-hard. The first ranking func-tion,ρms, is even intractable to compute; that is, the following prob-lem is NP-complete: given an XML graph (or tree)G, a DBTwigT, a matchM ∈ M(T,G) and a numberr, determine whetherρms(M,T,G) ≥ r. Note, however, that the ranking functionρ#l isclearly computable in polynomial time. In the sequel, we only con-sider ranking functions that can be computed in polynomial time inthe size of the input (i.e.,Q andG).

Next, we show that, under reasonable assumptions, tractabilityof finding the top-ranked match is not only necessary but also suffi-cient for enumerating all the answers in ranked order with polyno-mial delay. As a corollary, in the next section, we show that thereare efficient evaluation algorithms for the other ranking functionspresented earlier.

The following theorem shows that, for a ranking functionρ, theρ−E problem can be reduced to theρ−T problem. This theo-rem can be proved by adapting the procedure of Lawler [18] (whichis a generalization of the work by Yen [26, 27]) for computing thetop-k solutions to discrete optimization problems. To obtain thisresult, we need to assume the following. First, each XML node hasa uniqueid and the conditions attached to a noden of a DBTwigcan include conjuncts of the form:id(n) = i andid(n) , i (note thatthese conjuncts can be computed in polynomial time in the size ofthe input). Second, the ranking of an answer is not changed as aresult of adding conjuncts of the above form. More formally, weassume that if the DBTwigT′ is obtained fromT by adding con-juncts of this type, thenρ(M,T,G) = ρ(M,T′,G) for every matchM ∈ M(T,G) ∩M(T′,G).

T 4.2. The following are equivalent, under query-and-date complexity, for a ranking functionρ.

• ρ−T can be solved in polynomial time.

• ρ−E can be solved with polynomial delay.

4.1 Monotonic Ranking FunctionsIn this section, we present a family of ranking functionsρ, such

that a top-ranked match of a DBTwig in an XML graphG can becomputed efficiently. Thus, by Theorem 4.2, there are efficientevaluation algorithms for DBTwig queries that use these rankingfunctions. First, we need some notation.

Consider a DBTwigT and a noden of T. An n-branch BofT is a subtree ofT that consists ofn and all the descendants of

T − B

n4 n5

n6

n3

n8

n0

n1

n2

n7

n9

T

B n3

n4

n0

n1

n2 n9

Figure 3: An example of a branch

one child ofn. In other words,B is obtained from the subtree ofT that is rooted atn by pruning all the children ofn except one.We useT − B to denote the DBTwig that is obtained fromT byremoving all the nodes ofB, except for its root. As an example, theleft part of Figure 3 shows a DBTwigT and onen3-branchB that issurrounded by a dotted polygon. The right side of this figure showsthe DBTwigT − B.

Consider a DBTwigT, an XML graphG and a ranking functionρ. Let n be a node ofT andB be ann-branch ofT. Note that eachof T, B andT − B is a DBTwig. Also note thatn is the only nodethat belongs to bothB andT − B. Let MT−B be a match ofT − Bin G and MB be a match ofB in G, such thatMT−B(n) = MB(n).We useMB ⊕ MT−B to denote the match ofT in G that is obtainedby combiningMB and MT−B as follows. If n belongs toT − B,then (MB ⊕ MT−B)(n) = MT−B(n); otherwise (i.e., ˆn belongs toB),(MB ⊕ MT−B)(n) = MB(n)

Consider a DBTwigT and an XML graphG. We consider rank-ing functionsρ that are monotonic in the following sense. If, ina given match, we replace the mapping of one branch ofT witha mapping that has a higher rank (over the given branch), thenthe ranking of the whole match can only improve. Formally, aranking functionρ is branch-monotonicw.r.t. G andT if the fol-lowing holds. For all nodesn of T and n-branchesB of T, ifM1

B,M2B ∈ M(B,G), M′ ∈ M(T − B,G), M1

B(n) = M2B(n) = M′(n)

andρ(M1B, B,G) ≥ ρ(M2

B, B,G), thenρ(M1B ⊕ M′,T,G) ≥ ρ(M2

B ⊕

M′,T,G). We say thatρ is branch-monotonic if it is monotonicw.r.t. all XML graphsG and all DBTwigsT. For example, theranking functionsρΣd, ρh andρfn are branch-monotonic.

If ρ is branch-monotonic, then we can compute the top-rankedmatch in polynomial time in the size ofQ andG. As a corollary toTheorem 4.2, we can also enumerate answers of DBTwig queriesin ranked order with polynomial delay.

T 4.3. Consider an XML graph G, a DBTwig query Q=〈T, p, ρ〉 and a ranking functionρ that is branch-monotonic w.r.t. Gand T. Then the following hold under query-and-data complexity.

• The top-ranked match inM(T,G) can be found in polyno-mial time;

• Q(G) can be enumerated in ranked order with polynomialdelay.

C 4.4. If ρ is branch-monotonic, thenρ−E can besolved with polynomial delay, under query-and-data complexity.

5. CONCLUSIONWhen querying XML graphs, one has to deal with a huge number

of answers that have varying degrees of semantic strength. The so-lution that we have presented incorporates several ideas. First, thenodes and edges of XML graphs have weights, but the user is ableto do some fine tuning by overriding these weights. Second, theweights are used not just for ranking, but also for an a priori elim-ination of matches that map some pairs of adjacent nodes (from

the DBTwig) to XML nodes that do not satisfy the filtering condi-tions (i.e., distance bounds) of the corresponding edges. Third, fora wide range of ranking functions, the answers can be enumeratedin ranked order with polynomial delay, under query-and-data com-plexity, even if projection is used and duplicates are eliminated.

The branch-monotonic ranking functions can combine a varietyof measures. The weights introduce a database point of view, sincethey measure the semantic strength of connections. We can com-bine this measure with a functionf (such as the one given towardthe end in Section 3.2) that determines, for a noden of a DBTwigT,the relevance of the imageM(n). Note thatf is a function of bothn andM(n) and, in particular, it can use some information attachedto n (e.g., keywords) in order to calculate an IR score (e.g., by ap-plying to M(n) a formula based on tf/idf). The function f can alsotake into account a score based on a link (i.e., XLink or ID ref-erence) analysis, similarly to PageRank [5]. Thus, the followingbranch-monotonic ranking functionρ can combine a variety of IRmethods, which are based on the relevance of XML nodes to key-words (e.g., [14, 22]), with the notion of graph proximity (e.g., [4,11, 15]) that is derived from the weights. Note that in the formulabelow,λ is a constant parameter that satisfies 0< λ < 1.

ρ(T,M,G) = λρΣd(M,T,G) + (1− λ)ρfn(M,T,G)

The expressiveness of DBTwigs can be easily extended whilepreserving our complexity results. In particular, it is possible to at-tach constraints of any type to edges, provided that these constraints(on pairs of XML nodes) can be computed in polynomial time inthe size of the input. Thus, the constraints could be, for example,regular path expressions. We believe, however, that it is simpler andmore flexible to specify paths (that match edges of the DBTwig) byusing weight schemes rather than regular path expressions.

Our notion of a match is a natural generalization of the usualone. Namely, a match is a mapping from the nodes of a givenDBTwig to XML nodes, such that the images of adjacent nodes(from the DBTwig) satisfy certain conditions. These conditionscould refer to the paths connecting the XML nodes (e.g., the imagesare connected by a path with a weight of at most 5). Sometimes theuser might want to see not just the match itself but also thewitnesspaths, that is, paths showing that the conditions are indeed satisfied.Similarly, some ranking functions (e.g.,ρΣd) refer to paths and theuser might want to see the witness paths that actually determine theranking of the match. An answer is derived (by projection) from amatch that satisfies the filtering conditions and has some ranking.However, the witness paths of the filtering are not necessarily thesame as those of the ranking. In many cases, the witness paths ofthe ranking also satisfy the distance bounds. This is the case, forexample, if the ranking function is of the form described above (andthe same weight schemes are used for the filtering and ranking).But this is not always the case. An interesting research problem isto develop a notion of a match guaranteeing that the witness pathsof the ranking always satisfy the filtering conditions.

Finally, we note that the work on keyword proximity search of [15,16] deals with a different problem. There, a query is just a list ofkeywords and answers need not match any specific pattern.

6. REFERENCES[1] S. Agrawal, S. Chaudhuri, and G. Das. DBXplorer: enabling

keyword search over relational databases. InSIGMOD, 2002.[2] S. Amer-Yahia, S. Cho, and D. Srivastava. Tree pattern

relaxation. InEDBT, 2002.[3] S. Amer-Yahia, N. Koudas, A. Marian, D. Srivastava, and

D. Toman. Structure and content scoring for XML. InVLDB,2005.

[4] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, andS. Sudarshan. Keyword searching and browsing in databasesusing BANKS. InICDE, 2002.

[5] S. Brin and L. Page. The anatomy of a large-scalehypertextual web search engine.Computer Networks,30(1-7), 1998.

[6] N. Bruno, N. Koudas, and D. Srivastava. Holistic twig joins:optimal XML pattern matching. InSIGMOD, 2002.

[7] M. J. Carey and D. Kossmann. On saying ”enough already!”in SQL. InSIGMOD, pages 219–230, 1997.

[8] M. J. Carey and D. Kossmann. Reducing the braking distanceof an SQL query engine. InVLDB, pages 158–169, 1998.

[9] S. Cassidy. Generalizing XPath for directed graphs. InExtreme Markup Languages, 2003.

[10] R. Fagin, A. Lotem, and M. Naor. Optimal aggregationalgorithms for middleware.J. Comput. Syst. Sci., 66(4),2003.

[11] V. Hristidis, Y. Papakonstantinou, and A. Balmin. Keywordproximity search on XML graphs. InICDE, 2003.

[12] D. S. Johnson, M. Yannakakis, and C. H. Papadimitriou. Ongenerating all maximal independent sets.InformationProcessing Letters, 27, March 1988.

[13] V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan,R. Desai, and H. Karambelkar. Bidirectional expansion forkeyword search on graph databases. InVLDB, 2005.

[14] R. Kaushik, R. Krishnamurthy, J. F. Naughton, andR. Ramakrishnan. On the integration of structure indexes andinverted lists. InSIGMOD, 2004.

[15] B. Kimelfeld and Y. Sagiv. Efficient engines for keywordproximity search. InWebDB, 2005.

[16] B. Kimelfeld and Y. Sagiv. Finding and approximating top-kanswers in keyword proximity search. InPODS, 2006.

[17] B. Kimelfeld and Y. Sagiv. Incrementally computing orderedanswers of acyclic conjunctive queries. InNGITS, 2006.

[18] E. L. Lawler. A procedure for computing the k best solutionsto discrete optimization problems and its application to theshortest path problem.Management Science, 18, 1972.

[19] W.-S. Li, K. S. Candan, Q. Vu, and D. Agrawal. Retrievingand organizing web pages by “information unit”. InWWW,2001.

[20] A. Marian, S. Amer-Yahia, N. Koudas, and D. Srivastava.Adaptive processing of top-k queries in XML. InICDE,2005.

[21] A. O. Mendelzon and P. T. Wood. Finding regular simplepaths in graph databases.SIAM J. Comput., 24(6), 1995.

[22] M. Theobald, R. Schenkel, and G. Weikum. An efficient andversatile query engine for TopX search. InVLDB, 2005.

[23] A. Trotman and B. Sigurbjornsson. Narrowed ExtendedXPath I (NEXI). In INEX, pages 16–40, 2004.

[24] Z. Vagena, M. M. Moro, and V. J. Tsotras. Twig queryprocessing over graph-structured XML data. InWebDB,2004.

[25] M. Yannakakis. Algorithms for acyclic database schemes. InVLDB, 1981.

[26] J. Y. Yen. Finding thek shortest loopless paths in a network.Management Science, 17, 1971.

[27] J. Y. Yen. Another algorithm for finding thek shortestloopless network paths. In”Proc. 41st Mtg. OperationsResearch Society of America”, volume 20, 1972.

The Meaning of Erasing in RDF under theKatsuno-Mendelzon Approach

In Memory of Alberto O. Mendelzon

Claudio Gutierrezc Carlos Hurtadoc Alejandro Vaismanc

cDepartment of Computer Science, Universidad de Chilecgutierr,churtado,[email protected]

ABSTRACTThe basic data model for the Semantic Web is RDF. In thispaper we address updates in RDF. It is known that the se-mantics of updates for data models becomes unclear whenthe model turns, even slightly, more general than a sim-ple relational structure. Using the framework of Katsuno-Mendelzon, we define a semantics for updates in RDF. Par-ticularly we explore the behavior of this semantics for the“erase” operator (which in general is not expressible in RDF).Our results include a proposal of sound semantics for RDFupdates, a characterization of the maximal RDF graph whichcaptures exactly all consequences of the erase operation ex-pressible in RDF, and complexity results about the compu-tation of this graph and updates in RDF in general.

1. INTRODUCTIONThe Semantic Web is a proposal oriented to represent

Web content in an easily machine-processable way. The ba-sic layer of the data representation for the Semantic Webrecommended by the World Wide Web Consortium (W3C)is the Resource Description Framework (RDF) [12]. TheRDF model is more than a simple relational structure; itsexpressivity turns more general the existential conjunctivefragment of first order logic by adding transitivity of somepredicates and inheritance axioms. From a database pointof view, it can be viewed as an extension of a representationsystem along the lines of naive tables without negation [1].

In this paper we concentrate on the problem of updatingRDF data. In the last two years the semantic web com-munity has shown an increasing interest in this problem.However, the existing proposals have so far ignored the se-mantic problems associated to the presence of blank nodesand of RDFS vocabulary with built-in semantics [15, 17, 22,14], and tackled the subject from a syntactical point of view.Related to the update problem in RDF, some studies haveaddressed changes in an ontology [13, 19], and more recently,


Supported by Millennium Nucleus Center for Web Research, GrantP04-67-F, Mideplan, Chile, and the ProjectProcesamiento y AnalisisSemantico de Servicios Web, Fondecyt No. 1050642.

the representation and querying of temporal information inRDF [6] has been also studied.

1.1 The Problem of Updates in RDFUpdates and Revision. The semantics of updates for data

models becomes difficult when the model turns –even slightly–more general than a simple relational structure [4]. Forknowledge bases, the abstract general problem of updat-ing is: what should be the result of changing a theory Twith a sentence ϕ? As Katsuno and Mendelzon [10] argued,the answer to this problem depends on the application athand. There is a fundamental distinction between update(now in a technical sense) and revision [11, 10]. Updatemeans bringing the knowledge base up to date when theworld described by it changes; revise means incorporatingnew information obtained about a static world. This dis-cussion is relevant when facing updates in the RDF model.Thus, the distinction between update and revision becomesof central importance. On the one hand, one of the maindesign goals of the RDF model is allowing distributed revi-sions of the knowledge base in the form of addition of in-formation in a monotonic way [20]. By some classic resultsof Gardenfors [3], the notion of revision becomes trivial inthis setting. On the other hand, when viewing RDF from adatabase point of view (i.e., huge but delimited repositoriesof metadata, like metadata for a library, for instance), thenotion of update becomes relevant. In this paper we con-centrate on this latter notion, and follow the approach ofKatsuno and Mendelzon [10].

Updates in RDF. Management, in particular maintain-ability, of RDF data needs a well defined notion of update.The problem becomes relevant since the standardization ofa query language for RDF [16]. We will show that the prob-lem of characterizing these changes in RDF is far from beingtrivial and raises interesting practical and theoretical issuesthat, so far, have been overlooked.

Consider for example the case of a web music store thatuses Semantic Web technology for making it easier to findinformation about artists depending on their music styles.This is a very dynamic environment, where artists and mu-sic styles are continuously being updated. Figure 1 showsa small portion of this web site, where sc means “subclas-sOf”, type indicates an instance of a class, and an edgebetween two nodes represents a triple of the form, for in-stance (a, sc, c). Suppose we want to delete all triples con-taining the value artist in Figure 1 (a). The result, clearly,is the one shown in Figure 1 (b), where dashed lines indicate

artist

J. Page

guitar player

performer

type

type

sc

sc

sc

artist

J. Page

guitar player

performer

type

type

sc

sc

sc

(b)(a)

Figure 1: Deleting all triples containing “artist”.

guitar player

performer

type

type

sc

sc

sc

(b)

artist

J. Page

guitar player

performer

type

type

sc

sc

sc

(a)

J. Page

artist

Figure 2: Deleting the triple (guitarplayer, sc, artist).

the deleted arcs and nodes. However, if we want to deletethe triple (guitarplayer, sc, artist), a reasonable semanticsfor this operation must ensure that the triple above can-not be deduced from the updated database. This semanticsyields two possible results, depicted in Figures 2 (a) and(b). Additionally, we have to decide what to do with thetriple (J.Page, type, artist): was it inserted directly, or wasdeduced from the triples (J.Page, type, guitar player) and(guitar player, sc, artist)? (see Section 3). In the formercase, it should stay; in the latter it should be deleted. Whatis to be done? Expressing the new scenario is beyond theexpressivity of RDF. One of the goals of this paper is togive a sound semantics for this operation. In this version,we concentrate on ground graphs (i.e., RDF graphs withoutblank nodes) and the operation of erase. In this direction,we characterize the formulas expressible in RDF which re-main logical consequences of a graph G after erasing from itanother graph H.

The paper is organized as follows. In Section 2 we dis-cuss related work. Section 3 reviews RDF concepts andpresents a formalization of RDF. In Section 4 we introduceour semantics for updates based on the Katsuno-Mendelzonapproach. Section 5 presents a characterization of erasingin RDF. Section 6 studies the complexity of the update anderase operations proposed. We conclude in Section 7.

2. RELATED WORKUpdates in knowledge bases and representation

systems. The semantics of an incomplete database (i.e.a relational databases containing incomplete information)is the set of all of its possible states. Updates are thendefined over this interpretation. Thus, a deletion wouldconsist in eliminating a tuple from every possible databasestate. Analogously, an insertion must be applied to all pos-sible states. The notion of representation system comes into determine the degree in which the system is capable ofexpressing the new state of the database. In short, a repre-sentation system is composed of a set of tables, a mapping

from tables to instances, and a set of allowed operations (likeinsertion, join, and so on). If the exact result of all allowedexpressions can be computed, we have a strong representa-tion system. Otherwise, we may limit to obtain approximateanswers (and we have a weak representation system). A re-sult by Imielinsky and Lipski [9] states that representationsystems based on naive tables (a relation containing vari-ables and constants) are weak for the standard relational op-erations not including negative selection nor set difference.In [1] this result is extended to consider updates. They showthat, for naive tables, adding the insertion operation yieldsa weak representation system. However, if Ω contains pos-itive selection, projection, and deletion, we do not have aweak representation system. This result is explained by thefact that naive tables do not handle disjunction. The con-clusion is that naive tables are adequate for querying butnot for updates. As RDF can be considered an extension ofa representation system based on the notion of naive tableswithout negation, we conclude that, in order to be appropri-ate for handling update and erase, RDF would need negationand disjunction.

Updates in graph databases. Updates have been alsostudied in the context of graph databases. This is relevantto our work because the RDF model is closely related tograph data models [2]. In particular, the Graph-based datamodel (GDM) and its update language GUL, introduced byHidders [8] are based on pattern matching. Two basic oper-ations are defined in GUL: addition and deletion. In the caseof deletion, there is a base pattern which contains a core pat-tern. The nodes, edges and class names that are not in corepattern are deleted for every matching of the base pattern.This approach is a promising line to implement in RDF thesemantic notions presented in this paper.

Updates in web databases: XML and RDF. XMLUpdates have been extensively addressed in the XML world.Tatarinov et al [18] proposed an XQuery extension that hasbeen the first step leading to the proposal currently understudy at the W3C [21]. The W3C specified the proper-ties required for update operators in XML. RDF Updateshave recently attracted the attention of the RDF commu-nity. Nevertheless, all proposals have so far ignored the se-mantic problems arising for updates associated to the exis-tence of blank nodes and the presence of RDFS vocabularywith built-in semantics. Sarkar [17] identified five updateoperators, also based on [18]. These operators are: Add,InsertAfter, Delete, Remove, and Replace, and presented al-gorithms for the Add and InsertAfter operations. Zhan [22]proposed an extension to RQL, and defined a set of updateoperators. Both works define updates in an operational way,and semantic issues are considered to a very limited extent.Another approach was proposed by Ognyanov and Kiryakov[15]. The main statement of this approach is that the twobasic types of updates in an RDF repository are the addi-tion and the removal of a statement (triple). Then, the workturns simply into a description of a graph updating proce-dure, where labels indicate a version of the graph at a certainmoment in time. Finally, Magiridou et al [14] recently pro-posed RUL, a declarative update language for RDF. Theydefine three operations, insert, delete and modify. The pro-posal is based on RQL and RVL. The main drawback of thiswork is that it does not consider blank nodes and schema

updates, i.e., the issues that raise the most interesting the-oretical issues. Leaving these issues out turns the problemtrivial. Thus, the authors basically end up dealing withchanges to instances of classes.

3. PROBLEM STATEMENT

3.1 Review of Basic RDF NotionsWe present here a streamlined version of RDF. The mate-

rial of this subsection can be found in [5] with more detail.There is an infinite set U (RDF URI references); an infi-

nite set B = Nj : j ∈ N (Blank nodes); and an infiniteset L (RDF literals). A triple (v1, v2, v3) ∈ (U ∪ B) × U ×(U ∪ B ∪ L) is called an RDF triple. In such a triple, v1

is called the subject, v2 the predicate and v3 the object. Weoften denote UBL the union of the sets U , B and L.

Definition 1. An RDF graph (just graph from now on) isa set of RDF triples. A subgraph is a subset of a graph. Theuniverse of a graph G, universe(G), is the set of elements ofUBL that occur in the triples of G. The vocabulary of G,denoted voc(G), is the set universe(G)∩ (U ∪L). A graph isground if it has no blank nodes. We also define the union oftwo graphs G1, G2, denoted G1 ∪ G2, as the set theoreticalunion of their sets of triples.

RDFS Vocabulary. There is a set of reserved words definedin the RDF vocabulary description language, RDF Schema–just rdfs-vocabulary for us– that may be used to describeproperties like attributes of resources (traditional attribute-value pairs), and also to represent relationships between re-sources. In this paper –following [5]– we will restrict to afragment of this vocabulary which represents the essentialfeatures of RDF. It is constituted by the classes rdfs:Class[class] and rdf:Property [prop], and by the properties rdfs:range [range], rdfs:domain [dom], rdf:type [type], rdfs: sub-ClassOf [sc] and rdfs:subPropertyOf [sp]. We present a se-mantics for this fragment, based on the following set of rules.

GROUP A (Subproperty)

(a, type, prop)

(a, sp, a)(1)

(a, sp, b) (b, sp, c)

(a, sp, c)(2)

(a, sp, b) (x, a, y)

(x, b, y)(3)

GROUP B (Subclass)

(a, type, class)

(a, sc, a)(4)

(a, sc, b) (b, sc, c)

(a, sc, c)(5)

(a, sc, b) (x, type, a)

(x, type, b)(6)

GROUP C (Typing)

(a, dom, c) (x, a, y)

(x, type, c)(7)

(a, range, d) (x, a, y)

(y, type, d)(8)

Definition 2 (Deductive System). Let G be a graph.For each rule r : A

Babove, define G `r G ∪ B iff A ⊆ G.

Also define G `s G′ iff G′ is a subgraph of G.Define G ` G′ if there is a finite sequence of graphs

G1, . . . , Gn such that (1) G = G1; (2) G′ = Gn; and (3)for each i, either, Gi `r Gi+1 for some r, or Gi `s Gi+1.

Definition 3. Let G be an RDF graph. The closure ofG, denoted cl(G), is the maximal set of triples G′ overuniverse(G) plus the rdfs vocabulary such that G′ containsG and G ` G′.

In the next section we will need the logical notion of amodel of a formula (an RDF graph). The model theory ofRDF (given in [7]) follows standard classical treatment inlogic with the notions of model, interpretation, and entail-ment (denoted |=). See [5] for details. Throughout thispaper we will work with Herbrand models, which turn outto be special types of RDF graphs themselves. For a groundgraph G, a Herbrand model of G is any RDF graph that con-tains cl(G) (in particular, cl(G) is a minimal model). From[5] the following results can be deduced.

Proposition 1. G |= H iff cl(H) ⊆ cl(G).

Theorem 1. The deductive system of Definition 2 is soundand complete for |=. That is, G1 ` G2 iff G1 |= G2.

3.2 The ProblemConsider the simplest problem related to the erase oper-

ation that we can find in RDF, and the associated semanticand complexity issues, namely: delete a tuple t or a set oftuples H, from an RDF graph G. To illustrate with a con-crete example, let G = (a, sc, b), (b, sc, c), and considerthe following problems:

Problem 1: Erase (a, sc, c) from G. Result: should (a, sc, c)be derivable from G after the deletion?. If not, should wedelete (a, sc, b) or (b, sc, c)?

Problem 2: Erase (a, sc, b) from G. Result: before dele-tion, (a, sc, c) was implicit in G (it was entailed by G).Should it still be in G after deletion?. Should deletion besyntax-independent?

Problem 3: Erase (a, sc, b), (b, sc, c) from G. Result:is it the empty set?. Either (a, sc, b) or (b, sc, c)?. Again,should (a, sc, c) be in the result?

A standard approach in KB is to ensure that, after dele-tion, the statement t should not derivable from G, and thatthe deletion should be minimal. The result should be ex-pressed by another formula, usually in a more expressivelanguage. For example, if in G above we erase (a, sc, c), the“faithful” result should be something like (a, sc, b)∨(b, sc, c).But the problem is that we do not have disjunction in RDF.

In this paper we explore the behavior of the Katsuno-Mendelzon approach to define a semantics for update inRDF and concentrate on the characterization of the eraseoperation and its consequences over the formulas expressiblein RDF. We will limit ourselves to study these questions forthe case of ground graphs.

4. SEMANTICS OF UPDATE AND ERASEIn this section we address the problem introduced in Sec-

tion 3.2. We characterize update and erase operations (i.e.,adding or deleting an RDF graph H to/from another RDFgraph G) using the Katsuno-Mendelzon approach, that is,identifying a theory with the set of models that satisfies it.

4.1 Katsuno-Mendelzon approach for RDFThe K-M approach to updates can be characterized as

follows from a model-theoretic point of view: for each modelM of the theory to be changed, find the set of models of thesentence to be inserted that are “closest” to M . The set ofall models obtained in this way is the result of the changeoperation. Choosing an update operator then reduces tochoosing a notion of closeness of models [4].

Definition 4. The operator , representing the update ofG with H, is defined as follows:

Mod(G H) =[

m∈Mod(G)

min(Mod(H),≤m), (9)

where min(Mod(H),≤m) is the set of models of H minimalunder ≤m, which is a partial order depending on m.

We will use the following notion of distance between mod-els, which gives us an order.

Definition 5 (Order). Let G, G1, G2 be models of RDFgraphs with voc(G) ⊆ voc(G2), voc(G1), and let G be a set ofmodels of RDF graphs. The symmetric difference betweentwo models G1 and G2, denoted as G1 ⊕G2, is (G1 \G2) ∪(G2 \G1). Then : (1) define a relation ≤G such that G1 ≤G

G2 (G1 is “closer” to G than G2) if and only if G1 ⊕ G ⊆G2 ⊕ G; (2) G1 is ≤G-minimal in G if G1 is in G, and ifG2 ∈ G and G2 ≤G G1 then G2 = G1.

4.2 The notion of UpdateWorking with positive theories, the problem of update

is fairly straightforward. The only concern is keeping theprinciple of irrelevance of syntax, i.e., the update should notdepend on the particular syntax of the sentences involved.

Theorem 2. Given the RDF graphs G and H, the updateof G with H, G H, is expressible as another RDF graph.Formally, m ∈ (G + H) if and only if m ∈ Mod(G H).

Proof. If m ∈ Mod(G + H) then m ∈ Mod(G) andm ∈ Mod(H). Then mG = m is the model in Mod(G) suchthat m is ≤mG -minimal in Mod(H). Then, m ∈ Mod(G H). Conversely, let m ∈ Mod(H) and mG ∈ Mod(G) suchthat m is ≤mG -minimal. Then mG ⊆ m: otherwise, (m ∪mG) <mG m, contradiction. Hence m |= (G + H).

Proposition 2. Let D, G, H be RDF graphs. Then, thedefinition of update satisfies the following statements: (1)D G |= G; (2) if D |= G then D G ≡ D; (3) if G1 ≡ G2

and H1 ≡ H2 then G1H1 ≡ G2H2 (irrelevance of syntax);(4) (D G) + H |= D (G + H); (5) if D G |= H andD H |= G then D G ≡ D H. (Note that these statementsare an analogous, in our setting, of the K-M postulates forupdate not involving disjunction).

4.3 The notion of EraseErasing statements from G means adding models to Mod(G).

Definition 6. The operator •, representing the erasure, isdefined as follows: for graphs G and H, the semantics ofG •H is given by:

Mod(G •H) = Mod(G) ∪[

m∈Mod(G)

min(((Mod(H))c,≤m)

(10)

and ( )c denotes complement. In words, the models of (G •H) are those of G plus the collection of models mH 6|= Hsuch that there is a model m |= G for which mH is ≤m-minimal among the elements of Mod(H)c. Compare identity(9).

Proposition 3. Let D, G, H be RDF graphs. Then, thedefinition of erase satisfies the following statements: (1)D |= D •G; (2) if D 6|= G then D •G ≡ D; (3) D •G 6|= G;(4) if G1 ≡ G2 and H1 ≡ H2 then G1 • H1 ≡ G2 • H2;(5) (D • G) + G |= D. (Note that these statements are ananalogous, in our setting, of the K-M postulates for erasenot involving disjunction).

Representing faithfully in the RDF language the notionsof update and erase defined above is not possible in thegeneral case. The Update operator presents no difficulties,and it is in fact an RDF graph (formula). However, theErase operator presents problems, arising from the fact thatwe have neither negation nor disjunction in RDF.

5. CHARACTERIZING DELETION IN RDFThe following notion is the key to obtain a workable char-

acterization of erase (expressed previously only in terms ofsets of models), based on the behavior over the formulasexpressible in RDF.

Definition 7 (Erase Candidates). Let G and H be RDFgraphs. Then the set of erase candidates of G and H, de-noted ecand(G, H), is defined as the set of maximal sub-graphs G′ of cl(G) such that G′ 6|= H.

Proposition 4. Let G, H be RDF graphs. If m 6∈ Mod(G)and m ∈ Mod(G•H), then there is a unique E ∈ ecand(G, H)with m |= E.

Proof. Let m 6|= H and mG |= G such that m is ≤mG -minimal. Consider the subgraph E = (m ∩ cl(G)). Clearlym |= E, and hence E 6|= H. Claim: E ∈ ecand(G, H).Assume E is not maximal with the property of not entailingH. Then there is t ∈ (cl(G) \ E) with E ∪ t 6|= H. Thenconsider m′ = cl(m∪t). We have that m′ 6|= H and m′ <mG

m, a contradiction. The uniqueness of E follows from itsmaximality.

Proposition 4 states that ecand(G, H) defines a partitionin the set of models defined by G •H, and each such set is“represented” by the RDF graph E. Note that the smallerthe size of ecand(G, H), the better the approximation toG •H of each element in ecand(G, H), being the limit The-orem 3:

Theorem 3. If ecand(G, H) = E, then (G •H) ≡ E.

We are ready for the theorem characterizing the RDF sub-graph of cl(G) which captures exactly all consequences ofG •H expressible in RDF:

Theorem 4. For all formulas F of RDF,T

ecand(G, H) |=F if and only if Mod(G •H) ⊆ Mod(F ).

The proof follows from Proposition 4.

5.1 Computing Erase CandidatesFrom the discussion above, it follows the relevance of com-

puting erase candidates to approximate G•H. We will needthe notion of proof sequence based on the deductive systemfrom Section 3.

Definition 8 (Proof Sequence). Let G, H be RDF graphs.Then a proof sequence of H from G is a sequence of RDFgraphs H1, . . . , Hn such that:

1. H1 ⊆ G and H ⊆ Hn.

2. For each pair Hi+1 and Hi one of the following holds:

(a) (Standard rules) Hi+1 = Hi ∪ t, for t1, t2 ∈ Hi

and t1 t2t

is the instantiation of a rule (see rulesin Secc 3).

(b) (Mapping rule) µ(Hi+1) = Hi for a mapping µ.

Because of Theorem 1, proof sequences are sound andcomplete for testing entailment.

The first element in a proof sequence P will be calledbase(P ). base(P ) is a minimal base for the graphs G, H iffit is minimal under set inclusion among the bases of proofsof H from G, that is, for every proof P ′ of H from G,base(P ) ⊆ base(P ′). We refer to the set of minimal basesof G, H as minbases(G, H).

We use the following notion of a cover for a collection ofsets. A cover for a collection of sets C1, . . . , Cn is a set Csuch that C ∩ Ci is non-empty for every Ci.

Lemma 1. Let G, H be RDF graphs. C is a cover for theset minbases(G, H) iff (G \ C) 6|= H.

Proof. (If) If C is not a cover, then there is a minimalbase B ∈ (G \ C). Then there is a proof P for H fromG\P , where base(P ) = B, contradicting that (G\C) 6|= H.(Only If) Suppose not. Then there is a proof P for H fromG \ C. We have that there is no minimal base B such thatB ⊆ base(P ). Hence base(P ) is a minimal base for G, H,contradicting that C is a cover for all minimal bases.

Theorem 5. Let G, H, D be RDF graphs. Then C is aminimal cover for the collection of sets minbases(G, H) iff(i) (G \C) 6|= H and (ii) G \C is a maximal subgraph G′ ofG such that G′ 6|= H.

Proof. Follows from Lemma 1. It can be easily verifiedthat the minimality of C implies the maximality of G \ Cand vice versa.

Corollary 1. Let G, H, D be RDF graphs. E ∈ ecand(G, H)if and only if E = cl(G) \ C for C a minimal cover for thecollection of sets minbases(cl(G), H).

6. COMPLEXITYIn this section we study the complexity of computing an

erase operation (computing update is straightforward). Weshow that computing erase candidates reduces to findingcuts in a class of directed graphs that encode RDF graphs.

Finding erase candidates reduces to compute RDF graphswe call delta candidates. We denote dcand(G, H) the set ofRDF graphs (cl(G)\G′) : G′ ∈ ecand(G, H). Each of thegraphs in dcand(G, H) will be called a delta candidate forG, H. Notice that the delta candidates can be alternativelydefined as minimal graphs D ⊆ cl(G) such that (cl(G)\D) 6|=H.

6.1 Minimal CutsWe will need the following standard notation related to

cuts in graphs. Let (V, E) be a directed graph. A set of

edges C ⊆ E disconnects two vertices u, v ⊆ V iff eachpath from u to v in the graph passes through a vertex inC. In this case C is called a cut. This cut is minimal ifthe removal of any node from C does not yield another cut.We also generalize cuts for sets of pairs of vertices yieldingmulticuts. A minimal multicut for a set of pairs of nodes(u1, v1), (u2, v2), . . . , (un, vn) is a minimal set of edges thatdisconnects ui and vi. Given a graph G and a set of pairsof nodes N , we denote by MinCuts(N, G) the set of minimalmulticuts of N in G. Notice that when N has a single pairMinCuts(N, G) is a set of cuts.

We will show that, in general, an element in dcand(G, H)is the union of two cuts: one defined in a directed graph wewill denote G[sc], and the other in a graph denoted G[sp].

Given an RDF graph G, denote by G[sc] = (N, V, λ) thelabeled directed graph defined in Table 1. For each triple ofthe form specified in the first column of the table, we havethe corresponding edge in V . The set of nodes N consists ofall the nodes mentioned in the edges given in the table. Thefunction λ : E → G maps each edge in E to a triple in G,according to Table 1. The labeled directed graph G[sp] isdefined similarly in Table 1. As notation, we use the lettersn and m to refer distinctly to nodes in G[sc] and G[sp],respectively.

Triple Edge in G[sc](a, sc, b) (na, nb)(a, type, b) (na, nb)(a, type, class) (na, na)

Edges in G[sp](p, sp, q) (mp, mq)(a, p, b) (ma,b, mp) (mb,a, mp)(p, dom, c) (mp, mc,dom)(p, range, c) (mp, mc,range)

Table 1: Description of the graphs G[sc] (above) andG[sp] (below).

For an RDF triple t we define a set of pairs of nodes thatspecified the cut problems related to the erase of the triplet from an RDF graph G. The set t[sc, G] will contain pairsof nodes in the graph G[sc] and the set t[sp, G] will containpairs of nodes in G[sp]. Formally, we denote by t[sc, G]the pairs of nodes (u, v), u, v nodes in G[sc] as describedin Table 2 (second column). Analogously, we define t[sp, G]using Table 2 (third column). As an example, for a tripleof the form (a, sc, b) in a graph G, (a, b, c)[sc, G] containsthe single pair of nodes (na, nb), where both nodes na, nb

belong to G[sc]. Notice that there is always a single pair ofnodes in t[sc, G], and the only case where t[sc, G] may haveseveral pairs of nodes is when t is of the form (a, type, b).

For an RDF graph U , U [sc, G] is the union of the setsti[sc, G], for the triples ti in U .

6.2 Complexity of EraseFor the sake of space, we will present here the case where

the graph to erase has a single triple. Our results can be eas-ily generalized to computing erase candidates ecand(G, H)for the case where H has several triples.

A delta dcand(G, t), will be defined with two sets of graphs,denoted dcandsc(G, t) and dcandsp(G, t). For each D ∈dcand(G, t), D = D1 ∪ D2, for of any two RDF graphs

Triple t ∈ G t[sc, G] t[sp, G](a, sc, b) (na, nb) –(a, sp, b) – (ma, mb)(a, p, b) – (mab, mp)(a, type, c) (na, nc) pairs (ma,x, mc,dom) for all x

pairs (mx,a, mc,range) for all x

Table 2: Pair of nodes t[sc, G] and t[sp, G] associatedto a triple t in a graph G.

D1 ∈ dcandsc(G, t) and D2 ∈ dcandsp(G, t).

Proposition 5. Let G be an RDF graph, G′ = cl(G), andconsider a triple t. The following holds: (i) dcandsc(G, t) =MinCuts(G′[sc], t[sc, G′]); (ii) dcandsp(G, t) = MinCuts(G′[sp],t[sp, G′]).

Proof. (Sketch) Corollary 1 can be expressed in termsof delta candidates as follows. Let G, H, D be RDF graphs.Then D ∈ dcand(G, H) iff D is a minimal cover set forminbases(cl(G), H).

We sketch the proof for the case where t is of the form(a, sc, b). In this case we can verify that minbases(G′, H)corresponds to the RDF triples associated to the simplepaths (paths with no cycles) from na to nb in G[sc]. There-fore, it follows that the minimal cuts MinCuts(G′[sc], t[sc, G′]are exactly the delete candidates dcand(G, t). Notice thatin the case where t is of the form (a, sc, b), dcand(G, t) =dcandsc(G, t), because, in this case dcandsp(G, t) is empty.

Theorem 6. Let G, H be ground RDF graphs, and t bea ground a triple. The problem of deciding whether E ∈ecand(G, t) is in PTIME.

Proof. (Sketch) From Proposition 5, the problem re-duces to determine if D = cl(G) \ E is a delta candidatein dcand(G, t). Let G′ = cl(G), G′ can be computed inpolytime. The triples in D yield two sets of edges dcandsc

and dcandsp in the graphs G′[sc] and G′[sp], respectively.Thus we have to test (i) whether t[sc, G′] is a minimal cutin G′[sc] and (ii) whether t[sp, G′] is a minimal (multi)cutin G′[sp]. In both cases the test can be done in PTIME bysimple reachability analysis in the graphs G′[sc] and G′[sp],respectively. Testing whether a set of edges S is a mini-mal cut for (v1, u1) in a graph GR = (V, E) can be doneby performing polytime reachability analysis in the graphas follows. To test whether S is a cut, delete from E theedges in S, and test whether v1 reaches u1 in this new graph.To test minimality, do the same test for each set of edgesS′ ⊂ S resulting from removing a single edge from S. S isminimal iff all of the S′s are not cuts. We proceed similarlyfor testing if a set of edges is a minimal multicut.

7. CONCLUSIONSIn this paper we considered an RDF database as a knowl-

edge base, and treated the problem of updating the databasein the framework of the traditional proposals of knowledgebase updating. We characterized the update of a graph Gwith a graph H within the framework of the K-M approach,and defined the meaning of the update and erase operationsin RDF on a solid foundation (considering, in the latter case,that we do not have negation nor disjunction in RDF). Wealso provided algorithms for calculating these operations, in-cluding a detailed complexity analysis. In future work we

will develop an update language for RDF, and extend ourstudy to more expressive languages, like OWL.

8. REFERENCES[1] S. Abiteboul and G. Grahne. Update semantics for

incomplete databases. In International Conference on VeryLarge Databases(VLDB’85), Stockholm, Sweden, 1985.

[2] R. Angles and C. Gutierrez. Querying RDF data from agraph database perspective. In European Conference on theSemantic Web (ECSW’05), pages 346–360, 2005.

[3] P. Gardenfors. Conditionals and changes of belief. ActaPhilosophica Fennica, Vol. XX, pages 381–404, 1978.

[4] G. Grahne, A.O. Mendelzon, and P. Z. Revesz.Knowledgebase transformations. Journal of Computer andSystem Sciences, Vol 54(1), pages 98–112, 1997.

[5] C. Gutierrez, C. Hurtado, and A.O. Mendelzon.Foundations of semantic web databases. In 23rd.Symposium on Principles of Database Systems, pages95–106, 2004.

[6] C. Gutierrez, C. Hurtado, and A. Vaisman. Temporal RDF.In European Conference on the Semantic Web (ECSW’05)(Best paper award), pages 93–107, 2005.

[7] Patrick Hayes(Ed.). RDF semantics. W3C Working Draft,October 1st., 2003.

[8] A.J.H. Hidders. A graph-based update language forobject-oriented data models. Doctoral Thesis, TechnischeUniversiteit Eindhoven, The Netherlands, 2001.

[9] T. Imielinski and W. Lipski. Incomplete information inrelational databases. Journal of ACM, 31(4), pages761–791, 1984.

[10] H Katsuno and A. O. Mendelzon. On the difference betweenupdating knowledge base and revising it. In InternationalConference on Principles of Knowledge Representation andReasoning, pages 387–394, Cambridge, MA, 1991.

[11] A.M. Keller and M. Winslett. On the use of extendedrelational model to handle changing incompleteinformation. IEEE Trans. on Software Engineering,SE-11:7, pages 620–633, 1985.

[12] O. Lassila and R.(Eds.) Swick. Resource descriptionframework (RDF) model and syntax specification. W3CWorking Draft, 1998.

[13] A. Maedche, B. Motik, L. Stojanovic, R. Studer, andR. Volz. Establishing the semantic web 11: Aninfrastructure for searching, reusing, and evolvingdistributed ontologies. In International Conference onWorld Wide Web, pages 439–448, 2003.

[14] M. Magiridou, S. Sahtouris, S. Christophides, andM Koubarakis. RUL: A declarative update language forRDF. In International Semantic Web Conference, pages506–521, 2005.

[15] D. Ognyanov and A. Kiryakov. Tracking changes in rdf(s)repositories. In EKAW’02, pages 373–378, Spain, 2002.

[16] E. Prud’Hommeaux and A. Seaborne (Eds.). SPARQLquery language for rdf. W3C Working Draft, July, 2005.

[17] S. Sarkar and H.C. Ellis. Five update operations for rdf.Rensselaer at Hartford Technical Report, RH-DOES-TR03-04, 2003.

[18] I. Tatarinov, G. Ives, A. Halevy, and D. Weld. UpdatingXML. In Proceedings of ACM SIGMOD Conference, pages413–424, Santa Barbara, California, 2001.

[19] U. Visser. Intelligent information integration for thesemantic web. Lecture Notes in Artificial Intelligence(3159), 2004.

[20] World Wide Web Consortium. RDF semantics, 2004.http://www.w3.org/TR/rdf-mt.

[21] World Wide Web Consortium. XQuery Update FacilityRequirements (working draft), 2005.http://www.w3.org/TR/2005/WD-xquery-update-requirements-20050603/.

[22] Y. Zhan. Updating RDF. In 21st Computer ScienceConference, Rensselaer at Hartford, 2005.

Amoeba Join: Overcoming Structural Fluctuations inXML Data

Taro L. SaitoUniversity of TokyoJSPS Research Fellow

[email protected]

Shinichi MorishitaUniversity of Tokyo

[email protected]

ABSTRACTThere are no universal rules for organizing data in XML.Consequently, semantically identical XML documents mayhave different structures; we call this structural fluctuationin XML. Finding all the structural fluctuations in an XMLdocument requires verbose path expression queries. To over-come this problem, we developed a novel query processingprimitive, called amoeba join. Amoeba join does not re-quire explicit path structures in query statements; tag namesor keywords are sufficient to perform searches. This paperintroduces several amoeba join processing algorithms anddemonstrates their performance.

1. INTRODUCTIONXML is now a global standard for describing structured

data. In 2005, many vendors, including the Big Three sup-pliers of relational databases (IBM, Microsoft, and Oracle),launched new XML database engines. This trend will cer-tainly result in increased XML capability, not only as a textformat, but also as data stored in database management sys-tems (DBMSs). The potential handling size and capacity ofXML data is huge. Nevertheless, inconveniences have al-ready materialized during the evolution toward this reality.Before the databases are explored using queries, it is difficultto find target elements because such large XML databaseshave complex and unclear path structures. In addition, itis difficult to write a query without knowledge about pathstructures.

A summary of path structures such as DataGuides [3]shows all existing paths in an XML database, but this isnot sufficient to comprehend the actual structure of datain the target context; a path occurring in one context maynot appear in a different context. An XML schema resolvesthis uncertainty in path occurrence to some extent, but notentirely. Since the XML schema allows the optional appear-ance of elements, unlike schemata in relational databases,path structures may still vary depending on context.

An XML document without a schema is like a black boxfor the user, but writing path queries for specific contextsis very difficult. In contrast, relational databases requireschemata, making it considerably easier to find tables of therequired data. Therefore, considerable effort and intensiveresearch has been put into XML structure indices, allow-ing the processing of descendant axis queries that require

Copyright is held by the author/owner.Ninth International Workshop on the Web and Databases (WebDB 2006)June 30, 2006, Chicago, Illinois

less structural knowledge. However, this trend might notbe addressing the real goal. People are enthusiastic aboutquerying data structures, when they should be focusing onstructured data. If we pursue writing paths in order to per-form queries, we must first somehow acquire knowledge ofpath structures. To find the path structure for some specificcontext, we must issue queries to a ’black box’ database.This is a chicken-or-egg situation — which comes first, thepath structure or the query?

One way to overcome this problem is by relaxing XPathqueries [1]. For example, the XPath query org/manager canbe relaxed to org//manager by replacing the parent-child axiswith the ancestor-descendant axis. This process reduces theburden of writing exact path query matches. However, thefollowing example illustrates a problem not normally iden-tified in the context of query relaxation:

<org department="head office"><manager>David</manager><location>Tokyo</location>

</org>

⇒<manager person="David"><org department="head office">

<location>Tokyo</location></org>

</manager>

Figure 1: An example of structural fluctuation

The two XML fragments shown above (Figure 1) repre-sent data with the same meaning, but with different struc-tures: the hierarchical order of org and manager tag is re-versed. We call this structural fluctuation in XML. It is astructural variation in XML fragments that have the sameelements (e.g., org and manager) and different structures.

XPath can track structural fluctuations, using disjunc-tion in path patterns. For example, finding element pairsof org and manager in Figure 1 requires the concatenationof at least two types of XPath query; /org/manager and/manager/org. In general, however, query statements becomemore complex because there could be many more elementsto query and structural fluctuations in the document. Thus,the number of XPath expressions required to cover all pos-sible path structures can easily balloon. For example, thenumber of query trees required to cover all structural fluc-tuations consisting of org, manager, and location elements is33−1 = 9 (Figure 2) because it is identical to the enumera-tion of all labeled trees with n nodes, when the differencesin axis (// or /) are ignored. Its enumeration size is knownto be nn−1. Concatenating all nn−1 query trees into a singleregular path expression can be a daunting task.

Our research was motivated by this inconvenient methodof path expressions. In this paper, we introduce the no-tion of an amoeba, which represents an equivalent class of

org

manager location

org

manager

location

org

location

manager

manager

org location

manager

org

location

manager

location

org

Figure 2: An amoeba (org, manager, location) coversnn−1(32 = 9) structural fluctuations.

structural fluctuations. An amoeba (org, manager, location)

groups XML fragments that match one of the query treesillustrated in Figure 2. Applying this notion of an amoeba,we devised a novel query-processing method, amoeba join,which makes it possible to query XML databases withoutexplicitly specifying path structures; tag names (and key-words) are sufficient to perform searches.

Even when using a schema or DataGuides [3], learningthe entire XML data path structure is more difficult thancreating a list of all types of tag and attribute names. We in-vestigated a benchmark XML document provided by XMark[8] (scalability=1.0, 114 MB). The document contained 83tags and attribute names and 548 distinct paths. Therefore,database users should have much more information on tagsand attributes than path structures, which may differ de-pending on context. This is why query processing withoutexplicit path structures, which is achieved by amoeba join,is promising.

This paper makes the following contributions:

• It introduces amoeba join as a method to capture struc-tural fluctuations in XML data without explicitly spec-ifying them using path queries.

• It presents three essential amoeba join processing al-gorithms and their experimental evaluations.

Semantics of XML StructureHere, we demonstrate that XML structure provides surpris-ingly few semantics, clarifying the need to handle structuralfluctuations in XML. First, consider the encapsulation ofdata with a tag. This process is normally used to groupdata elements or text data. In XML, it inevitably leadsto a structural hierarchy among the data elements, whichmay or may not express high and low ranks. The followingXML example (Figure 3) represents organization data withboth superficial and semantic hierarchy order between themanagers David and Michael:

<org department="head office"><manager> David </manager><org department="R&D">

<manager> Michael </manager></org>

</org>

Figure 3: Nested organization data

It is also possible to reverse the hierarchical order. In thefollowing example, the belongs to tag is used to switch thehierarchical positions of the managers David and Michaelwithout losing the semantic relationship:

<org department="R&D"><manager> Michael </manager><belongs_to>

<org department="head office"><manager> David </manager>

</org></belongs_to>

</org>

Furthermore, when a tag is used to group elements, thereare generally no semantic ranks among the elements, as thestructural change of org and manager in Figure 1 illustrates.The org element has the manager information, and vice versa.

Therefore, hierarchical order does not directly representthe semantic relationship between data elements; seman-tic relationships become clear only when they are explicitlygiven. Consequently, it is natural to assume that XML datawith neither explicit semantics nor any schema may containsome structural fluctuation. In our proposed method, weassume that XML databases contain arbitrarily structuredinformation, and the user picks up node tuples matchingan amoeba. Then, the retrieved data is transformed into aformat designated by the the user.

The rest of this paper is organized as follows: Section2 discusses the essential differences between the proposedmethod and other related studies. Section 3 introduces thenotion of amoeba and amoeba join, and Section 4 presentsseveral amoeba join processing algorithms. Section 5 demon-strates the performance of these queries. Finally, Section 6presents our conclusions and directions for future work.

2. RELATED WORKQuerying an XML database without knowledge of path

structure was first addressed by [7], and refined by [11]. Bothstudies used variations of the least common ancestor method(lca) to find the smallest tree containing all target nodes.Among the lca nodes that connect common node sets (tagsor keywords), the one that forms the smallest subtree isdefined as the smallest least common ancestor (slca) [11].The precise definition of slca is as follows: given k nodesets D1, D2, . . . , Dk, for example, D1 and D2 are node setsmatching XPath //org, //manager, a node v belongs to theslca if v ∈ lca(D1, . . . , Dk) and for all u ∈ lca(D1, . . . , Dk),v is not an ancestor of u. In summary, a subtree rooted froman slca node does not contain other lca nodes.

One problem with this approach is that the slca mightbe the root node of an XML document. XML is a singlerooted tree, so every node set can be connected using theroot node. In addition, when the slca approach is appliedto the previous example (Figure 3) to find pairs of org andmanager elements, it misses the pair of org and manager Davidbecause these contain the subtree rooted by the slca of org

and manager Michael. In general, XML data semantics aretoo complex to be detected automatically using simple rules.In addition, although the method of [11] is optimized tosearch for slca nodes, it focuses mainly on keyword versusdatabase queries. It cannot detect element inclusion rela-tionships. For example, it can find the keyword “Michael”,but is not capable of assuring that “Michael” is containedwithin the manager tag.

XRank [5] applies keyword-based search to XML. It lo-cates XML elements that contains all given keywords. Un-like slca, XRank is aware of recursion of XML structure.However, it suffers from two drawbacks: (1) it does not dis-tinguish tag name from textual content; (2) it cannot expresscomplex query semantics [7].

Finding an exact match in XPath queries can be diffi-cult, and thus studies have investigated ways to relax thecondition of rigorous matching in regular path expressions[1]. The types of relaxation are explained in [1]. These in-clude dropping or weakening predicates or query nodes, and

company

org

department

"R&D"

manager belongs_to

"Michael"

manager

orgname

"head office""David"

manager

department

"HR"

org

"Lucy"

location

"Tokyo"location

"Tokyo"

name

1

2

3

4

5

6

7

8

9

10 12

13

14

15

11

16

17

18

20

21

22

19

Figure 4: Amoeba join (org, manager, location = “Tokyo”)

adding an explicit disjunction, which is similar to query-ing all structural fluctuations. The proposed amoeba joinmethod contains the essence of query relaxations, but isnovel in that it is also able to handle situations in whichthe high and low nodes of a query tree are reversed.

DogmatiX [10] attempts to solve structural fluctuationsusing nearest neighbor heuristics that connect nodes withinsome metrics. However, the method cannot address all pos-sible nn−1 structural fluctuations.

Approximate join [4] locates documents with similar struc-tures and different forms. It is more general than amoebajoin because it includes changes in tag names. Althoughapproximate join can accommodate various similarity mea-sures, it is optimized to tree edit distances, which must pre-serve ancestor node order in a query; that is, unlike amoebajoin, it cannot reverse ancestor-descendant relationships.

Static typing of XML [2] is another way to handle struc-tural fluctuation. It detects mismatches between paths de-scribed in query statements and schemata. Such discrepan-cies are reported as compile-time (static) errors of the query.This prevents writing invalid queries that do not match anydocument path. In other words, it is not necessary to coverall path possibilities because the query compiler presents theavailable paths. A major drawback of this approach, how-ever, is that it still requires a schema, which is not manda-tory in XML.

3. AMOEBA JOINThe requirement of path structures within query state-

ments is a serious obstacle to using XML DBMSs. Amoebajoin is a method for overcoming this; problem by allowingstructural fluctuations in the underlying XML database, andretrieving data matching the query of interest. This sec-tion presents a definition of amoeba and our novel query-processing method, amoeba join.

DEFINITION 1. [Amoeba Join] Let D1, . . . , Dk be do-mains of XML nodes; then each Di(1 ≤ i ≤ k) repre-sents a node set matching an XPath expression, an amoebajoin AJ(D1, . . . , Dk) gives a set of node tuples (d1, . . . , dk)|d1 ∈ D1, . . . , dk ∈ Dk, such that one of d1, . . . , dk is a com-mon ancestor of the others.

We call a node tuple t = (t1, . . . , tk) an amoeba (or anamoeba tuple) if t ∈ AJ(D1, . . . , Dk). Its common ancestortr(1 ≤ r ≤ k) is called the amoeba root of t. See Figure

4 for an example of amoeba join using a tree representa-tion of an XML document. When D1 represents a nodeset belonging to a path expression ’//org’, D2 = ’//man-

ager’ and D3 = ’//location/text()=”Tokyo”’, respectively, anamoeba join AJ(D1, D2, D3) gives a set of node ID tuples

(2, 5, 13), (2, 8, 13), (21,15, 17) (bold numbers representamoeba roots). This notion of an amoeba can be adaptedto various XML structures. It can capture the manager node(8) even if it is under the belongs to tag (7), and it also trackslocation node (17) behind the department tag (16). Further-more, amoeba join detects hierarchical change of orders be-tween org (21) and manager (15) nodes.

Note that amoeba join is not a process of computing theleast common ancestors (lca) of a given node set. The lcanodes of org, manager and location nodes in Figure 4 includethe root node, company (id = 1). Every node in XML canreach the others through the root. For example, node 13and 15, which are apparently irrelevant, can be connectedvia the root. Therefore, the lca method is not appropriatefor finding relationships between nodes. Amoeba join is sim-ilar to the lca method in that it finds a common ancestor;however, it limits common ancestor nodes to those belong-ing to one of the given query domains. By using this rule,the relationships among nodes are bound to a common an-cestor, i.e., the amoeba root. The tuple (11, 8, 13) in Figure

4 is not an amoeba because its nodes are not bound, whilethe other amoeba tuples are bound by 2 or 15. If there is nosuch bound, as in the lca method, the relationships amongthe connected nodes are very weak.

When the root node of XML happens to be contained inone of the domains, any node tuple becomes an amoeba. Ingeneral, such a query is no use. If the root node is requiredin order to specify the context of queries, ameba join withcontext nodes AJC(context nodes, D1, . . . , Dk) is preferable.It restricts the search region of the query under the specifiedcontext nodes.

The result of AJ(org, manager, location = ”Tokyo”) in Fig-

ure 4 has node overlaps in (2, 5, 13) and (2, 8, 13). Theydiffer in manager nodes 5 and 8. Intuitively, the tuple of(2, 5, 13) is preferable as a query result. However, if the userwants to list all of the managers at the Tokyo location, thepair (2, 8, 13) is also meaningful. Therefore, filtering the re-sults should be left to the user. If the user wants nodes thatare close to each other in the tree structure of XML, thenearest-neighbor measure method can be used to filter theresults. This filtering encounters problems in detecting thesemantics of the XML structure, but this issue should bediscussed separately, with the handling of structural fluctu-ation. Amoeba join is a query-processing technique to beemployed before applying such semantic or heuristic filters.

Amoeba Join SyntaxHere, we introduce syntax of amoeba join expressions:

S := “(” E (, E)* “)”E := F | $variable := F | SF := XPath-expr (P <value>)? | <value>P := ⇒ | = | 6= | < | > | ≤ | ≥

To make expressive queries for extracting valuable informa-tion, we extended amoeba join to incorporate XPath expres-sions. For example, an amoeba join expression (org, manager)collects all pairs (amoebas) of org and manager elements thathave ancestor-descendant relationships. Amoeba join canalso be used with explicit path queries. (/org, ”David”) com-putes amoebas that have org nodes directly under the rootnode, and text nodes with a value of David. To express anode x such that a text y occurs in the subtree rooted at x,we provide the statement x ⇒ y. For example, the expres-

sion manager⇒ ”David” designates the manager tag containingthe text node ”David”, whether it is a child or descendantnode of the manager node.

We also offer another operation that allows nodes to bebound to variables. For example, ($x = org, $x/manager, lo-

cation) joins org nodes and its child manager nodes to location

nodes.

4. AMOEBA JOIN PROCESSINGAmoeba join processing locates node tuples composing

amoebas, from given node domains (D1, . . . , Dk). There-fore, amoeba join processing is independent of node retrievalfrom databases. This independence is important because itenables amoeba join to be incorporated into other existingquery-processing techniques.

In the algorithm descriptions that follow, we assume thatevery XML node is labeled with an interval (start, end) [6].A pair of two arbitrary intervals is disjoint; one subsumesthe other as a subrange. By encoding XML tree structure hi-erarchy in the form of an interval tree, detecting of ancestor-descendant relationships between two nodes becomes a con-tainment test of two intervals, i.e. a node vi is an ancestorof another node vj iff vi.start < vj .start ∧ vj .end < vi.end.

First, we describe a process to determine whether a givennode tuple is an amoeba. The function isAmoeba(t) receives anode tuple t = (t1, . . . , tk), and returns true if it finds a nodeinterval in t with the smallest start value that completelycontains the other intervals. Such an interval is the commonancestor of the others; i.e., this node tuple constructs anamoeba.

Brute-force Amoeba JoinWith the decision function isAmoeba(t), we can write a simplebrute-force amoeba join processing algorithm (Algorithm

1). This brute-force version computes all permutations ofthe input sets, but is apparently inefficient.

Algorithm 1 Brute-force Amoeba Join Algorithm

Input: Node sets D = (D1, . . . , Dk)Output: A set of amoebas R1: R ⇐ nil2: for all node tuple t in the permutation of D do3: if isAmoeba(t) then4: push t into R5: end if6: end for7: return R

Two more efficient amoeba join algorithms are detailedbelow. The sweep algorithm improves the brute-force algo-rithm by sequentially sweeping the input node sets. Thequicker algorithm reduces disk I/Os by localizing search re-gions.

Sweep Algorithm of Amoeba JoinBy sorting the input node sets in advance and in the or-der of their start values, it becomes more efficient to findamoebas because the amoeba root of an amoeba (t1, . . . , tk)always has the smallest start value in t1, . . . , tk. The sweepalgorithm (Algorithm 2) searches amoeba root nodes bysweeping the sorted input node sets.

In Step 7 of Algorithm 2, a node ts with the the smallestvalue in the input sets is assumed to be an amoeba root.Because no other element in the input sets has a smaller

start value than ts, scanning the range of (ts.start, ts.end)in Dj(1 ≤ j ≤ k, j 6= s) is sufficient to find all descendantnodes of ts (Step 10). Then using these descendant nodesand ts, we can enumerate all amoeba tuples rooted by ts

(Step 17). When the algorithm reaches Step 14, it is assuredthat all amoebas whose root’s start value is smaller than orequal to the current amoeba root candidate ts are found.

Algorithm 2 Sweep Amoeba Join Algorithm

Input: Sorted node sets D = (D1, . . . , Dk)Output: R: a set of amoebas1: R ⇐ nil.2: while true do3: if some of D1, . . . , Dk is empty then4: return R // no more amoeba tuples5: end if6: create a node tuple t = (t1, . . . , tk) from

(D1.front, . . . Dk.front)7: Let s be the smallest start node index in t, then ts is the

smallest node in D1, . . . , Dk

8: if isAmoeba(t) then9: // s is the amoeba root node index in t10: By searching the range of (ts.start, ts.end) in each Dj(1 ≤

j ≤ k, j 6= s), collect descendant nodes of ts, then constructa set of these nodes Aj .

11: As = ts // contains only the current amoeba root12: If every Aj(1 ≤ j ≤ k) is not empty, all permutations of

(A1, . . . , Ak) construct amoeba tuples, so insert them intoR.

13: end if14: remove ts from Ds // all amoebas rooted by ts are found15: end while

Heuristics for Search Space ReductionHere, we introduce the quicker algorithm, a more elaborateversion of amoeba join, which is integrated with index look-ups. While the sweep algorithm reads all nodes in the givenquery domains from the database, the quicker algorithm(Algorithm 3) tries to reduce this disk I/Os.

For a node tuple to be an amoeba, each node in the tu-ple must be a descendant of the amoeba root node. Whena node v is considered a part of an amoeba, its amoebaroot is either v or one of its ancestor nodes. Figure 5 illus-trates this idea of localizing database scans within the de-scendant area of an amoeba root node candidate. Given apivot node, which is considered a component of an amoeba,the quicker algorithm in Step 5 finds its ancestor nodes,i.e., amoeba root candidates, then searches the descendantarea for other components of amoeba tuples. The quickeralgorithm chooses pivot nodes from the smallest domain,namely Di, because the smaller the cardinality |Di|, thefewer amoeba root candidates and their descendant nodes(components of amoebas).

For this purpose, we use the frequency count (or its esti-mation) of nodes belonging to each of the query domains.Given domains of an amoeba join query (D1, . . . , Dk), letE = (e1, . . . , ek) be frequency of D1, . . . , Dk. When thevalue of |Di| is available, ei = |Di|, if not, ei = ∞. A func-tion f sorts ei so that ef(1) ≤ · · · ≤ ef(k). Quicker algorithmchooses pivot nodes from Df(1) (Step 4).

The quicker algorithm (Algorithm 3) utilizes three typesof index scan; for retrieving nodes in Df(1), which is thesmallest domain (Step 2); for retrieving ancestor nodes ofa pivot node (Step 5); and for scanning descendant nodesof an amoeba root candidate (Step 11). A database indexthat supports these three types of index scans is required toperform the quicker algorithm.

pivot

search space

amoeba root amoebaXML root

( , , )

pivot

search space

Figure 5: A small number of pivot nodes helps toreduce index scan ranges.

This type of search space reduction (Figure 5) is not avail-able in the lca method, because a lca node tends to be theroot of XML; it does not reduce the search space at all.Another reason that makes this optimization possible is thedesign concept of amoeba join, which tries to find commonancestor nodes from specific domains, while the approach ofthe lca or slca [11, 7] is to find common ancestors from theentire nodes in an XML document.

Disk I/O PerformanceWhen the height of an XML tree is h, the number of nodesretrieved by one ancestor query is at most h. The quickeralgorithm retrieves ancestor nodes for each node in |Df(1)|,and thus it requires h|Df(1)| node retrievals from the database.Another factor that defines the disk I/O performance of thequicker algorithm is how many node retrievals the heuristicof Figure 5 saves. Let D′

i be a subdomain of Di, which is re-trieved by the quicker algorithm; then, the number of nodesscanned in the quicker algorithm is h|Df(1)| + |D′

1| + · · · +|D′

k|. However, the sweep algorithm consumes all nodes inthe query domains; i.e., it searches |D1|+ · · ·+ |Dk| nodes.When |Df(1)| is sufficiently small, as in the example shownin Figure 5, |D′

i| is typically considerably smaller than |Di|.In addition, the height of the XML, h, is generally limited;only rarely is h larger than 100. Consequently, the quickeralgorithm is often less costly in terms of disk I/O than thesweep algorithm.

This search space reduction is similar to pushing selection,a query optimization technique for relational databases. XMLtypically contains many repeat paths, and therefore, reduc-ing the size of query domains by attaching conditions, suchas predicates on text values, to the path expression queriesis a common method. Hence, the quicker algorithm utilizesa simple optimization to reduce disk I/Os.

5. EXPERIMENTAL RESULTSWe measured the performance of three amoeba join al-

gorithms, brute-force (BF), sweep(SW), and quicker algo-rithm(QK). The first two algorithms can incorporate vari-ous indexing techniques, so we compared them using sequen-tial scans (S) of XML nodes, and more efficient index-basedscans (I). This let to five types of amoeba join algorithms:BF/S (brute-force with sequential scan), BF/I (brute-forcewith index scan), SW/S (sweep join processing with sequen-tial scan), SW/I (sweep join processing with index scan),and QK (quicker algorithm), which is a mixture of indexscanning and join processing.

ImplementationWe implemented our amoeba join algorithms in C++ us-ing B+-trees provided by the BerkeleyDB library [9]. We

Algorithm 3 Quicker Amoeba Join AlgorithmInput: Query domains D1, . . . , Dk and sorting function fOutput: A set of amoebas, R1: Initialize priority queues (sorted by start order) Qi ⇐ empty (i =

1, . . . , k)2: fill the Qf(1) with nodes in Df(1) by fetching from the database

(index scan)3: for i = 1 . . . |Df(1)| do

4: pivot = Qf(1).top

5: query pivot’s ancestor nodes (index scan), then push them intocorresponding Qp(p 6= f(1)).

6: repeat7: s = the smallest start node index in Q1.top, . . . , Qk.top.8: ts = Qs.top // an amoeba root candidate9: pop all entries q ahead of the ts, i.e. ∀q ∈ Qi, q.start <

ts.start10: for j = f(1) . . . f(k) do11: push unread descendant nodes of ts in Dj into Qj . (index

scan)12: goto Step 18 if Qj is empty (ts cannot be an amoeba

root)13: end for14: // all of the Qp(p 6= f(1)) is not empty15: By searching the range of (ts.start, ts.end) in each Qj(1 ≤

j ≤ k, j 6= s), collect descendant nodes of ts, then constructa set of these nodes Aj .

16: As = ts // contains only the current amoeba root candidate17: If every Aj(1 ≤ j ≤ k) is not empty, all permutations of

(A1, . . . , Ak) construct amoeba tuples, so insert them intoR.

18: pop Qs // all amoebas rooted by ts is computed19: until s == f(1) // exit when the pivot node is popped20: end for21: return R

labeled each XML node with (start, end, level, path ID, parent

ID, text). The pair (start, end) is an interval representationof XML nodes [6]. The start value can be used as a uniquenode ID, so parent ID is the start value of a parent node. Thelevel is the depth of a node in the XML tree. The path ID

represents an ID assigned to each independent path. Thetext is a text content encapsulated by tags or attributes.

XML nodes are stored in a B+-tree in ascending order oftheir start values. The sequential scan method (S) reads thestored nodes in this order. The parent node retrieval in thequicker algorithm (QK) also utilizes this B+-tree index. Asfor the index-base scan methods (SW/I and BF/I) and thequicker algorithm (QK), to make node retrieval faster, wegenerated a secondary B+-tree index using a compound key(path ID, start), which aligns XML nodes first in the order ofpath IDs, then that of start values. This secondary index isuseful for finding descendant nodes that belong to specificpaths. In addition, we constructed an inverted index fortext values (text ⇒ start) that looks up the start value (ID)of a node from its text value.

Because the sequential scan method reads the entire list ofnodes to perform a query, it is somewhat analogous to nodestream processing, such as in handling SAX events. Anotherreason to compare the index-based scan methods to the se-quential scan methods is required to assure that the former,using secondary indexes, is not too complex to invoke a lotof random disk access. Too much random access may makequery-processing algorithms slower than a sequential scan ofall records.

The quicker algorithm (QK), used rough estimates of nodefrequencies; if Di, a domain of an amoeba join query has atext predicate, we assume |Di| = 1 or otherwise |Di| = ∞,because the response size of a keyword search is usually lessthan that of a path query. Although a more accurate es-

text

keyword

emph bold

QK SW/I SW/S BF/I BF/S QK SW/I SW/S BF/I BF/S QK SW/I SW/S BF/I BF/S

Q1 2.71 0.39 5.47 > 8d > 8d 22.91 1.97 30.81 > 3y > 3y 62.20 4.17 69.09 > 24y > 24 y

Q2 0.06 0.32 5.57 106.75 115.94 0.05 1.20 29.34 > 0.5h > 0.5h 0.06 2.67 67.12 > 11h > 11h

Q3 0.05 0.11 5.43 20.02 26.42 0.07 3.97 29.41 > 0.1h > 0.1h 0.06 8.95 66.02 > 0.5h > 0.5h

Q4 0.06 0.41 7.98 > 30y > 30y 0.05 10.96 43.41 > 162c > 162c 0.07 22.12 90.95 > 2631c > 2631c

Q1 : (emph, bold, keyword) Q2 : (emph, bold, keyword=>"aboard notes")

Q3 : (item, @id="item100", description) Q4 : (item, @id="item100", description, location, text)

XMark (factor = 0.1, 12M) XMark (factor = 0.5, 57M) XMark (factor = 1.0, 114M)

h : hours (= 3600 sec), d : days (= 24h), y: years (= 365d), c: centuries (= 100y)

Figure 6: Structural fluctuation in XMark (left). Amoeba Join Performance (sec.) (right).

timation strategy could be accommodated, this is sufficientfor locating one of the small domains.

Data SetsIt is difficult to manipulate XML documents with struc-tural fluctuations using current XML technology. As a re-sult, XML document structure is currently rather simpleand monotonous in order to facilitate processing with SAX,DOM or other APIs. Therefore, we could not present a realworld example of fully fluctuated XML data. Such an ex-ample will be possible when XML databases are widespread.Instead, we used a section of XMark benchmark [8], whichcontains a lot of structural fluctuations under its text tags.Figure 6 shows a part of its DataGuide [3], a summary ofpath structure. The cycles in the DataGuide show that threetags keyword, emph, and bold occur in arbitrary order withinthe document.

We prepared three types of XMark document, varyingthe scaling factors (f = 0.1, 0.5 and 1.0). Their structureswere too enormous and too complex to determine the pathstructures for a specific context, showing that amoeba joinis also useful for querying such complicated XML data.

Amoeba Join PerformanceFigure 6 shows the performance of the amoeba join queries(Q1 to Q4). In the brute-force algorithms BF/I and BF/S,some the computational complexity was too huge to com-pute the result; thus, we show their estimation time, whichwas calculated using the permutation size of a query andthe elapsed time for processing its first 500,000 nodes.

In Q1, the quicker algorithm was slower than the sweepalgorithm (SW/I) because the sizes of emph, bold, and key-

word were fairly large. As a consequence, excessive ances-tor node retrievals in the quicker algorithm deteriorated itsperformance. When a query contains predicates (Q2, Q3,and Q4), the quicker algorithm performs an order of magni-tude faster than the others because the size of the domainconstrained by a constant gets smaller. Therefore, a com-bination of QK and SW/I algorithms provides the fastestperformance; when there is no low-frequency domain in aquery, it uses the SW/I, and otherwise it uses the QK.

The performance of SW/S scaled according to the databasesize. Although the time requried to scan the entire databasewas the same from Q1 to Q4, the processing of Q4 in SW/Swas the slowest; because the tuple size k of a query affectsthe join performance. The same is true for SW/I. However,the performance of the quicker algorithm was stable regard-less of tuple size.

6. CONCLUSION & FUTURE WORK

Managing structural fluctuations in XML is a challengebecause the hierarchy of XML does not always have a signif-icant meaning. Amoeba join is a method for querying XMLdata with various structures without using explicit path ex-pressions. Among the presented amoeba join algorithms,the quicker algorithm performed well, and it is scalable tothe size of an XML document.

There are several interesting problems that we did notaddress in this paper. One of them is optional or multipleappearances of nodes within an amoeba. Nodes not includedin an amoeba require another amoeba join algorithm. Elim-inating duplicate node appearances in an amoeba join resultis also an interesting problem to be addressed in the future.This issue is somewhat similar to the operation of the ’dis-tinct’ keyword in XQuery and SQL, although the semanticsof XML structure might be required to reflect the intentionof the user on in query results.

In addition, nested amoeba join should be supported. Forexample,(manager, (org, department ⇒ ”R&D”)) first computesan amoeba set AJ(org, department ⇒ ”R&D”), then for eachamoeba (vi, vj) ∈ AJ repeats a process of the amoeba join(org, vi, vj). Due to limited space, we cannot mention in thismanuscript, its further details will be reported elsewhere.

7. REFERENCES[1] S. Amer-Yahia, L. V. Lakshmanan, and S. Pandit.

FleXPath: Flexible structure and full-text querying forXML. In proc. of SIGMOD, 2004.

[2] D. Chamberlin, D. Draper, M. Fernandez, M. Kay,J. Robie, M. Rys, J. Simeon, J. Tivy, and P. Wadler.XQuery from the Experts. Addison Wesley, 2004.

[3] R. Goldman and J. Widom. DataGuides: Enabling queryformulation and optimization in semistructured databases.In proc. of VLDB, 1997.

[4] S. Guha, H. V. Jagadish, N. Koudas, D. Srivastava, andT. Yu. Approximate XML joins. In proc. of SIGMOD, 2002.

[5] L. Guo, F. Shao, C. Botev, and J. Shanmugasundaram.XRANK: Ranked keyword search over XML documents. Inproc. of SIGMOD, 2003.

[6] Q. Li and B. Moon. Indexing and querying XML data forregular path expressions. In proc. of VLDB, 2001.

[7] Y. Li, C. Yu, and H. V. Jagadish. Schema-free XQuery. Inproc. of VLDB, 2004.

[8] A. Schmidt, F. Waas, M. Kersten, M. J. Carey,I. manolesch, and R. Busse. XMark: A benchmark for XMLdata management. In proc. of VLDB, 2002.

[9] Sleepycat Software. BerkeleyDB. available athttp://www.sleepycat.com/.

[10] M. Weis and F. Naumann. DogmatiX tracks downduplicates in XML. In proc. of SIGMOD, 2005.

[11] Y. Xu and Y. Papaconstantinou. Efficient keyword searchfor smallest LCAs in XML databases. In proc. of SIGMOD,2005.

Replication-Aware Query Processing in Large-ScaleDistributed Information Systems ∗

Jie XuDepartment of Computer Science

University of Pittsburgh

[email protected]

Alexandros LabrinidisDepartment of Computer Science

University of Pittsburgh

[email protected]

ABSTRACTIn this work, we address the problem of replica selection in dis-tributed query processing over the Web, in the presence of userpreferences for Quality of Service and Quality of Data. In particu-lar, we propose RAQP, which stands for Replication-Aware QueryProcessing. RAQP uses an initial statically-optimized logical plan,and then selects the execution site for each operator and also se-lects which replica to use, thus converting the logical planto an ex-ecutable plan. Unlike prior work, we do not perform an exhaustivesearch for the second phase, which allows RAQP to scale signifi-cantly better. Extensive experiments show that our scheme can pro-vide improvements in both query response time and overall qualityof QoS and QoD as compared to random site allocation with itera-tive improvement.

1. INTRODUCTIONThe Web has become the de-facto user interface and intercon-

nection platform of modern life. Almost all collaborative,data-intensive applications are built for the Web or face obscurity. Manydata-intensive applications are fueled by data from the physicalworld, thanks to the proliferation of (wireless) sensor technologieswhich are giving an unprecedented level of access and interactionwith the real world.

In our Secure-CITI project (http://www.cs.pitt.edu/s-citi/), weenvision a Web-based platform to be used to coordinate humanresponse to disaster management. There is a pre-disaster compo-nent where different types of sensors are deployed in a networkedfashion and are used to detect disasters (e.g., gas and waterusagesuddenly increase dramatically which could indicate a landslide inthat area). There is also a critical component during the emer-gency, where in addition to sensor information, the same systemis expected to be used to provide additional information (e.g., byproviding real-time information about the capacity of areahospi-tals) and to coordinate human response (e.g., by identifying what isneeded to perform a particular task and dynamically formingteamswith the appropriate expertise to respond to it [2]). In sucha sce-

∗Funded in part by NSF ITR Medium Award (ANI 0325353).


nario, many heterogeneous systems are glued together, facilitatingthe discovery and flow of critical information as a response to userrequests.

To improve reliability, expedite data discovery, and increase per-formance, replication is expected to play a major role in large-scaledistributed information systems, like the one we are exploring forSecure-CITI. By replicating information across multiple sites, cru-cial information can be accessed even in cases of disconnection orfailure, which is common during disaster response (and mostenvi-ronments that are exposed to nature). Replication also allows forlooser synchronization across multiple sites, which is necessary ifthe system spans different administrative or jurisdictional domains,which is typical in disaster management. In addition to dataavail-ability, replication allows for easier discovery of information, es-pecially when catalogs are not present or not well maintained (i.e.,the equivalent of unstructured overlay networks). Finally, replica-tion is expected to drastically improve the overall performance ofthe system by reducing communication latency when requestsareserved locally or from close-by nodes. As such, we expect themost“valuable” data to be highly replicated across the entire system.

Although replication increases data availability and improves per-formance (i.e., Quality of Service, or QoS), it may have a detrimen-tal effect to the Quality of the Data (QoD) that are being returnedto the users. Getting results fast is crucial of course, but usually alimit to the degree of “staleness” is needed to make the results use-ful. Approaches for measuring QoD are traditionally grouped intothree categories:time-based(where the time of last update is used),divergence-based(where the difference in value is used), andlag-based(where the number of unapplied updates is used) [9]. Weconcentrate on time-based measures of QoD, because we believethem to be the most general and the best fit for our case.

In this paper, we advocate going beyond simply measuring QoSand QoD. We introduceQuality Contracts(QC) as a novel wayof specifying user preferences (with respect to QoS and QoD)andevaluating the system’s adherence to them. The QC frameworkuti-lizes a market-based mechanism, which has been used in the past tosolve resource allocation problems in distributed systems[11]. Assuch, it provides a natural and integrated way to guide the systemtowards efficient decisions that increase the overall user satisfac-tion. The QC framework also enables users to describe the relativeimportance of different queries and also the relative importance ofthe different quality metrics (e.g., preference for fast answers thatare slightly stale). This results in ”socially” optimal solutions forthe entire system.

Using the QC framework, we propose aReplication-Aware QueryProcessing(RAQP) scheme that optimizes query execution plansfor distributed queries with Quality Contracts, in the presence ofmultiple replicas for each data source. Our scheme follows the

classic two-step query optimization [12, 7, 8]: we start from astatically-optimized logical execution plan and then apply a greedyalgorithm to select an execution site for each operator and alsowhich replica to use. The overall optimization goal is expressed interms of ”profit” under the QC framework (i.e., the approach bal-ances the trade-off between QoS and QoD), and as a special case,in terms of the traditional response time metric.

We provide the assumed system architecture and the QC frame-work in the next section. Section 3 contains the details of our RAQPscheme. We present extensive experimental results in Section 4.Section 5 describes related work. We conclude in Section 6.

2. SYSTEM OVERVIEW

2.1 System ArchitectureWe envision a large-scale distributed information system,where

heterogeneity in all aspects is the norm. Such a system is expectedto bring together (1) a myriad ofreceptors, that are sensing the en-vironment (e.g., RFID readers or sensors), contributing a tsunamiof information, (2) a high number ofcore nodesthat are providinga stable communications, storage, and query processing substrate,and, a plethora ofend-user access devices(e.g., mobile PDAs orlight-weight desktop machines) that are enabling their users to col-laborate, contribute data and knowledge and cooperativelyworktowards a common goal (e.g., disaster response).

We assume that each data item is ”owned” by a specific node,but there is rampant replication in the system for availability andperformance reasons. To facilitate data discovery, we build routingindexes[4] to direct the queries towards the nodes that are expectedto hold relevant data. Each node maintains a local index, summa-rizing its local content. Core nodes maintain second-levelindexes,that summarize index information from all nodes that connect tothem, akin to a hybrid, unstructured P2P overlay network (e.g.,Gnutella2). Core nodes can also exchange information amongthem,building merged indexes to summarize information on other reach-able core nodes within a predefined horizon.

Query processing in our system is performed at core nodes. Anend-user will send out a query message request along with a QualityContract (QC) and a Time-To-Live (TTL) that controls the max-imum number of hops the query could be routed in the network.After receiving a query message, a core node first checks its localindex (in case it can answer the query), and then checks its mergedindex (in case a neighbor code node can answer the query). If thereis a match (while abiding with the specified QC) an acknowledg-ment is sent back to the originator node which builds a query planalong with all the possible options for replicas. If there isno match,the query message is propagated further, until the TTL is reached.

2.2 Quality ContractsIn this work, we introduceQuality Contracts(QC) as a novel

way of specifying user preferences (with respect to QoS and QoD)and evaluating the system’s adherence to them. The QC frameworkutilizes a market-based mechanism, which has been used in the pastto solve resource allocation problems in distributed systems [11].In our framework, users are allocated virtual money, which they“spend” to execute their queries according to their specifications forQoS (i.e., how fast they should get their results) and QoD (i.e., howfresh the results should be). Query servers (core nodes) executequeries and get virtual money in return for their services. The QCframework is general enough to also allow forrefunds: servers payvirtual money back to the users if their queries were answered inan unsatisfying manner. In order to execute a query, both theuserand the server must agree on a Quality Contract (QC). The QC

Worth

to user

$0

$75

$-75

Response

Time (min)

2010

(a) QoS graph

Worth

to user

$0

$25

$-25 Staleness

Degree (min)

2010

(b) QoD graph

Figure 1: QC example: combination of QoS and QoD require-ments for one ad-hoc query

essentially specifies how much money the user is willing to pay toget their queries executed (according to their specifications for QoSand QoD). The amount of money paid to the server depends on howwell the query is fulfilled, based on the user’s preferences.

We model Quality Contracts as a collection of graphs. Eachgraph represents a QoS/QoD requirement from the user. The X-axis corresponds to an attribute that the users want to use inorderto measure the quality of the results, for example, responsetime.The Y-axis corresponds to the virtual money the user is willing topay the server in order to execute his/her query. The QC graphissimply a function that maps quality (i.e., the user’s specification) tovirtual money (i.e., the server’s reward). More than one graphs canbe combined in a single QC, allowing for a user to consider mul-tiple dimensions of quality of results, for example both QoSandQoD. The benefit of such approach is that users can easily specifythe relative importance of the different quality metrics. For exam-ple, 75% of the “budget” of a given query can be allocated to QoSwith the remaining 25% given to QoD; in this scenario, the systemwill give more attention to QoS instead of QoD (which will also beconsidered though). A server is expected to get ”paid” equally tothe sum of all the virtual money from the different QC graphs.

In this work, we use linear functions for QCs; this can easilybeextended to any monotonically decreasing function. Each QCiscomposed of two quality metrics: response time (QoS) and stale-ness degree, measured as time elapsed since the last update (QoD).An example of two such QC graphs is given in Figure 1a and 1b. Inthis example, the user has allocated $75 for optimal QoS, and$25for optimal QoD.

2.3 Query Execution PlanAs in previous work [5], we assume that statistics regardingthe

cardinalities of the relations and the selectivities of theoperatorsare available to the optimizer. We also assume that we can estimatethe transmission cost between nodes and that it is relatively stable,so that the query optimizer can approximate the transmission costincurred in transferring data. To estimate the latency between twoarbitrary Internet end hosts, we utilize approaches from the net-working community such as IDMaps[6] .

In this work, we extend traditional query execution plans witha transmission operatorwhich will allow us to incorporate trans-mission costs in the query optimization process in a naturalandintegrated way. Transmission operators simulating transmissioncost are simply incorporated between each pair of processing op-erators in the query execution plan. In this paper, we consideredSelect-Project-Join operators, however the framework andalgo-rithms can be easily extended to handle non-relational operators,e.g., for querying XML documents (XQuery).

0

11

01

Transmission

Operator

Processing

Operator

Candidate

Sites

Mark

3 117

11 64 2 85

5 96

115 64

Figure 2: A query execution plan example

We present a QEP example in Figure 2. Processing operatorsand transmission operators are depicted as circles. Candidate sites(with replicas) are depicted as boxes. To generate an executableQEP, we need to select one site for each processing operator.

3. RAQP ALGORITHMOur Replication-Aware Query Processing (RAQP) scheme fol-

lows the traditional two-step query optimization scheme[8] whichhas been extensively used in distributed database systems (e.g.,XPRS[7] and Mariposa[12]). In the first phase, RAQP runs as ifin a centralized system, generating the best query plan froma staticpoint of view. In the second phase, RAQP dynamically choosesa replica for each data item and the execution site for each queryoperator, thus creating the final execution plan.

3.1 First PhaseIn the first phase, we adopt the classic dynamic programming al-

gorithm which has been proposed in system R[10]. However, wehave lifted the constraint of left-deep trees from the system R opti-mizer and we are also looking at bushy trees, which is more appro-priate in distributed systems, since they increase the parallelism.

3.2 Second PhaseIn the second phase, we have two tasks to perform: replica selec-

tion (at the leaf nodes) and execution site selection (for the process-ing operators). We take the static query execution tree created fromstatic planning phase as input and generate the physical executionplan as output.

Most previous work, assumed either that each data item has onlyone copy in the system, or that one replica has been selected prior toquery optimization[8]. In most cases, theReplica Selection Prob-lem(RSP) has been neglected. We expect high degree of replicationin the types of systems that we are interested in (e.g., for disastermanagement), and as such, RSP is important. We define RSP for-mally as follows:

DEFINITION 1. Assume a set of data items, D =1, 2,. . . , m,each of which has a set of replicas, S =R1

i , R2i ,. . . ,Rn

i , wherei ∈ D. Each replicaRj

i is a tuple〈pji , s

ji , q

ji 〉 wherep

ji is a price,

sji is the site replicaRj

i resides andqji is the QoD of the replica.

The Replica Selection Problem (RSP) attempts to label each replicaas winning or losing, so as to maximize the processing node’srev-enue under the constraint that exactly one replica of each data item

M ARK-OPT-DIR()1 for each node i2 do if i is a leaf3 then LP (i) = Input Size i4 else LP (i) = estimated local processing i5 PostOrderT raverse6 //TR(i, j) : transmission amount from i to j;7 for each arc i → j8 do Wi,j = (1 − α) × LP (i) + α × TR(i, j)9 for each node j

10 do if j is a leaf11 then Wj = 012 else i = Lchild(j); k = Rchild(j)13 if (Wi,j + Wi) ≥ (Wk,j + Wk)14 then Wj = (Wi,j + Wi); M (j) = 015 else Wj = (Wk,j + Wk); M (j) = 116 return

Figure 3: Algorithm MARK-OPT-DIR

needs to be selected to form the best allocation.

Max(B(alloc) −X

s∈S

pjiXs)

s.t. ∀s ∈ S, Xs ∈ 0, 1 and ∀i ∈ D,X

S|i∈D

Xs = 1

where B(alloc) is the sum of “profit” the processing node getsforfulfilling the quality contract.

Regarding site selection, previous work simply used exhaustivesearch to find the optimal allocation, since only one plan (which isselected by the first phase) will be explored. However, when wecombine replica selection with site selection, exhaustivesearch isprohibitive for this NP-hard problem.

In our algorithm, we treat RSP and site selection in an integratedway. Our goal is to choose one site for each node in QEP, so thatthefinal allocation is the best one in terms of either simply minimizingresponse time or maximizing the processing node’s profit whichis measured by total benefit from QCs minus the money “spent”on acquiring data. Specifically, we take query response timeasQoS and aggregated staleness for QoD. We aggregate staleness byselecting either the highest staleness over the top data items or theaverage staleness of the top-k data items. Currently we assumethat all replicas of a given data item have the same price and thesame size. It is outside the scope of this work to determine how toeffectively select such prices.

3.2.1 Initial Query AllocationGiven a statically optimized query plan, we first determine the al-

location ordering. At each processing node, we set a flag to indicateour estimation of which subtree is more likely to be the bottleneck.Next, we will allocate that subtree first.

Formally, we traverse the tree in post order, for each arci → j,assign weight asα times the transmission cost between i and j plus(1 − α) times the local processing cost at node i. For each inter-mediate node, we calculate the aggregate working load from eachsubtree and mark M(i) to indicate the most expensive one. If itcomes from left, M(i) = 0, otherwise M(i) = 1. This will be our op-timization direction for future use.α ∈ (0, 1) is a dynamic tuningfactor to balance the transmission and processing cost. We use .5in our experiment (i.e., giving equal importance). We provide thepseudo code for this MARK-OPT-DIR algorithm in Figure 3.

After the Query Execution Plan(QEP) is fully labelled, we startallocating in a bottom-up fashion. We traverse the estimated bot-

TRI-ALLOC()1 j = Lchild(i);2 k = Rchild(i)3 //BW (i, j) : bandwidth between i and j;4 α = 1/avg(C(i)), i ∈ Sj

S

Sk

5 β = 1/avg(BW (i, j)), i, j ∈ Sj

S

Sk and i 6= j6 Ttr = β × max(TR(j, i), TR(k, i))7 Tlp = α × min(LP (i), LP (j))8 switch9 caseTtr/Tlp ≥ θ : //Bandwidth− Bound

10 if Sj

T

Sk 6= ∅11 then s = MC(Sj

S

Sk)12 L(i) = L(j) = L(k) = s13 else BW (m,n) = max(BW (x,y))14 x ∈ Sj , y ∈ Sk and x 6= y15 L(j) = m; L(k) = n16 if TR(j, i) > TR(k, i)17 then L(i) = m18 else L(i) = n19 caseTtr/Tlp < θ : //CPU − Bound20 if LP (j) ≥ LP (k)21 then C(p) = max(C(s)), s ∈ Sj ; L(j) = p22 C(q) = max(C(s)), s ∈ Sk − p; L(k) = q23 else C(q) = max(C(s)), s ∈ Sk ; L(k) = q24 C(p) = max(C(s)), s ∈ Sj − q; L(j) = p25 C(r) = max(C(s)), s ∈ Sst i − p − q; L(i) = r26 return

Figure 4: Algorithm TRI-ALLOC

tleneck path by starting from the root, go to left if M(i)=0, andgo to right child otherwise, until we find the leaf (as depicted inFigure 2). We consider one triangle including three nodes asoneallocation unit, and we start from the leaf, then its siblingand thentheir parent. Next time will be their parent, their parent’ssiblingand their grand-parent, and so forth. In each triangle, bothchildrenneed to be either a leaf or both its subtrees to have been allocated,otherwise we allocate the subtrees first.

For each triangle, two allocation algorithms are applied. We callthe first oneRAQP-L which exhaustively explores the local searchspace to decide the optimal one. Alternatives include sending bothrelations to a third node, processing there; sending the smaller re-lation to the larger one, processing there, and assigning all threenodes on the same site, so that local processing is the only cost.

Another algorithm is a greedy one which we callRAQP-G. Foreach triangle, we calculate a score to estimate if this subqueryis CPU-bound, i.e., local processing is the bottleneck, orband-width-bound, i.e, transmission cost dominates response time. Ifit is bandwidth-bound, we try to get rid of transmission by pro-cessing operators in groups. If the candidate sets of two childrenhave an intersection, we allocate all three nodes on the samesite.If there is no intersection, we allocate the parent to the same siteas the slowest child in the transmission. Whenever more thanonesite satisfies the condition, we choose the site which coversmostreferenced data as the winner (since it has more of a chance tobeimproved in a later allocation), and use the QoD as the secondtiebreaker. If it is CPU-bound, we try to spread jobs to different sitesso that the computation could be done in parallel. The two childrenchoose the most powerful site from their own candidate set underthe constraint that they are allocated in different ones. The parentnode is allocated to the most powerful site in the union of itssub-tree’s candidate set under the constraint that it’s different from bothits children.

We give the pseudo code of this greedy algorithm Tri-Alloc inFigure 4.Si is the candidate set of node i, C(t) is the CPU capacity

of site t, MC(S) is the most covered site in set S and L(i) is therunning site we allocated for node i.

3.2.2 Iterative ImprovementAfter an initial allocation is determined, we iteratively adjust the

bottleneck path/node according to our optimization goal.As a special case, we optimize forresponse time, the traditional

performance metric. We first locate the real bottleneck pathun-der the initial allocation and try to improve the bottlenecknode onthis path. If the most costly node is a processing node, therearetwo cases. If more than one operator is running on the currentlyallocated site, we offload the heaviest job to the most light-loadedsite. If there is only one job running, we move the job to anothermore powerful site. If the most costly node is a transmissionnode,we remove that by merging the two processing nodes into the onewhich has lighter load between the two. After each adjustment,the response time of the new plan is recalculated. The changeisaccepted if it leads to an improvement, and then the process is re-peated. Otherwise, the improvement step stops and the current al-location is returned as final one.

In general, our optimization goal is to maximize theoverall “profit”under the QCs. We need to consider both QoS and QoD in thiscase. Our improvement step is divided into two substeps. First, welocate and improve the bottleneck on response time in the same wayas described above. The modification will be accepted if the totalprofit is improved, otherwise we label this try as failed. Second, welocate the bottleneck replicas in this plan based on their stalenessdegree. Once found, we replace those replicas with other ones thatimprove the overall QoD (thus satisfying more QCs). The processis repeated until neither QoS nor QoD can be further improved.

Simulation Parameter Default ValueCore Node Number 100Edge Node Number 1000Unique Data Source 1000

Unique Data Number Per Data Source U(10, 100)Data Size U(20, 200Mb)

# of Replicas Per Data U(10, 30)Bandwidth between each pair of NodesU(1, 50Mbps)

Table 1: Default System Parameters in Experiments

4. EXPERIMENTAL STUDYWe evaluated our proposed replication-aware query processing

algorithm experimentally by performing an extensive simulationstudy using the following algorithms:

• Exhaustive Search (ES):Explore the whole search spaceexhaustively, thus guaranteeing to find the optimal allocation.

• RAQP-G: Greedy replication-aware initial allocation plus it-erative improvement.

• RAQP-L: Bottleneck breakdown, local exhaustive search plusiterative improvement.

• Rand(k): Random initial allocation plus k steps of iterativeimprovement. In each step, the bottleneck node is identifiedand a random replacement is selected, if our optimizationgoal is improved. We use it as a “sample” of the search space.

We implemented an initial prototype of our distributed system,as described in section 2. The system parameters used in our ex-periments are reported in Table 1.

0

50

100

150

200

250

300

350

400

450

10 20 30 40 50 60

replication density

response time

ES

RAQP-L

RAQP-G

Rand(5)

Rand(1)

Figure 5: Response time under various degree of replication

4.1 Optimizing for Response TimeIn the first set of experiments, our optimization goal is to min-

imize the response time. We evaluate the different optimizationalgorithms under various circumstances. To avoid bias in the re-sults, we repeated each experiment 5 times with different randomseeds, and report the average values.

4.1.1 Effect of Replication DegreeWe artificially varied the number of replicas per data item inthe

system. We optimize queries with 6 joining relations. The qualityof the resulting plans is reported in Figure 5. Clearly, our RAQP al-gorithms greatly outperformed Rand(5), and RAQP-L was a bitbet-ter than RAQP-G. As expected, ES always finds the optimal plan.

An obvious observation is that the quality of the resulting plansis improved as the number of replicas is increased. There aretworeasons. First, as the number of replicas increases, we get morenearby candidate sites (which improves response time). Second,more replicas mean more candidate sites for execution, which in-creases the search space (and improves the overall responsetime).

4.1.2 Optimization OverheadExhaustive search can always find the optimal plan, however,its

running time is prohibitive in large-scale systems. We fixedthenumber of replicas to 20 and looked at the running time of differentalgorithms versus the quality of the resulting plans. We report ourfindings in Table 2(a). Clearly, ES took around 2 days to find theoptimal plan, while our RAQP-G algorithm took around 70 ms.

We also report the optimization time for queries with 3 and 1joins in Table 2(b) and 2(c). We found similar trends in all three

ES RAQP-L RAQP-G Rand(5) Rand(1)resp. time 11.62s 49.23s 58.11s 233.2s 346.5sopt. time 1.9days 53min 70ms 28ms 15mstotal time 1.9days 54min 58.18s 233.2s 346.5s

(a) Optimization time for queries with 6 joins

ES RAQP-L RAQP-G Rand(5) Rand(1)

resp. time 16.57s 20.34s 25.14s 87.08s 94.25sopt. time 9.58min 2.61min 33ms 20ms 11mstotal time 9.86min 2.95min 25.17s 87.1s 94.26s

(b) Optimization time for queries with 3 joins

ES RAQP-L RAQP-G Rand(5) Rand(1)resp. time 10.38s 10.87s 11.28s 20.01s 25.27sopt. time 40ms 23ms 5ms 2ms 1mstotal time 10.42s 10.9s 11.28s 20.01s 25.27s

(c) Optimization time for queries with 1 joins

Table 2: Optimization time for join queries

0

20

40

60

80

100

120

140

100 200 300 400

network size (node number)

response time

ES

RAQP-L

RAQP-G

Rand(5)

Rand(1)

Figure 6: Response time under various network sizes

experiments. We did not increase the number of joins beyond 6as we would not have been able to compare with the exhaustivealgorithm.

4.1.3 Effect of Network SizeIn the second set of experiments, we varied the network size to

observe the changes in optimization quality. In order to show thetrend clearly, we chose 100 to 400 as the network size. The re-sults are reported in Figure 6. A jump, from 100 to 200 makessmall difference in the relative quality change. However, when thenetwork size increased to 300, 400, the random algorithm showedmuch worse performance compared to others. The reason is whenthe system enlarges, our search space becomes sparser. Not sur-prisingly, the blind random search performed much worse than ourinformed heuristic search. For the above sets of experiments, wealso confirmed that our algorithm scaled very well in both runningtime and the optimization quality. RAQP is also relatively stableunder various network size and data loads.

4.2 Optimizing for ProfitIn this set of experiments we tune our algorithms to optimizefor

the overall QC profit instead of simply response time.One of the important features Quality Contracts hold is thatusers

can easily specify the relative importance of each component of theoverall quality by allocating the query budget accordingly. In or-der to observe the algorithm performance under different environ-ments, we classify the users’ quality requirements into 6 classes.We have three values for QoS and QoD: high (75), low (25), same(50) and two types of slope for the QC function: small and large,which produce 6 seperate classes. Each data item had 20 replicas.We report our results in Figure 7.

Since our allocation initialization algorithm was aimed atre-sponse time improvement, QoS got more improvement than QoDin all the cases. Especially when QoS was assigned higher budget,the effect on both QoS and total profit were obvious. When QoDwas assigned higher budget, the relative improvement of QoDalsoincreased compared to the lower budget case.

Our results clearly confirmed the functionality of Quality Con-tracts and our RAQP algorithm. Assigning higher ”budget” toaquality dimension ends in that dimension getting better performanceby our optimization algorithm. The larger the budget difference thelarger the difference in the resulting quality. This is behavior isunique to our algorithm and is an important feature to have whenboth QoS and QoD are of concern to users.

5. RELATED WORKMariposa[12] is the first distributed DBMS to use economic schemes

as the underlying paradigm. Queries are submitted to the Mari-posa system with a bid curve on delay; a broker sends out requests,

0

10

20

30

40

50

60

70

80

90

100

Profit

Rand(1) Rand(5) RAQP-G RAQP-L ES

QoD

QoS

(a) QoS≫ QoD, QCslope:Large

0

10

20

30

40

50

60

70

80

90

100

Profit


QoD

QoS

(b) QoS = QoD, QCslope:Large

0

10

20

30

40

50

60

70

80

90

100

Profit


QoD

QoS

(c) QoS≪QoD, QCslope:Large

0

10

20

30

40

50

60

70

80

90

100

Profit


QoD

QoS

(d) QoS≫ QoD, QCslope:Small

0

10

20

30

40

50

60

70

80

90

100

Profit


QoD

QoS

(e) QoS = QoD, QCslope:Small

0

10

20

30

40

50

60

70

80

90

100

Profit


QoD

QoS

(f) QoS≪ QoD, QCslope:Small

Figure 7: Total profit of the algorithms under different classes

collects bids and decides on the best plan; bidders are discoveredthrough name servers. The main difference from our work is thatwe addressed QoD as another quality measure. Second, we applieddifferent query processing schemes in our system. Third, weavoidthe cost of building and maintaining name servers in our system.

Another system close to ours is Borealis[1], a distributed streamprocessing engine, which inherits stream processing functionalityfrom Aurora[3]. Aurora adopted a utility-function based QoS modelon the processing delay of output tuples. Borealis also usesmulti-dimensional quality metrics which could include response time,quality of data and etc. The difference from our work (QC) is thatBorealis simply measures the overall quality by calculating an ag-gregated value from a global weight function, and the weightforeach quality dimension is fixed. In our work, each user can allocatethese weights and differentiate among queries and quality metrics.

There is a lot of work on distributed query optimization, buttothe best of our knowledge, we are the first to address the problemof replica selection under quality guarantees.

6. CONCLUSIONS AND FUTURE WORKIn this paper, we introduced Quality Contracts (QC) as a unifying

framework to enable users to specify their QoS and QoD require-ments and their relative importance. Additionally, we proposed areplica-aware query processing scheme and demonstrated that itworks fairly well for both optimization goals, response time andtotal profit from multi-dimensional quality requirements.In thispaper, we focused on SPJ queries, however, we believe the pro-posed framework and algorithms can be easily extended to coverXML data and XQuery.

7. REFERENCES[1] D. J. Abadi, Y. Ahmad, M. Balazinska, U. Cetintemel, et al.

“The Design of the Borealis Stream Processing Engine”. In2005 CIDR conference, January 2005.

[2] A. Berfield, P. K. Chrysanthis, and A. Labrinidis.“Automated Service Integration for Crisis Management”. In

First International Workshop on Databases in VirtualOrganizations (DIVO 2004), June 2004. (held in conjunctionwith SIGMOD 2004).

[3] D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee,G. Seidman, M. Stonebraker, N. Tatbul, and S. B. Zdonik.“Monitoring Streams - A New Class of Data ManagementApplications.”. In2002 VLDB conference.

[4] A. Crespo and H. Garcia-Molina. “Routing Indices ForPeer-to-Peer Systems”. In2002 ICDCS conference.

[5] A. Deshpande and J. M. Hellerstein. “Decoupled QueryOptimization for Federated Database Systems”. In2002ICDE conference.

[6] P. Francis, S. Jamin, C. Jin, Y. Jin, D. Raz, Y. Shavitt, andL. Zhang. “IDMaps: a global internet host distanceestimation service”.IEEE/ACM Trans. Netw., 9(5):525–540,2001.

[7] W. Hong and M. Stonebraker. “Optimization of ParallelQuery Execution Plans in XPRS”. InPDIS 1991, December,1991, pages 218–225. IEEE Computer Society.

[8] D. Kossmann. “The state of the art in distributed queryprocessing”.ACM Comput. Surv., 32(4):422–469, 2000.

[9] A. Labrinidis and N. Roussopoulos. “Exploring the tradeoffbetween performance and data freshness in database-drivenWeb servers”.VLDB J., 13(3):240–255, 2004.

[10] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A.Lorie, and T. G. Price. “Access path selection in a relationaldatabase management system”. In1979 SIGMODconference.

[11] J. Shneidman, C. Ng, D. C. Parkes, A. AuYoung, et al. “WhyMarkets Could (But Don’t Currently) Solve ResourceAllocation Problems in Systems”. InHotOS X, June 2005.

[12] M. Stonebraker, P. M. Aoki, W. Litwin, A. Pfeffer, A. Sah,J. Sidell, C. Staelin, and A. Yu. “Mariposa: a wide-areadistributed database system”.The VLDB Journal,5(1):048–063, 1996.

Automatic Tuning of File Descriptors in

P2P File-Sharing Systems Dongmei Jia, Wai Gen Yee, Ophir Frieder

Illinois Institute of Technology Chicago, IL 60616, USA

[email protected], [email protected], [email protected]

ABSTRACT

Peer-to-peer file-sharing systems have poor search performance for rare or poorly described files. These files lack a quality or variety of metadata making them hard to match with queries. A server can alleviate this problem by gathering the descriptors used by peers via what we call probe queries, and use these descriptors to improve its own. We consider probe query triggering mechanisms and criteria for selecting a file for which to probe in this work. Experimental results indicate that probe queries are effective in improving search performance.

Categories and Subject Descriptors

H.3.5 [Information Storage and Retrieval]: Online Information Services – Web-based services.

General Terms

Algorithms, Performance, Experimentation.

Keywords

Metadata management, P2P Search.

1. INTRODUCTION Peer-to-peer (P2P) file-sharing systems like Limewire’s Gnutella [3] are extremely popular with millions of daily users sharing petabytes of data [9]. These systems are highly effective in locating popular files, but less so for rare and poorly described ones [1]. [2] shows that different ranking functions may improve the identification of rare files in query result sets, but does not address the issue of retrieving these files from servers in the first place. Only after a result is returned to a client can it be ranked. Hence, to improve a query’s performance, a good approach is to improve its recall (i.e., its likelihood of matching relevant files).

Improving recall is an important goal of P2P file-sharing systems. Ostensibly, peers publish files because they desire to share them. For example, a new music group may want to publicize its new song or a P2P client may want others to download its files to improve its reputation in the system.

The problem with search in P2P file-sharing systems is that, although multiple replicas of a file may exist, they are maintained

at different servers, and are described with small, user-defined descriptors, which are usually filenames. Therefore, although the system’s aggregate metadata for a file might effectively describe it, it is possible that no single descriptor is sufficient.

This negative consequence of the distributed nature of the description of files in P2P file-sharing systems manifests itself in poor search performance. In practical P2P file-sharing systems, queries are conjunctive: a file matches a query if all query terms are contained in the file’s descriptor [6]. Consider an example where D1, D2 are two descriptors of replicas of the same file, F, from two different servers, and Q is a query. If D1 = t1, D2 = t2, and Q = t1, t2, where t1 and t2 are terms, the evidence is strong that F is relevant to Q. However, because of the (conjunctive) matching criterion, neither D1 nor D2 are returned as Q’s result.

Our work focuses on increasing the richness of the descriptors of shared files by the use of probe queries, which we define as a query meant to search the network for other replicas’ descriptors of a given file. With the results of these queries, the client can discover alternate ways of describing its local replica of the file. For example, the server of D1 in the example above may send a probe for F, yielding D2. The server can then transform D1 into

D1’ = D1∪D2 = t1, t2. With D1’, the server can correctly match its replica of F with Q.

The basic goal of a probe query for a file is to automatically aggregate the available metadata associated with a replica of the file that exist on other peers in the system, and then incorporate these metadata into the local replica’s descriptor. The enhanced local descriptor makes the replica more identifiable, increasing the likelihood that a query correctly matches it.

In this paper, we consider the design of a probing system in a P2P file-sharing system. We cover the issues of determining when a peer should probe, what file it should probe, and what it should do with the probe results.

2. RELATED WORK Recent work proposes to identify rare files in search results [1][2]. However, these works do not consider the more basic problem of retrieving the rare data in the first place.

Limewire’s Gnutella attempts to improve recall and limit bandwidth consumption by increasing the time-to-live of queries that yield few results [3]. Again, this system assumes that data are described adequately enough to be returned by a query.

There are also classes of work that strive to improve search in P2P environments that share text documents [4]. Such systems do not suffer from query over-specification because the sought-after

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright is held by the author/owner. Ninth International Workshop on the Web and Databases (WebDB

2006), June 30th, 2006, Chicago, IL, USA.

document is self-describing, and does not rely on small, user-defined descriptors for matching.

Other work tries to improve performance by “intelligently” routing queries to the most-relevant servers [5]. These works assume that files are adequately described for queries to return matches.

Some work assumes that file descriptors are structured [17]. Structure constrains queries and may improve performance. Although our application assumes that descriptors are small and unstructured, there is no reason why our techniques cannot be applied to the structured data environment.

Our approach bears some resemblance to work that either adds terms to queries or to descriptors to improve recall [16]. However, in practice, there exists no standard mechanism to for adding terms and it is unclear what impact this would have on query performance or cost. Moreover, such techniques are generally not applicable when queries are conjunctive.

In general, our work takes a different approach, focusing on the client application level by improving recall by enhancing data description.

3. MODEL Peers collectively share (or publish) a set of (binary) files by maintaining local replicas of them. Each replica is represented by a descriptor, which also contains an identifying key (e.g., an MD5 hash on file’s bits). All replicas of the same file naturally share the same key. A query issued by a client is routed to all reachable peers until the query’s time-to-live expires. The query matches a replica if it is contained in the replica’s descriptor. For each match, the server returns its system identifier and the matching replica’s descriptor.

Formally, let O be the set of files, M be the set of terms, and P be

the set of peers. Each file o1,o2∈O has an identifier associated with it, denoted ko1, such that ko1=ko2 if and only if o1=o2. We also refer to ko1 as the key of file o1 (e.g., the MD5 hash value mentioned above).

Each file o∈O has a set of terms that validly describe it. We

denote the set of valid terms for o as To⊆M. Intuitively, To is the set of terms an average person would “most likely” use to describe

o. Each term t∈To has a strength of association with o, denoted

soa(t, o), where 0≤soa(t, o)≤1 and ∑t∈Tosoa(t, o)=1. The strength of association a term t has with a file o describes the relative likelihood that it is to be used to describe o, assuming all terms are independent. The distribution of soa values for a file o is called the natural term distribution of o.

A peer s∈P is defined as a pair, (Rs, gs), where Rs is the peer’s set

of replicas and gs is the peer’s unique identifier. Each replica

ros∈Rs is a copy of file o∈O, maintained by peer s, and has an

associated descriptor, d(ros)⊆M, which is a multiset of terms that

is maintained independently by s. Each descriptor d(ros) also

contains ko, the key of file o. The number of terms that a descriptor can contain is fixed.

A query Qo⊆To for file o is also a multiset of terms. The terms in

Qo are expected to follow o’s natural term distribution. When a query Q arrives at a server s, the server returns result set

UQs=(d(ro

s), gs) | ro

s∈Rs and Q⊆d(ros) and Q≠Ø—membership in

the result set requires that a result’s descriptor contain all query terms, in accordance with the matching criterion,.

The client receives result set UQ=∪sUQ

s, s∈P, and groups individual results by key, forming G=G1, G2, …, where Gi=(di, i,

li), di=⊕d(ris) | (d(ri

s), gs)∈UQ and ki=i is the group’s descriptor, i

is the key of Gi, and li=gs | (d(ris), g

s)∈UQ and ki=i is the list of

servers that returned the results in Gi. In this definition, ⊕ is the multiset sum operation.

The client assigns a rank score to each group with function Fi∈F,

defined as F: 2M×2M

×Z×Z→R +. If Fi(dj, Q, |Gj|, timej) > Fi(dk, Q,

|Gk|, timek), where Gj, Gk are groups, then we say that Gj is ranked higher than Gk with respect to query Q. In these definitions of F, |Gj| is the number of results contained in a group, and timej is the creation time of the Gj (i.e., the time when the first result in Gj arrived).

In commercial systems, such as various versions of Gnutella and eDonkey, popular P2P file-sharing systems, the ranking function is based on group size:

FG(d, Q, a, b) = a.

Descriptors in these systems are generally implemented via filenames, although a small amount of descriptive information may be embedded in the actual replica (e.g., ID3 data embedded in MP3 files [14]).

Result keys are generated by MD5 or SHA-1 hashing, and results are grouped based on these keys. Furthermore, when a client downloads a file, the descriptor of this new replica is initialized as a duplicate of one of the servers in the result set.

To simplify our explication, we will use the term “result” informally to describe either a group or an individual result, and clarify the usage if necessary. We refer to the collective set of terms contained in descriptors as metadata.

4. PROBING

4.1 Implementation of Probe Queries A probe query for a file o is implemented as a query that contains a single term – the key of a file, ko. Because, by assumption, every descriptor contains a key, this query is guaranteed to match all the replicas of a particular file that it encounters. The use of keys to find replicas occurs in practice. The popular P2P file-sharing system, DC++ (dcpp.net), for example, relies on the existence of keys to identify other sources of a file for multi-source downloading.

Probe query routing can be done via the same routing mechanism used by ordinary queries – controlled flooding via “ultrapeers” [7]. Alternatively, due to the simple nature of the query, an inverse index, consisting of a key and a list of servers can be created on top of a DHT-like routing infrastructure or other indexing mechanisms (e.g., [15]) to save network resources [8].

4.2 Steps in Probing There are four steps to probing: First, some mechanism triggers the probe in the client. If so, then the client selects a file to probe, collects the results, and then applies the results to the descriptor of the probed file.

4.2.1 Triggering the Probe In the base case, peers can probe at regular or randomly distributed intervals. However, more selective probing makes better use of system resources. For example, a peer that is relatively busy should not bother probing, as it is already

burdened. Rather, a peer that is unsuccessfully trying to share its large library should probe. By doing so, it also relieves the burden of the other peers.

We identify three factors that help us to determine the appropriateness of probing for a peer: number of queries received (Nq), number of responses returned (Nr), and number of files published (Nf). Given a user-defined threshold T, if the following condition holds, the peer performs a probe:

T < NfNq / (Nr+1) - NpT,

where Np is the number of probes the peer has already performed.

The probe threshold condition makes Nr the basic metric that describes a peer’s participation or utilization in the system. A high Nr can be taken to mean two things: either the peer is adequately advertising its shared files, and/or it is busy. In either case, the peer should not increase its workload by probing.

The degree of participation, however, should be in proportion to the desire of the user to participate, which we measure by Nf. Therefore, a peer that shares many files (high Nf) but does not respond to many queries (low Nr) should probe.

The motive force behind probing, however, is the total activity level in the system, as perceived by the client. If there is a low level of activity in the system, measured by Nq, then probing is a waste of resources. Only in an active system should probes be performed. We therefore add Nq to the threshold condition.

The given condition, therefore, measures the number of responses per shared file and per query. By fixing T over all peers, we tacitly assume that all files are equally popular.

The second term in the condition above “resets” the threshold condition after each probe.

4.2.2 Selecting a File to Probe Once a peer has decided to perform a probe, it must identify a file to probe. Because our goal is to utilize all peers, the logical choice is the file that most contributed to satisfying the probe threshold. By definition, this is the file that has been returned the fewest times as a query result. Let Ni

m be the number of times a local replica ri has matched a query (i.e., the number of times it has been returned to a client). A file-picking criterion, therefore, is

Criterion 1: Probe ri, where Nim is minimum over all i.

In other words, a low Nim indicates that ri is not being returned at

an adequate rate, and may need help in its description.

If we assume that Nim is a measure of file popularity, one of the

features of Criterion 1 is that it allows the less popular data items to be probed. The benefit of this is that probing popular files should not be necessary as the likelihood that a query does not match at least one of the available replicas is slim.

Another option is to select a file based on how it is described. A file with a small descriptor is not likely to be found because queries over-specify them (i.e., contains terms that are not in the file’s descriptor). Let |d(ri)| be the number of unique terms in the descriptor d(ri) of ri.

Criterion 2: Probe ri, where |Di| is minimum over all i.

Both of the file-picking criteria we just described suffer from the inability to handle situations where probing has no effect on either Ni

m or d(ri). In such a case, the same file will be probed repeatedly

to no use. To solve this problem, we use Criterion 1 as our file-picking criterion, and after every probe of ri, we do the following:

If Nim = 0 then Ni

m ← 1 else Nim ← 2 Ni

m.

By doubling the metric used in Criterion 1, we reduce the likelihood of all probes being performed by a particular peer are for the same file. Such a situation may still arise, of course, but only in degenerate cases where a particular file is very unpopular. In the event of a tie in Criterion 1, we use Criterion 2.

4.2.3 Copying Terms to the Local Descriptor Once probe results for ri are returned to the client, they are grouped into a multiset. The client then selects terms from this multiset to add to d(ri), the local descriptor of ri.

As discussed in [2], there are many ways to copy terms into the local descriptor: randomly, by frequency, and so on. To simplify the presentation of results, we randomly copy terms from the probe results into d(ri), biasing the likelihood that a term is copied by its relative frequency. For example, a term that occurs twice as frequently as another in the probe results is twice as likely to be copied. This process repeats until the local descriptor is full.

In our experiments, copying the most frequent terms also does well, but at the expense of slightly higher cost. We therefore leave out these results.

5. EXPERIMENTIAL RESULTS We simulate the performance of a P2P file-sharing system to test the large scale performance of our methods. In accordance with the model described in [10] and observations presented in [11], we enhance our experimental model with interest categories, which model the fact that some users have stronger interests in some subsets of data than other. We partition the set of files, O,

into sets Ci, where Ci⊆O, Ci∩Cj=Ø if i≠j, and ∪iCi=O. At initial-

ization, each peer s∈P is assigned some interests Is⊆∪iCi, and is allocated a set of replicas Rs from this interest set: Rs=ro

s |

o∈∪iCi, where Ci∈Is. For each replica, ros, allocated at initializa-

tion, d(ris)⊆Ti, where term allocation is governed by natural term

distributions. Peer s’s interest categories also constrain its searches;

it only searches for files from ∪iCi, where Ci∈Is.

Each category Ci has an assigned popularity, bi, which describes how likely it is to be assigned to a peer. The values of bi follow the Zipf distribution [10].

Within each interest category, each file varies in popularity, which is also skewed according to the Zipf distribution [10]. This popularity governs the likelihood that a peer who has the file’s interest category is either initialized with a replica of the file or decides to search for it.

Peers in our simulator are populated with TREC data from the 2GB Web track (WT2G), where Web domains, documents in the domains, and terms in the documents are mapped to interest categories, files in categories, and files’ valid terms, respectively. Natural term distributions are based on the term distributions within the Web pages. An initial set of replicas with random descriptors of associated terms is allocated to peers based on pre-assigned interest categories. Queries for files are generated using valid terms with a length distribution typical of that found in Web search engines [12] as shown in Table 1 and also exhibited in our P2P file-sharing system query logs.

The simulation parameters shown in Table 2 are based on observations of real-world P2P file-sharing systems and are comparable to the parameters used in the literature.

The data set used consists of an arbitrary set of 1,000 Web documents from 37 Web domains. Terms are stemmed, and HTML markup and stop words are removed. The final data set contains 800,000 terms, 37,000 of which are unique.

We also conducted experiments on other data sets with other data distributions, but, due to space constraints, we only present a representative subset of our results. The data we used for these experiments can be found on our Web site [13].

Table 1-Query Length Distribution.

Length 1 2 3 4 5 6 7 8

Prob. .28 .30 .18 .13 .05 .03 .02 .01

Although other behavior is possible, we assume that the user identifies and downloads the desired result group with a

probability 1/r, where r≥1 is its position in the ranked set of results.

Table 2-Parameters Used in the Simulation.

Parameter Value(s)

Num. Peers 1000

Num. Queries 10,000

Max. descriptor size (terms) 20

Num. terms in initial descriptors 3-10

Num. categories of interest per peer 2-5

Num. files per peer at initialization 10-30

Num. trials per experiment 10

Performance is measured using a standard metric known as mean reciprocal rank score (MRR) [18], defined as

q

N

ii

N

rankMRR

q

∑ =

=

1

1

,

where Nq is the number of queries and ranki is the rank of the desired file in query i's result set. MRR is an appropriate metric in applications where the user is looking for a single, particular result.

For reference, we also present precision and recall, which have slightly different definitions than they do in traditional IR, due to the fact that replicas exist in the P2P file-sharing environment, and assuming that queries are for particular files. Let A be the set of replicas of the desired file, and R be the result set of the query. Precision and recall are defined as:

||

||

R

RAprecision

∩= ,

||

||

A

RArecall

∩= .

These more traditional IR metrics are useful in roughly diagnosing the performance of query processing and in generalizing the presented performance to other domains.

5.1 Triggering the Probe We now consider the impact of various probe triggering techniques on performance. The three techniques we consider are:

1. No probing.

2. Using the threshold, described in Section 4.2.1.

3. Randomly selecting a peer to probe.

For the probing cases, we perform 5,000 probes. To do this using the probe threshold, we tune T so that, after 10,000 queries, approximately 5,000 probes are issued. For random probing, we assign to each peer a probability of probing during each iteration of the simulation that results in 5,000 probes after 10,000 queries.

Experimental results shown in Figure 1 clearly indicate that prob-ing improves query performance. Probing randomly increases MRR by 20%. Probing using the threshold, however, increases MRR by 30%.

0

0.1

0.2

0.3

0.4

0.5

noprobe random T5K

Probe Triggering Technique

MRR

Figure 1-MRR with Various Probe Triggering Techniques.

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

0.5

1 2 3 4 5 6 7 8 9 10

Popularity Partition

Ratio rand

T5K

Figure 2-Ratio of Number of Probe Queries to Number of Files

for Various Popularity Partitions.

The reason for this performance improvement is because probes are correctly being performed by under-utilized peers. This is suggested by the fact that probes are generally directed toward the files that are perhaps more difficult to find due to their lower popularity. To verify this, we partitioned the set of files into ten popularity partitions. Each partition represents 10% of the files, where partition 1 contains the most popular 10% of files, partition 2 contains the next 10% most popular files, and so on. In Figure 2, we show the ratios of number of probe queries issued for files in each popularity partition and the number of replicas in each popularity partition. The results indicate that more of the probes are being performed on less popular data when using the threshold to pick files. As expected, when probing is performed randomly,

the ratios remain constant over all partitions, as all replicas are equally likely to be probed.

We also ran experiments where we varied the threshold T. To control these experiments, we set T to values such that, after 10,000 queries, there were 2,500, 5,000, 7,500, and 10,000 probes. The results, shown in Figure 3 are intuitive. As the number of probes increases, so does performance. However, the rate of performance increase decreases with an increasing number of probes. Additional probing after most files have been probed has marginal value.

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

0.5

noprobe T2.5K T5K T7.5K T10K

Number of Probes

MRR

Figure 3-Effect of Various Probing Rates on MRR.

5.2 Alternative Probe File Selection

Techniques We now consider the performances of two alternative file selection techniques:

1. Randomly selecting a file to probe.

2. Selecting a file to probe based on the criteria discussed in Section 4.2.2.

These experiments assume that we are triggering probes using the threshold T5K (T tuned to perform 5,000 probes after 10,000 queries).

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

0.5

random criteria

File Selection Technique

MRR

Figure 4-MRR with Various Probe File Selection Techniques.

The results of these experiments, shown in Figure 4, suggest that the results are a wash. There is very little difference in MRR between the two techniques. A deeper analysis, (not shown here), shows that a larger proportion of probes, when using the “criteria,” are directed toward already popular files. This is wasteful in two ways. First, the probes are unnecessary for popu-lar files, and, second, because popular files are queried frequently,

cost, in terms of the number of results per query, (which we consi-der later), increases unnecessarily.

In other words, there is significance to the different file selection techniques that are not revealed by MRR in these experiments. We have used other file selection techniques that maximize the identification of rare files, but often at the expense of overall MRR. More in-depth treatment of other file-selection techniques is the subject of ongoing work.

5.3 Cost Analysis We define cost as the number of query responses received by the client. This metric roughly estimates the amount of work the client must perform to process a query. More importantly, this metric roughly estimates network cost in a topology-independent way.

Probing increases the cost of each query. By enhancing data description, we also increase the likelihood that a query will match some file. The increase in cost can be significant.

To counter this cost, we propose server-side Bernoulli sampling of the result set for each query. That is, for each matching result for a query, the server decides to return it to the client with a fixed probability Pr, 0≤Pr≤1. This type of sampling is expected to preserve the overall distribution of terms and results in the result set. It also allows us to predictably reduce the cost by a factor Pr. In the experiments in this section, we arbitrarily use “criteria” probe file selection. The results for random probe file selection are similar.

Figure 5-Responses per Query for Different Probing and

Sampling Rates. The black line indicates cost without probing.

As shown in Figure 5, cost increases with probing range from 36% to over 100% with varying thresholds if sampling is not used. Predictably, with sampling, costs are reduced by, approximately, a factor Pr. The cost decrease factor is slightly greater than Pr because, in a well-running P2P file-sharing system, the average number of results per query is high because more peers are actively sharing more files.

Sampling, in fact, is able to reduce the cost of probing to levels below that of not probing with no sampling. This decrease in cost from the base case can be over 50%.

Sampling, however, has a negative impact on MRR. This is the case because it is likely that, for some queries, the desired result will be selected out of the result set. The question is whether the decrease in MRR offsets the improvements in cost.

0

20

40

60

80

100

120

T2.5K T5K T7.5K T10K

Threshold

Re

sp

on

se

s p

er

Qu

ery

1

0.75

0.5

0.25

Figure 6-MRR with Different Probing and Sampling Rates.

The black line indicates MRR without probing.

Fortunately, MRR decreases at a slower rate than cost, as shown in Figure 6. MRR decreases by at most about 15% with a sampling rate of 25%. However, in these experiments, MRR is never worse than when not using probing. For example, when using T10K probing and 25% sampling, MRR is approximately 20% better than when not probing, and cost is 35% lower. Probing with sampling can therefore lead to a win-win situation in terms of both ranking performance and cost. Based on this per-formance, it seems likely that a good way of designing a probing system would be to maximize probing rate, and then reduce cost, via sampling, as necessary.

The reason for this positive performance/cost behavior is due to the effect of probing on recall and precision. Result sets are generally of a higher quality in terms of these two metrics as shown in Figure 7. The increased precision in particular, reduces the likelihood that sampling will eliminate all relevant results from a result set.

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

0.5

noprobe T2.5K T5K T7.5K T10K

Threshold

recall

precision

Figure 7-Recall and Precision with Various Probing Rates.

6. CONCLUSION Given the conjunctive matching criterion of today’s P2P file-sharing systems, poor data description limits overall performance. Probe queries help solve this problem by automatically tuning local descriptors using those of peers. Our experimental findings demonstrate that it is possible to improve performance with probes with very little (potentially negative) cost.

We are now considering ways of better controlling exactly where probes are directed (i.e., more or less popular files). We are also working on models that help in tuning threshold and sampling values in a distributed manner.

7. REFERENCES [1] B. Loo, J Hellerstein, R Huebsch, S Shenker and I Stoica.

Enhancing P2P File-sharing with an Internet-Scale Query Processor. In Proc,VLDB Conf. August, 2004.

[2] W. G. Yee, D. Jia, and O. Frieder, Finding Rare Data Objects in P2P File-sharing Systems, In Proc. of the Fifth IEEE Intl.

Conf. on Peer-to-Peer Computing, Germany, September 2005.

[3] Limewire. http://www.limewire.org.

[4] J. Lu and J. Callan, Federated Search of Text-Based Digital Libraries in Hierarchical Peer-to-Peer Networks, Proc. Euro.

Conf. on Inf. Retr., 2005.

[5] K. Sripanidkulchai and B. Maggs and H. Zhang, Efficient Content Location Using Interest-Based Locality in Peer-to-Peer Systems, Proc. IEEE INFOCOM, 2003.

[6] C. Rohrs, Keyword Matching [in Gnutella], LimeWire

Technical Report, Dec., 2000, www.limewire.org/techdocs/KeywordMatching.htm.

[7] T. Klingberg and R. Manfredi, Gnutella Protocol 0.6, 2002, rfc-gnutella.sourceforge.net/src/rfc-0_6-draft.html.

[8] I. Stoica, Robert Morris, D. Karger, F. Kaashoek and H. Balakrishnan, Chord: A Scalable Peer-To-Peer Lookup Service for Internet Applications, Proc. SIGCOMM, 2001.

[9] Slyck.com P2P File-sharing Statistics. http://slyck.com/stats.php

[10] M. T. Schlosser, T. E. Condie, and S. D. Kamvar. Simulating a file-sharing p2p network. In Proc. Wkshp. Semantics in

Peer-to-Peer and Grid Comp., May 2003.

[11] S. Saroiu, P. K. Gummadi, and S. D. Gribble. A measurement study of peer-to-peer file sharing systems. In Proc.

Multimedia Computing and Networking (MMCN), Jan. 2002.

[12] P. Reynolds and A. Vahdat. Efficient peer-to-peer keyword searching. In Proc. ACM Conf. Middleware, 2003.

[13] PIRS Research Group Data, http://ir.iit.edu/~waigen/proj/pirs/data/

[14] M. Nilsson. Id3v2 web site. Web Document, 2006. www.id3.org.

[15] H. V. Jagadish, B. C. Ooi, K.-L. Tan, Q. H. Vu, and R. Zhang. Speeding up Search in Peer-to-Peer Networks with A Multi-way Tree Structure. In Proc. SIGMOD, 2006.

[16] Md. Mehedi Masud, Iluju Kiringa, Anastasios Kementsietsidis. Don't Mind Your Vocabulary: Data Sharing Across Heterogeneous Peers. In Proc. of the Intl. Conf. on

Coop. Inf. Sys. (CoopIS), 2005.

[17] S. Abiteboul, I. Manolescu and E. Taropa, A Framework for Distributed XML Data Management. In Proc. EDBT, 2006.

[18] E. Voorhees and D. Tice, "The TREC-8 Question Answering Track Evaluation," In Proc. of the Eighth Text REtrieval

Conference, 1999.

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

0.5

T2.5K T5K T7.5K T10K

Threshold

MR

R

1

0.75

0.5

0.25

KISS: A Simple Prefix Search Scheme in P2P Networks∗

Yuh-Jzer [email protected]

Li-Wei [email protected]

Department of Information ManagementNational Taiwan University

Taipei, Taiwan

ABSTRACTPrefix search is a fundamental operation in information re-trieval, but it is difficult to implement in a peer-to-peer(P2P) network. Existing techniques for prefix search sufferfrom several problems: increased storage/index costs, unbal-anced load, fault tolerance, hot spot, and lack of some rank-ing mechanism. In this paper, we present KISS (Keytoken-based Index and Search Scheme), a simple and novel ap-proach for prefix search to overcome the above problems.

1. INTRODUCTIONPrefix search is a fundamental search operation in infor-

mation retrieval (IR). It allows users to retrieve desired ob-jects even though they have only partial information aboutthe objects. For example, a query like “comp*” can matchany objects with keywords computer, company, or competi-tor that have prefix comp. Prefix search is often used incombination with keyword search, for example, like “ACMSIG* proceedings” to search proceedings from all ACM spe-cial interest groups.

Prefix search is more general than keyword search, as thelatter can be viewed as a special case in prefix search. So asystem that supports prefix search can easily facilitate key-word search, but not vice versa. There are, however, sometechniques to extend keyword search to prefix search. Themost common way is to use the n-gram technique [16, 7, 9].This technique augments each keyword w with all its pre-fixes to describe an object associating with w. For example,in Fig. 1, object O1 has three keywords abc, abd, and ce.By expanding the keywords with their prefixes, we have thefollowing six keywords to describe the object: a, ab, abc,abd, c, and ce. Using this technique, a prefix query of “ab*”in Fig. 1 can be converted into an ordinary keyword searchwith query “ab” to retrieve the matching objects O1, O2,and O3.

To implement keyword search, an inverted index datastructure is typically used. The idea is to “reverse” the

∗This work was supported in part by the National ScienceCouncil, Taipei, Taiwan, Grants NSC 93-2213-E-002-096and NSC 94-2213-E-002-036.

Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee.Copyright 200X ACM X-XXXXX-XX-X/XX/XX ...$5.00.

a, ab, abc a, ab, abd c, ce

a, ab, abc c, ce a, ad, adf

a, ab, abc a, ab, abd a, ad

prefix/ keyword

a ab abc abd ad adf c ce

List of Objects

O 1 , O 2 , O 3 O

1 , O

2 , O

3

O 1 , O

2 , O

3

O 1 , O 3 O 2 , O 3 O

2

O 1 , O

2

O 1 , O 2

Object O 1

term1 : abc term2 : abd term3 : ce

term1 : abc term3 : ce term5 : adf

term1 : abc term2 : abd term4 : ad

Object O 2

Object O 3

Figure 1: An inverted index of three objects.

object-keyword mapping. That is, for a given list of pairs(σ, K) specifying the set of keywords K associating witheach object σ, we build a list of pairs (w, O) specifying theset of objects O that have keyword w. For example, Fig. 1shows the inverted index list after keyword expansion. Withthe inverted index, keyword search is easy to accomplish:simply look up the list for the queried keyword, and returnthe object set associated with it. Some set operations mayalso be performed if necessarily. For example, suppose onewishes to retrieve objects with three prefixes in Fig. 1: ab*,ad*, and ce*. Then by looking up the entries for ab, ad, andce and then performing an intersection of the correspondingobject sets, the system returns the matching object O2.

The above approach, although commonly used in existingcentralized information systems, suffers from several seriousproblems when implemented in a distributed environment,in particular, P2P networks. First, keyword frequency—thecount of a keyword’s occurrences in objects—varies enor-mously. This has been observed in many real world cor-pora. So a straightforward way of distributing keyword-object pairs to peers in a P2P network would result in avery unbalanced load. The problem is magnified when tak-ing prefix search into account. For example, in English,many words have a common prefix, e.g., in-, co-, etc. Ina preliminary study, we examined three datasets and ob-tained the following statistics:1 the top 0.1% most frequentkeywords already account for 43.4% of the total occurrencesof the keywords, but after prefix expansion, the percentageincreases to 62.6%. Therefore, simply mapping each entryin an expanded inverted index to a node makes the indexingload extremely uneven.

Secondly, the n-gram technique will expand the (original)keyword set size by approximately l-fold, where l is the av-

1 The three datasets are: (1) half million of book entriesfrom a library, (2) 300,000 news articles from Reuters, and(3) half million music records from FreeDB. The statisticsis the average of them.

erage keyword length (note that some keywords may have acommon prefix). We see that even for keyword search only,“distributed” inverted index already makes object insert,delete, and maintenance very expensive. This is because ifan object has k keywords, then any insert/delete of the ob-ject has to access k nodes. When the keyword set size isexpanded to l × k, the maintenance cost increases by l-foldas well.

Third, although an object is indexed at several places,each prefix/keyword is still handled only by a single node. Soany failure to the node would then block all queries involvingthe prefix/keyword. The system is also vulnerable to hotspots, as nodes responsible for some popular prefixes maybe queried much more frequently than the others.

Finally, the above approach lacks some ranking mecha-nism to allow the system to return objects in sequence. Ingeneral, a short prefix query may result in many possiblematching items. One would certainly prefer some rankingmechanism to help select relevant objects. For example, aquery of “comp*” may return objects with keywords com-puter, company, or competitor, etc. However, for a nodethat indexes the prefix comp, it has no idea the actual key-word sets the matching objects have—unless the node alsomaintains their object-keyword information. The latter, un-fortunately, significantly increases the index cost.

In this paper, we present a simple yet effective index mech-anism for prefix search in P2P networks that eliminates theabove problems. We call our system KISS (Keytoken-basedIndex and Search Scheme), as it uses a novel technique to ex-tract keytokens—character-position information—from key-words to index objects over a hypercube. Below we presentthe index and search scheme, followed by some experimen-tal studies of the system. Related work and conclusions aregiven in Section 4 and Section 5, respectively.

2. KEYTOKEN-BASED INDEX AND SEARCHSCHEME

In this section we present KISS. We begin by present-ing a very fundamental process in KISS called tokenization,which extracts characters and their position information ina keyword. Then we describe how to index objects in a hy-percube using the character-position information extractedfrom their keywords. Based on this index scheme, we presentsearch mechanisms to retrieve objects.

2.1 TokenizationLet A be the set of alphabets in consideration. A keytoken

is a pair 〈c, i〉, where c ∈ A and i a nonnegative integer. Fornotational simplicity, we sometimes write 〈c, i〉 as ci whenno confusion is possible. Let W ⊂ A+ be the set of keywordsused in the system. For each keyword w ∈ W, we use w[i]to denote the ith character of w, and ||w|| to denote thelength of w. A tokenization is a function τ that extracts allkeytokens from a given keyword w; that is

τ (w) =n

〈w[i], i〉 | i ≤ ||w||o

We call τ (w) the keytoken set of w.For example, τ (webdb) = w1, e2, b3, d4, b5. Note that

the character positions in a valid keytoken set (a keytokenset is valid if it is extracted from a word) must be continuousand start from 1; i.e., 1, 2, . . .. Also note that given W, thenumber of all possible keytokens that can be extracted fromW is no greater than |A|× lmax, where lmax is the maximumlength of a keyword. We shall use T to denote the set of allpossible keytokens considered in the system.

keytoken space

range 1,... ,r

o object

keyword1

1 r-bit vector 1 1 1 0 1 0 0 0 0

a node in hypercube with node id=1011010001

tokenization

hash mapping

w 1 =bd w 2 =bade

a2 b1 d2 d3 e4

8 7 1

keyword2

10 5

Figure 2: The KISS index scheme.

2.2 Index SchemeOur index scheme is based on an r-dimensional hypercube

Hr(V, E). The hypercube can be constructed directly froma physical hypercube, or conceptually built on an underly-ing DHT-based P2P network G = (V ′, E′) (e.g., [19, 14]).To construct Hr(V, E) over G = (V ′, E′), we simply need amapping g : V → V ′ so that every logical node in the hyper-cube has a corresponding physical node in the network. Asmost DHT-based P2P networks have a mechanism to han-dle absence of nodes that are responsible for some identifierkeys, we assume the hypercube, on which our index schemeis based, is reliable and self-organizing. In the rest of thepaper by “nodes” we refer to the nodes in the hypercube.Each node has a unique r-bit binary string as its id. We useu[i], 1 ≤ i ≤ r, to denote the ith bit of u (counting from theright). Below we describe how objects are indexed at thenodes in the hypercube.

Recall that T is the set of all possible keytokens. Leth : T → 1, . . . , r be a hash function that uniformly andindependently maps every keytoken in T to an integer in1, . . . , r. We define a mapping Fh : 2T → V as follows:Fh(T ) = u if, and only if, i |u[i] = 1, 1 ≤ i ≤ r =h(t) | t ∈ T. In other words, Fh(T ) is the node with abinary id whose bits are set by h according to the keyto-kens in T . For example, suppose h(w1) = 3, h(e2) = 1, andh(b3) = 7, and r = 8. Then the keytoken set w1, e2, b3 ismapped to the node 01000101.

We say that a node u is responsible for a keytoken setT if Fh(T ) = u. Thus, for every possible set of keytokensin the system, there is exactly one node in the hypercuberesponsible for the set. Note that due to hash collision, anode may be responsible for more than one set of keytokens(as Fh(T ) might be equal to Fh(T ′) for some T 6= T ′).

To index objects at nodes, let σ be an object and Kσ bethe set of keywords associated with σ. Let Tσ = ∪w∈Kσ τ (w)be the set of keytokens extracted from the keywords of σ.Then, σ is indexed at the node u such that Fh(Tσ) = u. Itshould be clear that when an object σ is indexed at a nodeu, u needs to maintain, in addition to the keyword set of σ,the actual location information of σ (e.g., source IP, port,and file path of the object) from where σ can be retrieved.For simplicity, we sometimes say that a node u has indexeda keyword w if w belongs to a keyword set of some objectindexed at u. Fig. 2 illustrates the index scheme for anobject o that has two keywords bd and bade. The object isindexed at node 1011010001.

2.3 Search Strategies

(a)

0000 0001

0 1 00 0 1 01

0010

0 1 10 0 1 11

0011

1000

1 1 00 1 1 01

1001

1 1 10 1 1 11

1010 1011

000 001

010 011

100 101

110 111

(b) (c)

0 1 00 0 1 01

0 1 10 0 1 11

1 1 00 1 1 01

1 1 10 1 1 11

Figure 3: (a) H4, (b) H4(0100), (c) H3.

In this section we present search methods in KISS. We firstobserve that, due to our index scheme, pin search—given akeyword set K, locating the object that is associated exactlywith K—is straightforward. This is because given K, wesimply need to extract the keytoken set T = ∪w∈Kτ (w),and then find the node responsible for T , that is, the nodeFh(T ). The node Fh(T ), given its identifier, can be easilylocated using the routing scheme in the underlying DHTnetwork. From the node, by a local search of its index table,it can return the actual location of the object.

For example, in Fig. 2, to search objects with keyword setbd, bade, we first obtain the keytoken set T = a2, b1, d2,d3, e4, and then find the node Fh(T ) = 1011010001. Thenwe can issue a query to node 1011010001 to retrieve theobjects. Pin search is particularly useful when one wishesto locate an object for maintenance (e.g., updating its indexrecord), or when one has a precise description about histarget object.

We now turn our attention to prefix search, where given aprefix w, we need to retrieve objects that contain a keywordw′ such that w is a prefix of w′. We shall use ‘’ to denotethe prefix relation. So w w′ means that w is a prefixof w′. We begin by noting that if a node u is responsiblefor the prefix w, i.e., Fh(τ (w)) = u, then for every w′ suchthat w w′, every node that may index w′ must have anidentifier v satisfying the following condition: u[i] = 1 ⇒v[i] = 1, 1 ≤ i ≤ r. So, to search objects whose keywordsets contain prefix w, we need only to search nodes thathave bit ‘1’ at the positions that u also has, i.e., at thepositions h(t) | t ∈ τ (w). These nodes, in fact, form a“subhypercube” of Hr, as defined below:

Definition 2.1. Let Hr = (V, E) be a hypercube and u ∈V be a node. A subhypercube induced by u, denoted byHr(u), is a subgraph G = (U,F ) of Hr such that every nodev ∈ V is in U if and only if u[i] = 1 ⇒ v[i] = 1, and everyedge e ∈ E is in F if and only if its two end points are inU .

Fig. 3 illustrates H4(0100) induced by node 0100 in H4,which is isomorphic to H3.

So, given a prefix w, nodes which may index a keywordw′ with prefix w must be in the subhypercube induced byFh(τ (w)). For example, in Fig. 2, since Fh(b1, d2) =0011000000, all nodes that may index a keyword beginningwith bd must have an identifier like xx11xxxxxx, and all suchnodes are in the subhypercube induced by node 0011000000.

To explore nodes in a hypercube, we recall that nodes in ahypercube can be traversed via a spanning binomial tree [10].There are many ways to construct a spanning binomial tree.For example, Fig. 4 illustrates two spanning binomial treesof the hypercube induced by node 0100. For our purpose,we need a spanning binomial tree that can assist rankingand even reduce the search space. For example, considerFig. 5, which shows a spanning binomial tree of the sub-hypercube induced by node 010100. The root node 010100is responsible for the keytoken set a1, b2. Nodes 011100,

(a)

0 1 00 0 1 01

0 1 10 0 1 11

1 1 00 1 1 01

1 1 10 1 1 11

1 1 1 1

1 1 00

0 1 00

0 1 1 0 0 1 0 1

1 1 1 0 1 1 01 0 1 1 1

(b) (c)

1 1 1 1

0 1 10

0 1 00

1 1 0 0 0 1 0 1

1 1 1 0 0 1 11 1 1 0 1

Figure 4: H4(0100) (a) and its two spanning binomialtrees (b) and (c).

0 1 0 1 0 0

1 1 0 1 0 0 0 1 1 1 0 0 0 1 0 1 1 0 0 1 0 1 0 1

0 1 0 1 1 1 0 1 1 1 0 1 0 1 1 1 1 0

1 1 0 1 1 1 1 1 1 1 0 1

1 1 0 1 0 1 1 1 0 1 1 0

0 1 1 1 1 1 1 1 1 1 1 0

1 1 1 1 0 0

1 1 1 1 1 1

a1 b2

c3 d3 d4 e5

d3 d4 d4

order: a1, b2, c3, d3, d4, e5

X X

X X

X

1 2 3 4 5 6

ab abc abd abcd abdd abcde abdde

ab abc abd abcd abdd abcde abdde

BFS output DFS output

e5 e5

e5 e5

e5

e5

e5

d4

Figure 5: Search in a spanning binomial tree.

010110, 110100, and 010101 are the first level nodes in thetree. We identify the keytoken that sets a correspondingbit to one on top of the bit. For example, keytoken c3 setsbit 4 to one. We note that the tree is arranged so that ifwe explore the tree in a breadth-first style (BFS), we canfirst locate objects with keyword ab, then with keyword abc,then with keyword abd, and so on. In contrast, a depth-firstsearch (DFS) will first locate objects with keyword ab, thenwith abc, then with abd, and so on. It can be seen that,with a few exceptions, keywords searched in BFS are sortedin length and then in alphabetic order, while DFS retrieveskeywords sorted in alphabetic order and then in length (i.e.,in lexicographical order).

Another important feature to observe is that not all of thenodes need to be searched. For example, the node 110100cannot possibly index any object. To see this, recall thatthe character positions in a valid keytoken set must be con-tinuous and start from 1. For a keytoken set T , we de-fine a function pos to project the positions in T ; that is,pos(T ) = i | 〈x, i〉 ∈ T. For a given node u in our indexhypercube, let Tu be the maximal keytoken set responsi-ble by u. Then, u cannot index a keyword w of length l ifpos(Tu) does not contain 1, 2, . . . , l. The node 110100 isresponsible for the keytoken set a1, b2, d4. However, sinceit does not contain a keytoken with position 3, it cannot in-dex a keyword of length greater than two. In fact, the nodecannot even index any object, as any object with a keytokenset a1, b2 will be indexed at node 010100, not at 110100.Likewise, its child node 110101, which is responsible for thekeytoken set a1, b2, d4, e5, cannot index any object, ei-ther. Such kind of nodes that cannot possibly index anyobject are called null nodes. They allow us to reduce thehypercube search space.

To construct a spanning binomial tree like Fig. 5, we needsome ordering on keytokens in T . Assume the alphabets inA is totally ordered by a relation ‘≤’. Then, we define a

total order ‘≤’ over keytokens in T as follows:

〈a, i〉 ≤ 〈b, j〉 iff i < j ∨ (i = j ∧ a ≤ b)

For example, let A be the English letters with alphabeticalordering. Then a1 ≤ b2 ≤ c2.

The total order allows us to determine which node tochoose first when “spanning” a binomial tree from a givenroot node. The complete definition is somewhat complex.Due to space limitation, it will be given in the full paper.

2.4 Guided Depth-First SearchThe hypercube search space can be further reduced by

noting that some keytokens are highly correlated. For ex-ample, in English, many words begin with de∗, so d1 hashigh probability to occur with e2 simultaneously. Similarly,t4, i5, o6, and n7 are highly correlated (representing words???tion). To investigate correlation between keytokens, weanalyzed four large databases from the real world (see Foot-note 1) and all found the correlation feature among theirkeytokens. This property allows us to cluster highly cor-related keytokens so that we can first explore them whentraversing within a spanning binomial tree. The correlationallows us to have a higher probability to retrieve objects thanexploring other keytokens, and therefore to reduce searchcosts.

Based on this, we develop a search strategy called GuidedDepth-First Search (GDFS) to explore spanning bino-mial trees. When a node u is to traverse its children nodesin a depth-first style, it first asks its children nodes in thenext level to see how many matched objects they have. Thennode u collects the results, sorts them in a descending order,and then uses the order to decide which child node to tra-verse first. Each internal node in the tree uses the schemerecursively to traverse its subtree. Note that keytoken cor-relation is relatively stable. So the next-level information ofa node can be cached at the node and need not be inquiredin every search operation.

2.5 SummaryWe summarize some important features of KISS. First, we

observe that, even though we have extracted keytokens fromthe keyword set of an object and use the keytoken set to de-termine an index node for the object, the index node is stilldetermined uniquely by an object’s keyword set. So even ifseveral objects all contain some popular keywords, and thekeywords share some common prefixes with other keywords,the keytoken sets of these objects are still likely to differin some way, and so the objects are likely to be indexedby more than one node. The more the popularity of theprefixes/keywords, the more the number of objects contain-ing these prefixes/keywords, and so the more the number ofnodes to index the objects. So indexing load can be bal-anced even if prefix/keyword frequency follows Zipf’s lawwith sharp slope. Moreover, since there are a number ofnodes to index a prefix/keyword, no single node failure candeny all queries involving the prefix/keyword.

Secondly, each object is indexed by only one node, regard-less of how many keywords it has. So, unlike distributedinverted index, the scheme does not introduce extra cost toindex an object. Object insert, delete, and pin search there-fore take only one lookup operation in the P2P overlay, asopposed to about k× l operations needed by the distributedinverted index in combination with the n-gram techniqueintroduced in Section 1, where k is the number of keywordsan object has, and l is the average keyword length. Repli-cation certainly helps increase fault tolerance (at the cost ofextra storage and consistency maintenance), but this is upto the applications. If one wishes, replication can be done in

0

2000

4000

6000

8000

10000

12000

14000

16000

18000

20000

1 3 5 7 9 11 13 15 17 19 21 23

Size of keyword set

Num

ber

of o

bjec

ts

0%

20%

40%

60%

80%

100%

Cum

lativ

e pe

rcen

tage

0

1000

2000

3000

4000

5000

6000

7000

8000

1 5 9 13 17 21 25 29 33 37 41 45 49

Size of keytoken set

Num

ber

of o

bjec

ts

0%

20%

40%

60%

80%

100%

Cum

lativ

e pe

rcen

tage

Figure 6: The distribution of the sizes of keywordset (left) and keytoken set (right).

two ways. One is to deal with it directly in the index layer,for example, by building a secondary hypercube. The otheris to assume this function as part of the underlying DHToverlay, as many existing DHT overlays already have theirtechniques for replication and fault tolerance.

Finally, objects in KISS are easily distinguished by thecharacters and the positions of them in the keyword setsthey associate. In particular, given a prefix w, one can easilylocate nodes that index a word w concatenated with onemore specific character, with two more specific characters,and so on. So by exploring different search style, KISS caneasily facilitate some kind of ranking on objects, e.g., bysorting the matching keywords in lexicographical order orin alphabetical order.

3. EXPERIMENTAL RESULTSIn this section we present experimental results for evaluat-

ing the performance of KISS. We use the the website recordscollected at PChome (http://www.pchome.com.tw, a localportal in Taiwan) as our dataset. The dataset consists of131,180 website records in Chinese, where each record con-tains the following six fields: ID, Title, URL, Category, De-scription, and Keyword. We chose the dataset because Chi-nese is character-based and so prefix search is particularlyimportant.

Each record is treated as an object to be indexed in KISS,and the set of words in the Keyword field is treated as thekeyword set associating with the object. We decompose eachkeyword in the keyword set into keytokens as described inSection 2.1. The distribution of keyword set sizes is shownin Fig. 6, left. On average, each object is associated with 6.3keywords, and there are a total of 117,320 distinct keywords.The distribution of keytoken set sizes after tokenization isalso shown in Fig. 6. On average each object is associatedwith 15 keytokens, and so the average size of keytoken setincreases to about 2.4 times of the size of a keyword set.

Before the experiments, we need to determine the dimen-sionality r of the hypercube in KISS. It can be seen that ifr is too small, then many different keytokens are likely tobe hashed to the same bit position, and so the collision willbe high. On the other hand, a large r may result in a hugebut “sparse” index hypercube, as there will be 2r nodes,but many of them have little index load. This makes thesearch costs considerably high. So selecting an appropriater is crucial to the performance of the system. A detailedanalysis of the choice of r will be provided in the full paper.In the experiment we set r to 16.

In the first experiment we measure load distribution ofKISS. We built a hypercube of dimension r = 16, and as-signed each object in our data set a node in the hypercuberesponsible for indexing the object. Then we rank the loadof nodes from heavy to light, and determine the percent-

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

Cumulative Percentage of Nodes

Cum

ulat

ive

Per

cent

age

of L

oad

DII-16

KISS-16

DHT-16

Figure 7: Load distribution.

age of objects each node handles. The results are shown inFig. 7 on the line marked as “KISS-16”. For comparison,we observe that most P2P networks use some hash func-tions (e.g., SHA-1) to distribute objects to nodes, and theresults are generally considered as “well balanced”. So wealso draw the load distribution that simply uses a hash func-tion to distribute objects to nodes in the hypercube. Theline is marked as “DHT-16” as a benchmark. Note thatthe scheme is used only for reference; it cannot support anysearch operation.

In addition, we also studied the load distribution of thedistributed inverted index (DII) scheme discussed in Sec-tion 1. For this, we expand each keyword w with ||w|| − 1additional keywords taken from all possible prefixes of w.Then we built an inverted index list of the keywords (seeFig. 1). We then hash all the keywords to nodes to deter-mine which node is responsible for indexing which keyword.The node that is responsible for a keyword then maintainsthe list of objects that have the keyword. Note that in thisscheme an object may be indexed by more than one node.We measure a node’s load by the number of objects it in-dexes as a percentage of the sum of the numbers of objectsindexed by all nodes. The load distribution of this schemeis also drawn in Fig. 7 by the line “DII-16”.

From the figure we see that DII results in an extremelyunbalanced load. The top 10% nodes account for 85% ofthe system’s total loads! In contrast, KISS is significantlybetter than DII in terms of load balance. However, there issome room between KISS and DHT. This is because in KISS,nodes that have few bit-1’s are likely to index no objects.Future work will consider how to let the nodes share loadswith others.

In the second experiment, we study prefix search perfor-mance in KISS. We measure the number of nodes need tobe visited with respect to a given recall rate. To issue thequeries, we use query logs collected at PChome, and ran-domly sample some of them as our experimental queries.There are not many prefix queries, though. Therefore, weuse the following method to convert real-life keyword queriesto prefix queries: For each sampled query q, let Kp be theset of keywords issued in the query. If Kp is a singleton,we extract the prefix of length m from the keyword as ourprefix query, where m is a simulation parameter to be usedlater. If, however, Kp contains more than one keyword, say,w1, . . . , wk, then we averagely take prefixes of them, sayw′

1 w1, . . . , w′k wk, so that the total length of the pre-

fixes w′1, . . . , w

′k is m. Nevertheless, a majority of the queries

are single-keyword queries.We vary prefix length m = 2, 3, 4, 5, 6, and for each m

we use 500 sample prefix queries of the length to evaluatesearch performance of KISS under the following three searchstrategies: BFS, DFS, and GDFS+tree pruning (recall that

a spanning binomial tree can be pruned to eliminate searchpaths from null nodes). The results are shown in Fig. 8.

From the results we see that, in general, the longer thequeried prefix length, the smaller the number of nodes needto be visited. We note that each line is composed of somesegments. For example, the result of m = 2 in BFS has acurve from (0,0) to about (92%,25%), then to (100%,50%).To explain this, observe that the size of a spanning binomialtree induced by a node u is 2r−n, where n is the number ofbit-1’s u has. Let Tq be the keytoken set extracted from aquery q. Then the search space of q is determined by thenumber of different values h(t) maps to, t ∈ Tq (see Sec-tion 2.2). If |h(t) | t ∈ Tq| = n, then 2−n of the totalnodes need to be searched in the worst case. For a prefixquery q of length m, where m is small, it has high probabil-ity that Tq (which consists of m keytokens) will be mappedto a node with m bit-1’s. So the search space is 2−m ofthe network size. Since index load in KISS is sort of bal-anced, the number of nodes visited grows in proportion tothe recall rate. For example, if m = 2, then in the worstcase 25% of nodes need to be visited to reach 100% recallrate. However, due to hash collision, some prefix of length2 may be mapped to a root node with only one bit-1. Forthese queries, the search space vs. recall-rate line grows from(0, 0) to (50%,100%). By combining the two possible casesfor m = 2, we have the two-segment curve in Fig. 8. Theother cases for different m are similar.

We also note that on average GDFS with tree pruningperforms much better than DFS, which performs much bet-ter than BFS. It is not surprised to see that GDFS with treepruning outperforms both DFS and BFS (see Section 2.4).To see why DFS outperforms BFS, recall that a node withfewer bit-1’s is less likely to index an object as compared toa node with more bit-1’s. So when m is small, nodes at thebottom of a spanning binomial tree to be searched tend toindex more objects than nodes at the upper levels. So BFScauses more “empty” nodes to be visited in the early stageof search than DFS does. So BFS requires more nodes to bevisited than DFS with respect to the same recall rate.

4. RELATED WORKSearch has been one of the key issues in fully decentralized

P2P networks. When the network is Gnutella-like unstruc-tured, each node often locally maintains objects it sharesto the network. Since query messages are passed aroundnodes to check if they have the desired targets, a variety ofsearches (including, of course, prefix search) can be offered.However, since search is basically a blind process traversingaround the network, in general, search space size grows inproportion to recall rate. In the worst case, the entire net-work needs to be searched to retrieve an object. So searchis unscalable. The work of [12, 4] focuses specifically onkeyword search in unstructured P2P networks.

Structured P2P networks like DHTs basically support onlyexact name match, as objects are given a unique identifierobtained by hashing their names to determine their locationsin the network. Keyword search must be built on top of theoverlay to enhance search functionality. Several mechanismshave been proposed for keyword search in DHTs (e.g., [15,18, 6, 11]), but all of them, except [11] on which our workis built, use inverted index as the primary data structure.Moreover, they do not consider prefix search.

In addition to keyword search, range queries in struc-tured P2P networks have recently attracted much attention(e.g., [2, 17, 8, 1, 13, 5]). Range queries concern search ofobjects with keys in a given range. They are useful, forexample, when one wishes to search computing resources

BFS

0%

5%

10%

15%

20%

25%

30%

35%

40%

45%

50%

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

Recall Rate

Per

cent

age

of N

odes

Vis

ited m=2

m=3

m=4

m=5

m=6

DFS

0%

5%

10%

15%

20%

25%

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

Recall Rate

Per

cent

age

of N

odes

Vis

ited m=2

m=3

m=4

m=5

m=6

GDFS with tree pruning

0%

5%

10%

15%

20%

25%

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

Recall Rate

Per

cent

age

of N

odes

Vis

ited

m=2

m=3

m=4

m=5

m=6

Figure 8: (colored) Query performance of (a) BFS(b) DFS, and (C) GDFS. Note that BFS is on alarger scale than the other two.

with CPU clock ranging in between 2-4GHz. Range queriesare related to prefix search, as the latter can be translatedinto a range query covering all objects with a given pre-fix. There is subtle difference, however. Range queries oftenconcern one attribute at a time, and the research focus ison how to efficiently retrieve adjacent keys as well as to bal-ance load when key distribution may skew. Multi-attributerange queries [3], e.g., “search for computing resources with2-4GHz CPU clock, at least 1GB RAM, and disk space 80-200GB,” are considered more complicated, if one does notwish to tackle the problem one dimension at a time andthen take a join operation of the results. In contrast, pre-fix/keyword search naturally allows multiple words to be is-sued and, unlike distributed inverted index, our hypercubeindex scheme is able to tackle a query without the need ofa join operation among some potentially large object sets.

5. CONCLUSIONSWe have presented a simple and novel keytoken-based in-

dex and search scheme, KISS, for prefix search in struc-tured P2P networks. The idea is to tokenize every key-word associated with an object, and then hash the keyto-ken set into an r-bit vector. This r-bit vector representsa node in an r-dimensional hypercube. By mapping theconceptual hypercube to the underlying DHT network, wecan locate the object using only one lookup operation in

the network. Therefore, object insert, delete, and mainte-nance can be done very efficiently and effectively. Moreover,the index scheme allows nodes to evenly distribute theirloads in indexing popular keywords and prefixes. It also al-lows us to retrieve words that grow in length incrementallyand alphabetically—a key feature for implementing prefixsearch. This feature also supports some form of ranking as,by applying different exploring strategies, different orderingof objects can be retrieved. In contrast, a traditional ap-proach that uses distributed inverted index in combinationwith n-gram technique suffers from extremely unbalancedload, high maintenance costs, and unable to support rank-ing.

6. REFERENCES[1] J. Aspnes, J. Kirsch, and A. Krishnamurthy. Load

balancing and locality in range-queriable data structures.In PODC 2004, pages 115–124.

[2] B. Awerbuch and C. Scheideler. Peer-to-peer systems forprefix search. In PODC 2003, pages 123–132.

[3] A. R. Bharambe, M. Agrawal, and S. Seshan. Mercury:Supporting scalable multi-attribute range queries. InSIGCOMM 2004, pages 353–366.

[4] H. Cai and J. Wang. Foreseer: a novel, locality-awarepeer-to-peer system architecture for keyword searches. InMiddleware 2004, LNCS 3231, pages 38–58.

[5] A. Datta, M. Hauswirth, R. John, R. Schmidt, andK. Aberer. Range queries in trie-structured overlays. InP2P 2005.

[6] P. Ganesan, Q. Sun, and H. Garcia-Molina. Adlib: Aself-tuning index for dynamic peer-to-peer systems. InICDE’05, pages 256–257.

[7] G. H. Gonnet, R. A. Baeza-Yates, and T. Snider. Newindices for text: Pat trees and pat arrays. Informationretrieval: data structures and algorithms, pages 66–82,1992.

[8] A. Gupta, D. Agrawal, and A. E. Abbadi. ApproximateRange Selection Queries in Peer-to-Peer Systems. In CIDR2003, 2003.

[9] M. Harren, J. M. Hellerstein, R. Huebsch, B. T. Loo,S. Shenker, and I. Stoica. Complex queries in DHT-basedpeer-to-peer networks. In IPTPS 2002, pages 242–259.

[10] S. L. Johnsson and C.-T. Ho. Optimum broadcasting andpersonalized communication in hypercubes. IEEETransactions on Computers, 38(9):1249–1268, 1989.

[11] Y.-J. Joung, C.-T. Fang, and L.-W. Yang. Keyword searchin DHT-based peer-to-peer networks. In ICDCS 2005,pages 339–348.

[12] K. Nakauchi, Y. Ishikawa, H. Morikawa, and T. Aoyama.Peer-to-peer keyword search using keyword relationship. InGP2PC 2003, pages 359–366.

[13] S. Ramabhadran, S. Ratnasamy, J. M. Hellerstein, andS. Shenker. Brief announcement: Prefix hash tree. InPODC ’04, pages 368–368.

[14] S. Ratnasamy, P. Francis, M. Handley, R. Karp, andS. Shenker. A scalable content-addressable network. InSIGCOMM 2001, pages 161–172.

[15] P. Reynolds and A. Vahdat. Efficient peer-to-peer keywordsearching. In Middleware 2003, pages 21–40.

[16] G. Salton. Automatic Text Processing. TheTransformation, Analysis and Retrieval of Information byComputer. Reading, MA: Addison-Wesley, 1988.

[17] C. Schmidt and M. Parashar. Enabling flexible queries withguarantees in P2P systems. IEEE Internet Computing,8(3):19–26, 2004.

[18] S. Shi, G. Yang, D. Wang, J. Yu, S. Qu, and M. Chen.Making Peer-to-Peer Keyword Searching Feasible UsingMulti-level Partitioning. In IPTPS 2004, pages 151–161.

[19] I. Stoica, R. Morris, D. R. Karger, M. F. Kaashoek, andH. Balakrishnan. Chord: A scalable peer-to-peer lookupservice for Internet applications. In SIGCOMM 2001, pages149–160.

Global Document Frequency Estimationin Peer-to-Peer Web Search

Matthias Bender ? , Sebastian Michel ? , Peter Triantafillou , Gerhard Weikum ?

? Max-Planck-Institut fur Informatik RACTI and University of Patras

66123 Saarbrucken, Germany 26500 Rio, Greece

mbender,smichel,[email protected] [email protected] retrieval (IR) in peer-to-peer (P2P) networks,where the corpus is spread across many loosely coupledpeers, has recently gained importance. In contrast to IRsystems on a centralized server or server farm, P2P IR facesthe additional challenge of either being oblivious to globalcorpus statistics or having to compute the global measuresfrom local statistics at the individual peers in an efficient,distributed manner. One specific measure of interest is theglobal document frequency for different terms, which wouldbe very beneficial as term-specific weights in the scoring andranking of merged search results that have been obtainedfrom different peers.

This paper presents an efficient solution for the problemof estimating global document frequencies in a large-scaleP2P network with very high dynamics where peers can joinand leave the network on short notice. In particular, thedeveloped method takes into account the fact that the lo-cal document collections of autonomous peers may arbitrar-ily overlap, so that global counting needs to be duplicate-insensitive. The method is based on hash sketches as atechnique for compact data synopses. Experimental stud-ies demonstrate the estimator’s accuracy, scalability, andability to cope with high dynamics. Moreover, the benefitfor ranking P2P search results is shown by experiments withreal-world Web data and queries.

1. INTRODUCTIONIn recent years, distributed information retrieval systems

based on Peer-to-Peer (P2P) architectures are increasinglyreceiving attention [27, 29, 22, 1, 21, 31, 4, 13, 39]. The P2Papproach offers the ability to handle huge amounts of datain a highly distributed, self-organizing way and, thus, of-fer enormous potential for search engines powerful in termsof scalability, efficiency, and resilience to failures and dy-namics. Additionally, such a search engine can potentiallybenefit from the intellectual input (e.g., bookmarks, querylogs, etc.) of a large user community. Finally, but perhapseven more importantly, a P2P web search engine can alsofacilitate pluralism in informing users about internet con-tent, which is crucial in order to preclude the formation ofinformation-resource monopolies and the biased visibility of


content from economically powerful sources.

1.1 ProblemGiven the large-scale data distribution, one of the key

technical challenges is result merging, i.e., the process of ef-fectively combining local query results from different sources.While document scoring and ranking is a challenging prob-lem already in centralized systems, additional difficulty ina distributed environment stems from the fact that mostof the popular document scoring models, such as tf*idf or[35], use collection-specific statistical information for thispurpose. Most prominently, both use document frequencies(df), i.e. the number of documents in the collection thatcontain a query term1. The local usage of collection-specificdf values in these scoring models result in document scoresthat are incompatible across collections and, thus, make re-sult merging difficult. On the other hand, if global df valuescould be applied, the document scoring and ranking wouldbe ideal in the sense that it would be identical to the doc-ument ranking that would be produced by a hypotheticalcombined collection.

Early research on distributed information retrieval sys-tems typically assumed disjointly partitioned collections. Insuch a setting, the global df value is simply the sum overall local df values. Instead, we envision autonomous peersthat independently gather thematically focused collectionsthrough web crawls or similar techniques. In such a setting,studies show a skewed distribution of documents across thecollections, with popular documents contained in a largefraction of collections. Thus, summing up the df valuesacross collections would inevitably lead to biased df values(and, thus, document scores) [24], as popular documentsare repeatedly accounted for. Additionally, thematically fo-cused collections show a high variance of df values for thesame term (whereas randomly partitioned collections show arather uniform distribution of df values for the same term).This further increases the necessity of a score normalizationacross peers.

1.2 ContributionWe present a robust and scalable approach towards esti-

mating global df values using hash sketches [16]. We studythe general accuracy of hash sketches when used as synopsesto estimate document frequencies and we develop an effi-cient strategy to combine these hash sketch synopses acrosscollections in a way that does not incur any additional er-ror from combining them. We show the superiority of ourglobal df estimation technique compared to other techniques

1Note the difference to the notion of peer or collection fre-quencies that estimate the number of collections that con-tain a query term. The document frequency, instead, repre-sents the total number of distinct documents that contain aterm.

and present experimental evidence of the effectiveness im-provements in result merging stemming from this improvedknowledge. The experiments are conducted on real-wordweb data using our fully operational P2P Web search en-gine prototype.

The rest of the paper is organized as follows. Section 2discusses related work and gives general background on P2PIR. Section 3 introduces hash sketches as our technique formultiset cardinality estimation and discusses their generalapplication in our distributed environment. Section 4 in-troduces the design fundamentals that serve as a basis forour approach and discusses the extensions necessary to sup-port an overlap-aware global df estimation in the presenceof peers entering and leaving the system without prior no-tice. Section 5 presents an experimental evaluation of thegeneral accuracy of hash sketches as cardinality estimatorsand of the accuracy of our approach from different angles.Finally, Section 6 concludes this work and points at futureresearch directions.

2. RELATED WORK

2.1 Estimating Set CardinalitiesEstimating overlap of sets has been receiving increasing

attention for modern emerging applications, such as datastreams, internet content delivery, etc. [6] describes a per-mutation-based technique for efficiently estimating set sim-ilarities for informed content delivery. [18] proposes a hash-based synopsis data structure and algorithms to supportlow-error and high-confident estimates for general set ex-pressions. Bloom [5] describes a data structure for succinctlyrepresenting a set in order to support membership queries;[10] present an extension for dealing with multisets, but stillfocuses on membership queries rather than cardinality esti-mation. [22] proposes a gossip-based protocol for comput-ing aggregate values in a fully decentralized fashion. [28]addresses communication topology issues for distributed ag-gregation and identifying frequent items in a network. [11]develops a sketch-based framework for distributed estima-tion of query result cardinalities, but does not consider du-plicates. None of [22, 28, 11] addresses the elimination ofoverlap.

Recently and independently, [12] proposed an approachsimilar to ours which aims at duplicate-insensitive countingin sensor networks. However, they have not published anyresults regarding the resilience to churn of their method, norhas it been applied to an IR scenario.

2.2 Peer-to-Peer ArchitecturesRecent research on P2P systems, such as Chord [38], CAN

[34], Pastry [36], or P-Grid [2] is typically based on vari-ous forms of distributed hash tables (DHTs) and supportsmappings from keys, e.g., titles or authors, to locations ina decentralized manner such that routing scales well withn, the number of peers in the system. Typically, an exact-match key lookup can be routed to the proper peer(s) in atmost O(log n) hops, and no peer needs to maintain morethan O(log n) routing information. These architectures canalso cope well with failures and the high dynamics of a P2Psystem as peers join or leave the system at a high rate andin an unpredictable manner.

2.3 Distributed IR and Web SearchMany approaches have been proposed for distributed IR,

most notably, CORI [9], the decision-theoretic framework[32], the GlOSS method [19], and methods based on statis-tical language models [37]. Recently, there has been researchtowards overlap aware resource selection methods [20, 3, 30]that consider the mutual overlap between peers during the

selection process but do not consider global df estimation.Galanx [40] is a P2P search engine implemented using

the Apache HTTP server and BerkeleyDB. The Web siteservers are the peers of this architecture; pages are storedonly where they originate from, thus forming an overlapfree network. PlanetP [13] is a publish-subscribe service forP2P communities, supporting content ranking search. Theglobal index is replicated using a gossiping algorithm. Odis-sea [39] assumes a two-layered search engine architecturewith a global index structure distributed over the nodes inthe system. It actually advocates using a limited number ofnodes, in the spirit of a server farm. None of this prior workconsider the problem of estimating the global df value, forpeers with overlapping local contents.

2.4 Result MergingFor cooperative environments Kirsch’s algorithm [23] pro-

poses to collect local statistics from the selected databasesto normalize document scores. [26, 29] uses a centralizeddatabase of collection samples, which is incompatible withour architectural vision and seems infeasible in the presenceof high network dynamics. [7] gives an overview of algo-rithms for distributed IR style result merging and databasecontent discovery. None of the presented techniques incor-porates overlap detection between the peers into the mergingprocess.

Result merging techniques for topically organized collec-tions were studied in [24]. Experiments showed that globalidf scores is the most desirable method, but they consid-ered neither real-world Web pages nor overlap between col-lections. [33] incorporates an estimated number of globaloccurrences of the same document into the result mergingprocess, but does not estimate the global number of docu-ments that contain a specific term.

3. MULTISET CARDINALITY ESTIMATIONUSING HASH SKETCHES

Estimating the global document frequency for a giventerm would be straightforward if peers had pair-wise dis-joint local collections. The global collection is the union ofall local collections, and the disjointness would allows us tosimply sum up all local document frequencies for the sameterm. We will discuss the resulting communication and sys-tem aspects in Section 4. However, with non-disjoint lo-cal collections, computing their union essentially producesa multiset (bag) with duplicates. If we had the full doc-ument ids of all items in the multiset, we could eliminateduplicates by sorting or hashing and subsequently count thedistinct items. But this approach is expensive on large mul-tisets with all documents explicitly represented. We wouldrather prefer an approach where each local collection is rep-resented by a compact synopsis, with a small and control-lable approximation error.

This section introduces such a synopsis, namely, hashsketches [16], and shows how to employ them for our goal.When we form the union of several synopses, originatingfrom different peers, we face again the problem of how todiscount duplicates in the mulitset synopses. We will showin this section how this duplicate-aware multiset-countingproblem is elegantly solved by our approach based on hashsketches, and we demonstrate the low approximation error.

3.1 Hash SketchesHash sketches were first proposed by Flajolet and Martin

in [16] to probabilistically estimate the cardinality of a mul-tiset S. [18] proposes a hash-based synopsis data structureand algorithms to support low-error and high-confident es-timates for general set expressions. Hash sketches rely onthe existence of a pseudo-uniform hash function h() : S →

[0, 1, . . . , 2L). Durand and Flajolet presented a similar al-gorithm in [15] (super-LogLog counting) which reduced thespace complexity and relaxed the required statistical prop-erties of the hash function.

Briefly, hash sketches work as follows. Let ρ(y) : [0, 2L) →[0, L) be the position of the least significant (leftmost) 1-bitin the binary representation of y; that is,

ρ(y) = mink≥0

bit(y, k) 6= 0, y > 0

, and ρ(0) = L. bit(y, k) denotes the k-th bit in the binaryrepresentation of y (bit-position 0 corresponds to the leastsignificant bit). In order to estimate the number n of dis-tinct elements in a multiset S we apply ρ(h(d)) to all d ∈ Sand record the least-significant 1-bits in a bitmap vectorB[0 . . . L − 1]. Since h() distributes values uniformly over[0, 2L), it follows that P (ρ(h(d)) = k) = 2−k−1.

Thus, when counting elements in an n-item multiset, B[0]will be set to 1 approximately n

2times, B[1] approximately

n4

times, etc. Then, the quantity R(S) = maxd∈Sρ(d) pro-vides an estimation of the value of log2 n. The authors in[16, 15] present analyses and techniques to bound from abovethe error introduced.Techniques which provably reduce thestatistical estimation error typically rely on employing mul-tiple bitmaps for each hash sketch, instead of only one. Theoverall estimation then is an averaging over the individualestimations produced using each bitmap.

3.2 Combining Hash SketchesHash sketches offer duplicate elimination ”for free”, or in

other words, they allow counting distinct elements in mul-tisets. Estimating the number of distinct elements (e.g.,documents) of the union of an arbitrary number of multi-sets (e.g., distributed and autonomous collections) - eachrepresented by a hash sketch synopsis - is easy by design: asimple bit-wise OR-operation over all synopses yields a hashsketch for the combined collection that instantly allows us toestimate the number of distinct documents of the combinedcollection.

More formally, we can derive the following distributivitytheorem:

Theorem 1. Let β(S) be the set of bit positions ρ(h(d))for all d ∈ S. Then β(S1 ∪ S2) = β(S1) ∪ β(S2).

The proof follows directly from the definitions of ρ andβ. The corresponding bit in the resulting combined hashsketch will be set if and only if at least one of the elementsin one of the original sets had set this bit. Particularly noticethat, if more than one set holds this element, the elementwill conceptually be counted only once, effectively removingduplicates.

3.3 Application to Global DF EstimationThe above methods can be employed for the purpose of

global df estimation as follows. Assume that each peer,given its collection, prepares hash sketches, one for each setof documents that contain a specific term (i.e., its indexlist for that term). The network-wide combination of allhash sketches for a specific term, thus, yields an estimatefor the number of distinct elements in the union of the setsrepresented by their synopses, i.e., for the number of distinctdocuments that contain the given term. This is the globaldocument frequency for that term.

4. OVERLAP-AWARE DF ESTIMATION

4.1 Design Fundamentals

We have implemented MINERVA2, a fully operationalP2P Web search engine building on the following design fun-damentals [3, 30, 4]. We consider a P2P network in whichevery peer is autonomous and has a local index that canbe built from the peer’s own crawls or imported from ex-ternal sources and tailored to the user’s thematic interestprofile. The index contains inverted lists with URLs forWeb pages that contain terms. A conceptually global butphysically distributed directory, which is layered on top ofa distributed hash table (DHT), holds only very compact,aggregated meta-information about the peers’ local indexesand only to the extent that the individual peers are willingto disclose. As part of the DHT, every peer is responsiblefor the meta-information of a randomized subset of termswithin the global directory. For failure resilience and avail-ability, the entry for a term may be replicated across mul-tiple peers. The DHT offers a lookup method to determinethe peer responsible for a particular term.

Every peer publishes per-term summaries (Posts) of itslocal index to the directory. The DHT hash function deter-mines the directory peer currently responsible for this term.This peer maintains a PeerList of all Posts for this term fromacross the network. Posts contain contact information aboutthe peer who posted this summary together with statisticsto calculate IR-style measures for a term (e.g., the size of theinverted list for the term, the maximum and average scoreamong the term’s inverted list entries, or some other statisti-cal measure). These statistics are used to support the queryrouting process, i.e., determining the most promising peersfor a particular query. To deal with the high dynamics in aP2P network, each Post is assigned a Time-to-Live (TTL)value. If the originator peer has not updated (refreshed) itsPost after this time interval, it is discarded.

The querying process for a multi-term query proceeds asfollows: the query initiator retrieves a list of potentially use-ful peers by issuing a PeerList request for each query termto the underlying overlay network. A number of promis-ing peers for the complete query is computed from thesePeerLists. Subsequently, the query is forwarded to thesepeers and executed based on their local indexes. Finally, theresults from the various peers are combined at the queryingpeer into a single result list; this step is referred to as resultmerging and would enormously benefit from the knowledgeof global df values.

Note that this design is DHT agnostic, since it utilizesonly a DHT’s lookup/routing function and thus enjoys wideapplicability.

4.2 Accommodating DF MetadataGiven the system design introduced above with a hash-

based assignment of terms to responsible directory peers, itis very natural for these directory peers to maintain addi-tional data that supports the global df estimation for theterms they are responsible for. When publishing the term-specific Posts about the local collection, we propose thatevery peer includes a hash sketch representing its index listfor the respective term in its (term-specific) Post, so thateach directory peer can compute an estimate for the globaldf values for the terms it is responsible for using the com-bination method introduced in Section 3.2. Thus, the hashsketch synopses representing the index lists of all peers fora particular term are all sent to the same directory peerresponsible for this term. This peer can, by means of in-expensive bit-wise operations, calculate an estimate for theglobal df, for the terms it is responsible for, from these syn-opses. Note the importance of utilizing compact synopsesfor this goal, such as hash sketches, which introduce smallbandwidth and storage requirements.

2http://www.minerva-project.org

The query initiator collects the df estimates at query timeas piggybacked information when retrieving the PeerListsfrom the directory peers during the query routing phase.Remember that the df estimate for a particular term ismaintained at the same peer that maintains the respectivePeerList, so that the peers that hold the gdf estimated forthe query terms are the very same peers that are contactedanyway in order to retrieve the respective PeerLists. Thequery initiator can then attach the current gdf estimates tothe query message when sending the query to the selectedpeers. These remote peers can use the estimates on-the-flyas weights during their index scans to compute their localquery results.

Note that it is not a design choice to let the remote peerssimply return unnormalized (“objective”) scores (e.g., basedon tf values only) and then let the query initiator do there-calibration using gdf estimates. In that case, the localquery execution at the remote peers may already miss someof the desired (i.e., globally best) results. For example, high-scoring documents for the terms with low gdf (i.e., high idf)may not be returned at all in that case.

Note that the performance of the local query execution it-self is not affected at all by the necessary online score recal-ibration: if index lists are created on (doc id, score)-tuples(where now the score item does no longer include a locally bi-ased df component, but some possibly normalized derivateof the tf value), index lists can easily be sorted by thesescores and index list scanning can be performed as usual.One extra computational operation is required for each listitem to compute the final (term-) score for this item (nor-malization using the global df estimate). In this case, theorder of items in an index list does not change, as all scoresin a list are re-weighted by the same df value (monotonicityapplies). Thus, all index structures and performance accel-eration techniques work without special adaptation.

4.3 Cost AnalysisMost of the network cost is caused during the posting

process, i.e., when a peer publishes its per-term metadata.Conceptually, each Post consists of the term it represents,an IP address and port number, plus collection-specific sta-tistical information (e.g., collection size) and term-specificstatistical information (e.g., document frequency and max-imum term frequency). In our prototype, such a Post onaverage accounts for approximately 50 bytes. Our exper-iments have shown that a hash sketch with a reasonablysmall number of 8-byte bitmaps, e.g., 64 bitmaps, allows agood estimation for our purposes. Such a hash sketch re-quires 64 ∗ 8 = 512 bytes, i.e., it fits easily in the same TCPpacket that is needed anyway to send the metadata itself tothe responsible directory peer. Thus, the number of mes-sages necessary to disseminate the Posts does not increase.

Where applicable, we use batching of Posts (for termsthat have the same directory peer) to further decrease thenumber of messages. For all messages, we can additionallyapply gzip compression to additionally decrease the messagepayload size. Obviously, the network cost caused by themetadata publication additionally depends on the Time-to-Live interval of the metadata, i.e., the time span after whichthe metadata has to be refreshed. We report on actual trafficmeasured in the course of our experimental evaluation inSubsection 5.2.

After the dissemination of the Posts, peers executing aquery perform PeerList requests to retrieve a list of peersthat have published statistics about the specific query terms.Note that the cost of this PeerLists retrieval does not changesignificantly, as the hash sketches themselves are not trans-ferred back to the PeerList requester. Instead, as the dfestimation is conducted at the directory peer, only one ad-ditional value representing the current df estimate has to

be included in the answer to a PeerList request. The sameholds true for the actual query execution; when sending thequery to the selected peers, just one additional df value perquery term has to be transferred.

The storage cost at the directory peers storing the Postsis also directly dependent on the number of Posts, the sizeof a Post, and the size of a hash sketch. In a network withn peers storing Posts of m distinct terms, each peer is re-sponsible for an expected number of m/n PeerLists. Forexample, in a system with 50, 000 terms and 10, 000 peers,each peer is responsible for the maintenance of an averageof 5 PeerLists. This number decreases even further as moreand more peers join the system, because they typically donot add a significant number of previously unseen terms. Ina worst case scenario (every peer has posted information forall terms), a directory peer would thus be responsible for50.000 Posts or 28.1 MB (including all hash sketches) foreach peer list, which we consider a reasonably small stor-age effort. Remember from the previous subsection that,alternatively, the directory peer does not have to store allhash sketches sent together with the Posts, but can aggre-gate them immediately using our sliding-window approach- at the cost of increased network traffic.

The additional computational cost incurred by addinghash sketches to the posting process is also negligible. Fornearly no additional cost, the peer that receives the hashsketches for a particular term can combine these in an itera-tive manner by simple bit-wise OR operations of bit vectors.

5. EXPERIMENTS5.1 Cardinality Estimation using Hash Sketches

To study the accuracy and the robustness of hash sketchesas set cardinality estimators, we have performed a series of100 runs, each for 256 8-byte bitvectors per sketch and dif-ferent set sizes. The document sets are randomly createdfor each run from a sufficiently large domain. We reporton the accuracy of the cardinality estimation using accu-racy, i.e., the ratio estimated cardinality

true cardinality. As shown in Figure

1, the estimation works very well. On average (over 100runs each), the accuracy is close to 100%, the standard de-viation is sufficiently small with quartile errors around 5%.The plotted quartiles show the robustness of the approachwhich, as expected [16, 15], becomes better as more bitmapsare used per hash sketch. Notice also that the errors in bothfigures tend to become smaller for larger sets; this indicatesthat hash sketches will work very well in our intended en-vironment (i.e., a large-scale system with a high number ofdocuments).

5.2 DF Estimation in the Presence of ChurnWhile the previous experiments assumed a static setup

of peers and their hash sketches, we now want to evaluatethe accuracy of our approach in the presence of network dy-namics. For this purpose, we consider a model of arrivalsand departures as outlined in [25], where nodes arrive ac-cording to a Poisson process with rate λ, while a node inthe system departs according to an exponential distributionwith rate parameter µ. Resulting in a system of about 1, 000peers at a time, we assume time units of 10 min and chooseλ = 3 and µ = 0.002 and fix the interval at which peersrefresh their statistics at 6 time units (60 min). For simplic-ity, we further set the Time-to-live for all statistics also to 6time units. In a real world scenario, one could argue to in-crease the TTL to cope with network latencies and networkfailures, such that Posts in the directory survive one failedrefresh attempt. Each peer randomly picks 1,000 documentsfrom a domain of 2,000,000 documents. We use 256-bitmaphash sketches for our evaluation.

0.8

0.85

0.9

0.95

1

1.05

1.1

1.15

1.2

0 20000 40000 60000 80000 100000

Accura

cy

Number of Documents

256 Bitmaps

1. Quartile

Median3. Quartile

Figure 1: Cardinality Estimation Accuracy (256-bitmap Hash Sketches)

Figure 2 plots the document frequency estimates obtainedby our approach together with the true document frequency.While intuitively, the approach should tend to overestimatethe number of documents in the system, because metadataof peers that have recently left the system hang around forsome time before they time out, in practice our experimentsdon’t clearly show this behavior. This is due to the factthat the hash sketches themselves show a certain degree ofvariance that overrules the (usually small) conceptual errorsof the approach. Nevertheless, the approach has been shownto be robust against churn.

Regarding the network traffic caused by the experimentunder the above assumptions with only one term per peer,we can report an average bandwidth consumption of onlyless than 11 kilobytes per peer per hour (no gzip compres-sion applied). Even for typical numbers of terms per peer(50,000-100,000), this is within todays bandwidth capabili-ties.

0

200000

400000

600000

800000

1e+006

1.2e+006

1.4e+006

0 1000 2000 3000 4000 5000

Nu

mb

er

of

Do

cu

me

nts

Time

Quality of df Estimation in the Presence of Churn

realestimated

Figure 2: df Estimation Accuracy (256-bitmap HashSketches)

5.3 Improving Result MergingFor this experiment we use real-world Web data from

10 topically focused collections harvested by a focused webcrawler. In order to create a benchmark, we have spliteach topical collection into 4 fragments. We create 40 peerssuch that each peer hosts 3 out of 4 fragments from thesame topic, thus creating high overlap among same-topicpeers. As query load, we 30 popular Google queries taken

from Zeitgeist3. We use CORI [7] as a common query rout-ing strategy and compare 4 different result merging strate-gies.The document scores are based on collection-specific(i.e., “local”) df values or our global df values and are nor-malized using their respective weighted CORI peer score(from the query routing phase) or not. For this normaliza-tion, more specifically, we use the norm-dbs method used bythe INQUERY framework [8], that re-computes documentscores as score = (D + 0.4 × Cnorm × D)/1.4. As a rankdistance, we use Spearman’s footrule distance [14]., definedas F (σ1, σ2) =

Pi |σref (i) − σpeers(i)| where σref (i) is the

rank of document i in the reference ranking and σpeers(i) isthe position of document i in the peers’ document ranking.If a document from σpeers is not in σref we assign a fictitiousrank (k + 1). We normalize all distances by 1− distance

maxdistanceto obtain a normalized quality measure.

Figure 3 shows the results for the 40-peers benchmark.With local query execution based on global df values, theranking quality is remarkably above the quality obtained bythe CORI-based merging methods. In particular, three outof the four methods do not even come close to the optimaldocument ranking at all, even if all 40 peers are involved inthe query. This is due to the fact that the document scoresbased on local df values are incomparable across the peersand, thus, documents that are not in the reference top-20document ranking are pushed in by skewed local df scoresat the peers.

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 5 10 15 20 25 30 35

Qualty o

f R

ankin

g

Number of Queried Peers

40 Thematically Focused Peers + Google Zeitgeist Queries

Document score merger (global)Document score merger (local)

CORI merger (global)CORI merger (local)

Figure 3: Quality of Query Results (40 Peers)

6. CONCLUSION AND OUTLOOKThis paper has developed and evaluated a novel and ef-

ficient algorithm to estimate global document frequenciesin large-scale dynamic P2P networks. Our algorithm uti-lizes compact synopses based on hash sketches, which canbe combined from an arbitrary number of autonomous dis-tributed sources without incurring additional error. To ourknowledge, this is the first approach to this problem that cancope with arbitrarily overlapping collections without addi-tional effort. We study the network and storage require-ments and present a detailed study of the accuracy of hashsketches for our specific purpose for static and dynamic net-works.

We point out that the main focus of this paper is not toquantify the effect that the knowledge of global df values canhave on result merging. The corresponding experiment isonly preliminary, but nevertheless indicates the potential forimprovements. While this effect has already been observed

3www.google.com/press/zeitgeist.html

in the literature [24], more comprehensive experiments onresult merging in P2P networks are subject of future work.

Our approach can be generalized to all forms of distributedsystems that can benefit from global counting with dupli-cate elimination, e.g., cardinality estimations in distributeddatabase systems.

7. REFERENCES[1] K. Aberer, P. Cudre-Mauroux, M. Hauswirth, and

T. V. Pelt. Gridvine: Building internet-scale semanticoverlay networks. In ISWC 2004.

[2] K. Aberer, M. Punceva, M. Hauswirth, andR. Schmidt. Improving data access in p2p systems.IEEE Internet Computing, 6(1):58–67, 2002.

[3] M. Bender, S. Michel, P. Triantafillou, G. Weikum,and C. Zimmer. Improving collection selection withoverlap awareness in p2p search engines. In SIGIR05,Salvador, Brasil, 2005. ACM.

[4] M. Bender, S. Michel, P. Triantafillou, G. Weikum,and C. Zimmer. Minerva: Collaborative p2p search. InVLDB, 2005.

[5] B. H. Bloom. Space/time trade-offs in hash codingwith allowable errors. Commun. ACM, 13(7), 1970.

[6] J. Byers, J. Considine, M. Mitzenmacher, and S. Rost.Informed content delivery across adaptive overlaynetworks. SIGCOMM, 2002.

[7] J. Callan. Distributed information retrieval. Advancesin information retrieval, Kluwer Academic Publishers.,2000.

[8] J. P. Callan, W. B. Croft, and J. Broglio. Trec andtipster experiments with inquery. Inf. Process.Manage., 31(3), 1995.

[9] J. P. Callan, Z. Lu, and W. B. Croft. Searchingdistributed collections with inference networks. InSIGIR, 1995.

[10] S. Cohen and Y. Matias. Spectral bloom filters. InSIGMOD 2003.

[11] G. Cormode and M. N. Garofalakis. Sketching streamsthrough the net: Distributed approximate querytracking. In VLDB, 2005.

[12] G. Cormode, S. Muthukrishnan, and W. Zhuang.What’s different: Distributed, continuous monitoringof duplicate-resilient aggregates on data streams. InICDE, 2006.

[13] F. M. Cuenca-Acuna, C. Peery, R. P. Martin, andT. D. Nguyen. Planetp: Using gossiping to buildcontent addressable peer-to-peer information sharingcommunities. In HPDC, 2003.

[14] P. Diaconis and R. Graham. Spearman’s footrule as ameasure of disarray. Journal of the Royal StatisticalSociety, 1977.

[15] M. Durand and P. Flajolet. Loglog counting of largecardinalities. In G. Di Battista and U. Zwick, editors,ESA03, volume 2832 of LNCS, Sept. 2003.

[16] P. Flajolet and G. N. Martin. Probabilistic countingalgorithms for data base applications. Journal ofComputer and System Sciences, 31(2), 1985.

[17] N. Fuhr. A decision-theoretic approach to databaseselection in networked IR. ACM Transactions onInformation Systems, 17(3), 1999.

[18] S. Ganguly, M. Garofalakis, and R. Rastogi.Processing set expressions over continuous updatestreams. In SIGMOD, 2003.

[19] L. Gravano, H. Garcia-Molina, and A. Tomasic. Gloss:text-source discovery over the internet. ACM Trans.Database Syst., 24(2), 1999.

[20] T. Hernandez and S. Kambhampati. Improving textcollection selection with coverage and overlap

statistics. pc-recommended poster. WWW 2005. Fullversion available athttp://rakaposhi.eas.asu.edu/thomas-www05-long.pdf.

[21] S. Idreos, M. Koubarakis, and C. Tryfonopoulos.P2p-diet: An extensible p2p service that unifiesad-hoc and continuous querying in super-peernetworks. In SIGMOD, 2004.

[22] M. Jelasity, A. Montresor, and O. Babaoglu.Gossip-based aggregation in large dynamic networks.ACM Trans. Comput. Syst., 23(1):219–252, 2005.

[23] S. Kirsch. Document retrieval over networks whereinranking and relevance scores are computed at theclient for multiple database documents. US patent5,659,732, 1997.

[24] L. S. Larkey, M. E. Connell, and J. P. Callan.Collection selection and results merging with topicallyorganized u.s. patents and TREC data. In CIKM2000.

[25] D. Liben-Nowell, H. Balakrishnan, and D. Karger.Analysis of the evolution of peer-topeer systems, 2002.

[26] J. Lu and J. Callan. Merging retrieval results inhierarchical peer-to-peer networks. In SIGIR, 2004.

[27] J. Lu and J. P. Callan. Content-based retrieval inhybrid peer-to-peer networks. In CIKM, pages199–206, 2003.

[28] A. Manjhi, S. Nath, and P. B. Gibbons. Tributariesand deltas: Efficient and robust aggregation in sensornetwork streams. In SIGMOD 2005.

[29] W. Meng, C. Yu, and K.-L. Liu. Building efficient andeffective metasearch engines. ACM Comput. Surv.,34(1):48–89, 2002.

[30] S. Michel, M. Bender, P. Triantafillou, andG. Weikum. Iqn routing: Integrating quality andnovelty in p2p querying and ranking. In EDBT, pages149–166, 2006.

[31] H. Nottelmann, G. Fischer, A. Titarenko, andA. Nurzenski. An integrated approach for searchingand browsing in heterogeneous peer-to-peer networks.In HDIR 2005.

[32] H. Nottelmann and N. Fuhr. Evaluating differentmethods of estimating retrieval quality for resourceselection. In SIGIR, 2003.

[33] O. Papapetrou, S. Michel, M. Bender, and G. Weikum.On the usage of global document occurrences inpeer-to-peer information systems. In COOPIS 2005.

[34] S. Ratnasamy, P. Francis, M. Handley, R. Karp, andS. Schenker. A scalable content-addressable network.In SIGCOMM 2001.

[35] S. E. Robertson and S. Walker. Some simple effectiveapproximations to the 2-poisson model forprobabilistic weighted retrieval. In SIGIR, 1994.

[36] A. Rowstron and P. Druschel. Pastry: Scalable,decentralized object location, and routing forlarge-scale peer-to-peer systems. In Middleware 2001.

[37] L. Si, R. Jin, J. Callan, and P. Ogilvie. A languagemodeling framework for resource selection and resultsmerging. In CIKM 2002.

[38] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, andH. Balakrishnan. Chord: A scalable peer-to-peerlookup service for internet applications. In SIGCOMM2001.

[39] T. Suel, C. Mathur, J. wen Wu, J. Zhang, A. Delis,M. Kharrazi, X. Long, and K. Shanmugasundaram.Odissea: A peer-to-peer architecture for scalable websearch and information retrieval. In WebDB, 2003.

[40] Y. Wang, L. Galanis, and D. J. de Witt. Galanx: Anefficient peer-to-peer search engine system. Availableat http://www.cs.wisc.edu/ yuanwang.