

I. Lovrek, R.J. Howlett, and L.C. Jain (Eds.): KES 2008, Part I, LNAI 5177, pp. 633–640, 2008. © Springer-Verlag Berlin Heidelberg 2008

Scalable Content-Based Ranking in P2P Information Retrieval

Maroje Puh1, Toan Luu2, Ivana Podnar Zarko1, and Martin Rajman2

1 University of Zagreb, Faculty of Electrical Engineering and Computing, Zagreb, Croatia
2 Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland

{Maroje.Puh,Ivana.Podnar}@fer.hr, {VinhToan.Luu,Martin.Rajman}@epfl.ch

Abstract. Numerous retrieval models have been defined within the field of information retrieval (IR) to produce a ranked and ordered list of documents relevant to a given query. Existing models are in general well-explored and thoroughly evaluated using traditionally centralized IR engines. However, the problem of producing global relevance scores to enable document ranking in peer-to-peer (P2P) IR systems has largely been neglected. Traditional ranking models in general require global document collection metrics such as document frequency, average document length, or the number of collection documents, which are not readily available in P2P IR systems. In this paper, we present a scalable solution for content-based ranking using global relevance scores in P2P IR systems that has been implemented as a part of ALVIS PEERS, a full-text IR engine developed for structured P2P networks. The provided experimental results show efficient and scalable performance of the proposed ranking implementation.

Keywords: P2P, Information retrieval, Content-based ranking.

1 Introduction

As the amount of web content is continuously growing and changing, it becomes more important to design and deploy widely-distributed and decentralized search engines that can efficiently operate in such dynamic environments. State-of-the-art search engines are currently centralized and optimized for highly-responsive query answering using huge document indexes distributed over large proprietary clusters. Although such systems enable highly-efficient information access to millions of users, the amount of indexed documents currently represents a small fraction of the constantly growing web data. Centralized engines have difficulty scaling with the growing web size and constantly changing content [1]. Therefore, research efforts are currently directed to designing distributed and decentralized open-source retrieval systems [2]. Peer-to-peer (P2P) technology has become an appealing architecture for widely-distributed IR systems due to its properties, such as decentralization, self-organization, and resource sharing [3]. Search engines are designed to efficiently find documents relevant to a user query, while the quality of retrieved documents depends on the ability of the information provided by retrieved documents to satisfy user information needs. Various IR models and ranking techniques have been developed over the years that aim at achieving a better quality of the retrieved ordered list of documents. State-of-the-art centralized search engines have successfully implemented existing models; however, the process of adopting existing ranking techniques in distributed environments is not straightforward because of the unavailability of global collection statistics that are needed for ranking computation. Furthermore, P2P solutions can potentially induce high and unscalable traffic during the ranking process.

The paper presents the ranking technique implemented as part of ALVIS PEERS, a fully-functional IR search engine which uses a structured P2P overlay for building a distributed inverted index for large document collections [4]. Alvis uses a novel retrieval model based on indexing with Highly Discriminative Keys (HDKs), i.e., terms and sets of terms occurring in a limited number of documents. HDKs may be seen as highly-selective multiterm queries associated with precomputed answer sets, which enable efficient retrieval because the associated posting lists are short.

Section 2 of this paper presents the related work. Section 3 describes the architecture of the ALVIS search engine and the HDK indexing approach. The design and implementation of the content-based ranking component integrated in ALVIS is presented in Section 4, with its performance analysis given in Section 5. Section 6 concludes the paper and presents future work.

2 Related Work

Distributed content-based ranking techniques depend on the indexing strategy used in the system. Two basic indexing strategies in P2P IR networks are federated local indexes used in unstructured P2P networks and the global index used in structured P2P networks [2].

In the federated local index approach, disjoint subsets of a global document collection are hosted on the peers and each peer is an independent search engine with its own local index. In such networks flooding is used to locate the data, resulting in high bandwidth costs and no guarantee that all relevant nodes will eventually be reached. To decrease the bandwidth consumption during the query phase, advanced approaches use two-level querying: peers are independent search engines with local indexes, while the network or special nodes maintain a global peer index that is smaller and easier to maintain than a global single-term index. An example of a search engine using this combined indexing strategy is the Minerva project [5].

In the global index approach, the structured P2P overlay network maintains a global index and each peer in the network is responsible for maintaining a disjoint part of the global index. Structured P2P networks enable an efficient resource lookup process by employing different strategies, one of which is the distributed hash table (DHT). ODISSEA, a P2P architecture for Web search [6], is an example of this indexing strategy.

3 The Alvis P2P Search Engine

This section explains the HDK-based indexing approach and presents the architecture of the P2P retrieval engine ALVIS PEERS [4]. The major obstacle for implementing P2P full-text retrieval is unscalable network bandwidth consumption, caused by transmissions of long posting lists among peers when processing queries in a P2P system. To overcome this obstacle, the HDK-based indexing approach [7] has been introduced; it results in shorter posting lists while still achieving retrieval quality comparable to that of a centralized environment. Instead of indexing with single terms, this approach truncates large posting lists to a constant size, while compensating for the resulting loss of information by additionally indexing carefully selected combinations of terms. Consequently, the index contains a larger number of index entries, all of which are associated with short posting lists. Sets of terms (we call them keys) forming an index entry have to occur simultaneously in a single document within a window of predefined size. If such a key occurs in fewer than DFmax documents (DFmax is a parameter of our model), it is considered discriminative w.r.t. the document collection, i.e., it is an HDK. If a key occurs in more than DFmax documents, the index stores only the top-DFmax ranked documents, and the key becomes a candidate to be extended with another term to create a new HDK.
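The window constraint and the DFmax test described above can be illustrated with a minimal Python sketch. The function names (`candidate_keys`, `is_discriminative`) and the sliding-window interpretation are assumptions made for illustration; the key generation procedure actually used in ALVIS PEERS is described in [7] and may differ.

```python
from itertools import combinations


def candidate_keys(tokens, window, size):
    """Return all term sets of `size` terms that co-occur within
    `window` consecutive tokens of one document (hypothetical helper)."""
    keys = set()
    for start in range(len(tokens)):
        span = set(tokens[start:start + window])
        for combo in combinations(sorted(span), size):
            keys.add(frozenset(combo))
    return keys


def is_discriminative(doc_freq, df_max):
    """A key is treated as an HDK when it occurs in fewer than df_max
    documents. The text does not specify the doc_freq == df_max case;
    this sketch treats it as non-discriminative."""
    return doc_freq < df_max


doc = ["peer", "index", "ranking", "query"]
pairs = candidate_keys(doc, window=2, size=2)
```

With a window of two tokens, only adjacent terms form two-term keys, so `{"peer", "ranking"}` is not generated for the document above.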

We assume that each peer participating in the P2P IR engine contributes a set of local documents which constitute a part of the global document collection. From the local point of view, each peer indexes its local documents, i.e., it computes keys and associated posting lists for its local collection and inserts them into the global index. From the global point of view, peers build a DHT which is used to maintain a global inverted index. Each peer maintains the part of the global index assigned to it by the DHT, i.e., it stores a number of keys and associated posting lists. The peer also enables document retrieval by interacting with the DHT to retrieve the list of documents relevant to the submitted query, and participates in the ranking procedures which are described in detail in Section 4. The architecture of the Alvis P2P search engine is decomposed into layers presented in Figure 1.

The P2P layer builds a DHT for storing the global HDK index associating keys to document frequencies and posting lists. Each posting also includes statistics relevant to that document: term frequency of each key term in the given document, and document length, which are used for ranking.
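One way to picture the contents of a global index entry is the following Python sketch. The class and field names are hypothetical, and the modulo-hash placement in `responsible_peer` is only a stand-in for a real DHT lookup, which the paper does not detail.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Posting:
    doc_id: str
    doc_len: int        # document length, used for ranking
    term_freqs: dict    # key term -> term frequency in this document


@dataclass
class IndexEntry:
    key: frozenset      # one or more terms
    doc_freq: int = 0   # global document frequency of the key
    postings: list = field(default_factory=list)


def responsible_peer(key, n_peers):
    """Map a key to a peer number. Simplified assumption: hash the
    sorted terms and take the result modulo the number of peers."""
    text = " ".join(sorted(key)).encode("utf-8")
    return int(hashlib.sha1(text).hexdigest(), 16) % n_peers


entry = IndexEntry(key=frozenset({"p2p", "ranking"}), doc_freq=1)
entry.postings.append(Posting("doc-1", 120, {"p2p": 3, "ranking": 2}))
peer = responsible_peer(entry.key, n_peers=10)
```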

[Figure not reproduced. Recoverable labels: IR peer, Web service IF, Ranking, HDK (Index, Query), P2P, GKI (Global Key Index).]

Fig. 1. Overview of the P2P search engine architecture

The HDK layer is responsible for two tasks: during the key-based indexing task, peers build the set of keys and associated posting lists from their local document collection, and during the querying task, a peer has to find relevant keys in the global index, retrieve the posting lists that are associated with such keys, and merge them. The indexing task is triggered when the peer joins the network: first, a peer builds a standard single-term index from its local collection and inserts it into the DHT. Next, it waits for messages from the DHT notifying it to expand certain single-term keys that appear in more than DFmax global documents. Upon receiving such a request, the peer expands the key and inserts it together with its posting list into the DHT. Note, however, that during the insertion of a key-posting pair into the DHT, a maximum of DFmax postings will be inserted into the network. During the querying phase, the peer that received the query maps the query terms into keys that are stored in the global index. The peer explores the lattice of query term combinations starting with the largest possible term set, which is limited either by the query size or the maximal key size. If this term set does not exist in the global index, term combinations of decreasing sizes are explored, and this process continues until all terms forming the query are covered with retrieved keys. The resulting posting list is the union of postings associated with retrieved keys, and it is used as the input data to the Ranking layer.
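The lattice exploration performed during querying can be sketched in Python as follows. This is a simplified illustration, not the ALVIS implementation: the global index is modeled as a plain dictionary instead of a DHT, and the rule of skipping term combinations whose terms are already covered is an assumption, since the text does not state the exact pruning strategy.

```python
from itertools import combinations


def map_query_to_keys(query_terms, global_index, max_key_size):
    """Explore term combinations from the largest size downwards and
    return the keys found in the index, stopping once every query term
    is covered by some retrieved key."""
    terms = sorted(set(query_terms))
    covered, found = set(), []
    for size in range(min(len(terms), max_key_size), 0, -1):
        for combo in combinations(terms, size):
            key = frozenset(combo)
            if key <= covered:      # assumption: skip already covered terms
                continue
            if key in global_index:
                found.append(key)
                covered |= key
        if covered == set(terms):
            break
    return found


def merge_postings(keys, global_index):
    """Union of the documents listed under the retrieved keys."""
    docs = set()
    for key in keys:
        docs |= set(global_index[key])
    return docs


# Hypothetical index content: key -> list of document identifiers.
index = {
    frozenset({"t1", "t2"}): ["d1", "d2"],
    frozenset({"t3"}): ["d2", "d3"],
    frozenset({"t1"}): ["d9"],
}
keys = map_query_to_keys(["t1", "t2", "t3"], index, max_key_size=3)
docs = merge_postings(keys, index)
```

In this example the three-term set is not indexed, so the search falls back to `{t1, t2}` and then to the single-term key `{t3}`; the single-term key `{t1}` is never fetched because `t1` is already covered.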

The Ranking layer is responsible for producing a ranked and ordered set of documents during both the indexing and querying phases, and is described in detail in the following section.

4 Implementation of the Content-Based Ranking Component

As mentioned, the Ranking layer implemented in the Alvis prototype is in charge of computing document rankings during both the indexing and query processes. The score of a document w.r.t. a term is computed using the well-known BM25 relevance function, which utilizes the following statistics:

Global values
• Term dependent: document frequency of the term in the global collection;
• Term independent: average document length, number of documents in the collection.

Local values
• Term dependent: term frequency in a document;
• Term independent: document length.

The score of a document w.r.t. a key (which is a set of terms) is calculated as the sum of individual scores of the document w.r.t. each term in the key. As mentioned before, local values are stored in the global index with each stored posting. Term independent global values are retrieved periodically by the ranking layer of each peer, and stored locally. Global document frequency of a key in the global collection is maintained by a peer responsible for that key, as determined by the DHT.
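The following Python sketch shows how the listed statistics enter a BM25 score and how a key score is obtained as a sum of term scores. The paper does not state which BM25 variant or parameter values are used; the formula below is a commonly used form with a non-negative IDF, and the defaults `k1=1.2` and `b=0.75` are conventional values assumed here for illustration.

```python
import math


def bm25_term_score(tf, doc_len, df, n_docs, avg_doc_len, k1=1.2, b=0.75):
    """BM25 score of one document for one term.
    Local values: tf (term frequency), doc_len (document length).
    Global values: df (document frequency), n_docs, avg_doc_len."""
    if tf == 0:
        return 0.0
    idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + length_norm)


def bm25_key_score(term_freqs, doc_len, doc_freqs, n_docs, avg_doc_len):
    """Score of a document w.r.t. a key: sum of its per-term scores."""
    return sum(
        bm25_term_score(tf, doc_len, doc_freqs[term], n_docs, avg_doc_len)
        for term, tf in term_freqs.items()
    )


# Hypothetical statistics for a two-term key.
score = bm25_key_score(
    term_freqs={"t1": 1, "t2": 2},
    doc_len=100,
    doc_freqs={"t1": 10, "t2": 10},
    n_docs=1000,
    avg_doc_len=100.0,
)
```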

4.1 Ranking Layer Role in Indexing Process

When a peer joins the network, it starts to index its local document collection, and inserts its keys and associated postings containing local values into the P2P overlay. If the size of the posting list exceeds the DFmax parameter, only the DFmax most relevant postings for the given key will be inserted into the network. Therefore, the ranking layer needs to rank all postings w.r.t. the key, produce an ordered set of postings, and insert the top-DFmax postings into the network. Local values needed for ranking are available as they are obtained during the indexing of the local collection. However, instead of using global document collection statistics, which would need to be retrieved from the network, the ranking layer uses local collection values. Ranking with local collection statistics (in particular document frequencies of key terms) produces the same posting order as ranking with global statistics, while saving bandwidth and improving performance. After producing the ordered set of postings for a key, the top-DFmax postings are inserted into the network.

As mentioned before, from the global view of the network, each peer is responsible for storing and maintaining a part of the global index. A peer stores a maximum of DFmax postings associated with each key. As a peer continuously receives posting lists for a key, the cumulative number of postings may become greater than DFmax. In this case, the ranking layer needs to rank both the stored postings and the newly received postings w.r.t. the key, and store only the DFmax best-ranked postings. To calculate ranking scores, the ranking layer uses document-related statistics available in the global index and global statistics which are available at each peer (number of documents, average document length), as they are periodically requested from the network. Note that the global document frequency of a key is available locally at this peer only if the key is single-term. If the key consists of multiple terms, the global document frequency of each key term needs to be retrieved from the peers which are responsible for maintaining them in the global index. After retrieving these document frequencies, the ranking layer ranks all documents, stores the top-DFmax ranked documents, and discards the rest.
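The merge-and-truncate step at the peer responsible for a key can be sketched as below. The function name, the posting representation as `(doc_id, stats)` pairs, and the rule that a newer posting for the same document replaces the stored one are assumptions for illustration; `score` stands for any scoring callable, such as a BM25 key score.

```python
import heapq


def keep_top_postings(stored, incoming, score, df_max):
    """Merge stored and newly received postings for one key and keep
    only the df_max best-ranked ones, best first."""
    merged = {}
    for doc_id, stats in list(stored) + list(incoming):
        merged[doc_id] = stats  # assumption: newer posting wins
    return heapq.nlargest(df_max, merged.items(),
                          key=lambda item: score(item[1]))


# Hypothetical postings whose "stats" are already numeric scores.
stored = [("d1", 1.0), ("d2", 3.0)]
incoming = [("d3", 2.0), ("d1", 5.0)]
top = keep_top_postings(stored, incoming, score=lambda s: s, df_max=2)
```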

4.2 Ranking Layer Role in Query Processing

The ranking layer produces the final ranked and ordered result set according to the relevance of a document w.r.t. a query Q. The following example illustrates the ranking function implemented in our prototype. First, it relies on a retrieval procedure to locate relevant documents. Second, the ranking of the retrieved documents is performed. Figure 2 illustrates the process of ranking a retrieved document set w.r.t. the query. Assume the query originator, Peerq, produced a query Q consisting of terms t1, t2 and t3. The HDK layer retrieved posting lists for keys k1 = {t1, t2} and k2 = {t3} from the global index. Retrieved global index entries contain document-related statistics (term frequencies and document lengths), which are required for ranking. The ranking function also requires the global document frequency of each query term. For single-term keys, the document frequencies are retrieved during the HDK search, while for multiple-term keys they need to be retrieved separately. In our example, the global document frequency of key k2 is retrieved during HDK retrieval, while the global document frequencies for terms t1 and t2 of key k1 need to be additionally retrieved from the peers responsible for storing these terms in the global index. Other global collection statistics that are required for ranking, such as the number of documents and average document length, are periodically retrieved from the network, so they are readily available at the querying peer.
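The rule for deciding which document frequencies still have to be fetched can be written as a short Python sketch. The function name is hypothetical; the logic follows the example above, where single-term keys bring their document frequency with them and multi-term keys do not.

```python
def terms_needing_df_lookup(retrieved_keys):
    """Return the terms whose global document frequency was not
    obtained during the HDK search and must be requested separately."""
    known = {next(iter(key)) for key in retrieved_keys if len(key) == 1}
    needed = set()
    for key in retrieved_keys:
        if len(key) > 1:
            needed |= set(key) - known
    return needed


k1 = frozenset({"t1", "t2"})
k2 = frozenset({"t3"})
extra = terms_needing_df_lookup([k1, k2])
```

For the keys of the example, `extra` contains `t1` and `t2`, while the document frequency of `t3` is already known from the retrieved single-term key.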


Fig. 2. Ranking a document set when answering a query

Having retrieved the required document frequencies from the global index, the querying peer is able to calculate the score of each retrieved document w.r.t. query, and produce a ranked and ordered set of retrieved documents. In order to present the retrieved and ranked documents to the user, document digests comprising a document title, snippet and URL, need to be retrieved from peers storing these documents. However, digests for only ten top-ranked documents will be retrieved, as the user is rarely interested in more than ten best-ranked documents. Besides, retrieving digests for all retrieved documents might result in contacting a very large number of peers, and would prove unscalable in terms of bandwidth consumption. If a user is interested in other documents besides the ten best, digests for these documents will be retrieved on demand, in steps of ten digests for ten documents currently viewed by the user.
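The on-demand digest retrieval in steps of ten can be sketched as follows. The representation of the ranked result as `(doc_id, hosting_peer)` pairs and the function name are assumptions for illustration; the paper does not describe how the hosting peer of a document is recorded.

```python
from collections import defaultdict


def digest_requests(ranked_docs, page, page_size=10):
    """Group the documents of one result page by the peer storing
    them, so that one digest request per peer can be sent."""
    start = page * page_size
    requests = defaultdict(list)
    for doc_id, peer in ranked_docs[start:start + page_size]:
        requests[peer].append(doc_id)
    return dict(requests)


# Hypothetical ranked result: 25 documents spread over 4 peers.
ranked = [(f"d{i}", f"peer{i % 4}") for i in range(25)]
first_page = digest_requests(ranked, page=0)
last_page = digest_requests(ranked, page=2)
```

Only the peers hosting documents of the requested page are contacted, so the number of digest requests per page never exceeds the page size.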

5 Performance Analysis

Performance analysis of the ranking layer implemented in our prototype was performed in our lab environment. Each peer was running on a separate machine with the following characteristics: Intel Celeron, 2.66 GHz, 1024 MB of RAM. The document collection used in the setup was from the Reuters corpus, and each test consisted of submitting 1000 queries from the Wikipedia query log to the engine.

Figure 3 depicts the average distributed ranking time per query in a network of 10 peers and a growing document collection where each peer stored from 1000 up to 12000 documents. At first, the ranking time slightly increases since peers become loaded with the size of the index, but afterwards becomes constant with global collection growth.

[Figure not reproduced. Plot title: Ranking time per query [ms]. x-axis: number of documents in the global collection (10000 to 120000); y-axis: milliseconds (0 to 120).]

Fig. 3. Ranking time per query with a growing document collection

We experimentally measured the number of peers that need to be contacted by the querying peer in order to rank the resulting documents. Our network consisted of 4, 6, 8, and 10 peers, where each peer stored 5000 documents. The ranking layer needs to retrieve the global document frequency for each term in the query (unless it was already retrieved during the HDK search), and retrieve document digests for the ten best documents.

[Figure not reproduced. Plot title: Number of contacted peers per query. x-axis: number of peers in the P2P network (4, 6, 8, 10); y-axis: number of contacted peers (0 to 8); series: document frequencies, digests.]

Fig. 4. Number of peers contacted by the ranking peer during a query

The results shown in Figure 4 indicate that during a query the querying peer needs to contact on average 2.5 peers to retrieve global document frequencies, and that this number remains constant with network growth. This is reasonable, as this value depends mostly on the number of terms in a query, which is also an upper bound for this value. The results also indicate that the number of requests for document digests sent by the querying peer grows linearly when increasing the number of peers in the network. However, we only retrieve 10 document digests for the best-ranked documents, so in the worst-case scenario the number of contacted peers for digest retrieval is 10, no matter the network size. Experimental evaluation has shown that our ranking process is scalable and independent of the network size, as both ranking time and the number of ranking messages exchanged between the peers when processing a query are bounded.
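The bound on digest requests can be made concrete with a small calculation. The sketch below is not taken from the experiments: it assumes, purely for illustration, that the ten best documents are spread uniformly and independently over the peers, and computes the expected number of distinct peers holding at least one of them. The value grows with the network size but can never exceed the number of digests requested.

```python
def expected_digest_peers(n_peers, n_digests=10):
    """Expected number of distinct peers hosting at least one of
    n_digests documents, under an assumed uniform placement."""
    miss = (1.0 - 1.0 / n_peers) ** n_digests  # a given peer holds none
    return n_peers * (1.0 - miss)


# Network sizes used in the experiment described above.
estimates = {n: expected_digest_peers(n) for n in (4, 6, 8, 10)}
```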

6 Conclusion

To satisfy user information needs in a P2P IR system, efficient ranking functionality is a necessity. Ranking implementation depends on the indexing strategy supported by the P2P search engine and the available collection statistics. The Alvis P2P search engine maintains the global collection statistics and local document statistics that are needed for the computation of document ranking scores. With the available global statistics it is possible to use state-of-the-art ranking models that use both term-dependent and term-independent statistic values for rank computation. The ranking mechanism implemented in the Alvis search engine scales well in a growing P2P network in terms of ranking time per query and number of messages exchanged between the peers during the ranking process.

Future work will include the design of a link-based ranking module to additionally refine the content-based ranking scores. We also consider designing a community-based ranking model to identify peers with similar interests and rank documents depending on the preferences of the community in which they are included.

References

[1] Baeza-Yates, R., Castillo, C., Junqueira, F., Plachouras, V., Silvestri, F.: Challenges in distributed information retrieval (invited paper). In: ICDE (2007)

[2] Yee, W.G., Beigbeder, M., Buntine, W.: SIGIR06 workshop report: Open Source Information Retrieval systems (OSIR06). SIGIR Forum 40(2), 61–65 (2006)

[3] Aberer, K., Alima, L.O., Ghodsi, A., Girdzijauskas, S., Haridi, S., Hauswirth, M.: The Essence of P2P: A Reference Architecture for Overlay Networks. In: Fifth IEEE International Conference on Peer-to-Peer Computing, pp. 11–20 (2005)

[4] Luu, T., Klemm, F., Podnar, I., Rajman, M., Aberer, K.: ALVIS Peers: A Scalable Full-text Peer-to-Peer Retrieval Engine. In: Workshop on Peer-to-Peer Information Retrieval (P2PIR 2006), ACM 15th Conference on Information and Knowledge Management Workshops, November 2006, pp. 41–48 (2006)

[5] Bender, M., Michel, S., Weikum, G., Zimmer, C.: The MINERVA Project: Database Selection in the Context of P2P Search. In: BTW 2005, Karlsruhe, Germany (2005)

[6] Suel, T., Mathur, C., Wu, J.-W., Zhang, J., Delis, A., Kharrazi, M.I., Long, X., Shanmugasundaram, K.: ODISSEA: A Peer-to-Peer Architecture for Scalable Web Search and Information Retrieval. In: International Workshop on the Web and Databases (WebDB 2003), San Diego, California, USA (2003)

[7] Podnar, I., Rajman, M., Luu, T., Klemm, F., Aberer, K.: Beyond term indexing: A P2P framework for web information retrieval. Informatica 2(30), 153–161 (2006)