p2p information retrieval: a self-organizing …oak.cs.ucla.edu/~sia/tp3.pdfp2p information...

10
P2P Information Retrieval: A Self-Organizing Paradigm Ka Cheung Sia Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin, N.T., Hong Kong SAR [email protected] November 28, 2002 Abstract In the light of image retrieval evolving from text anno- tation to content-based and from standalone applications to web-based search engines to increase popularity, we foresee the need for deploying content-based image re- trieval (CBIR) into Peer-to-Peer (P2P) architecture. By doing so, we not only distribute the tasks of feature ex- traction, indexing and storage of image data into peers, we also introduce another aspect of searching in addition to filename-based method in prevalent P2P applications. Through the deployment of DISCOVIR, we introduce a model to improve query efficiency targeting on content- based search. As clustering technique is widely used in database and information retrieval system for organizing data and im- proving retrieval efficiency. We surmise such functionality is valuable to a P2P distributed environment. In this pa- per, we introduce the concept of peer clustering at the level of overlaying network topology, and content-based query routing strategy to improve existing retrieval meth- ods. We design and implement a DIStributed COntent- based Visual Information Retrieval (DISCOVIR) sys- tem with content-based query functionality and improved query efficiency. We demonstrate its scalability and effi- ciency through simulation. 1 Introduction The appearance of Peer-to-Peer (P2P) applications in recent years have demonstrated the significance of dis- tributed information sharing systems by offering the ad- vantages of scalable storage space. However, since data storage is decentralized, data location process need to be reformulated in order to achieve efficient lookup. Cur- rent researches focus on improving data access under the filename-based lookup, i. e. filename is the only term to index a file. We argue that such lookup process is in- sufficient when the data collection is huge or distributed, especially in the P2P environment. Users might annotate the same file with different filenames, making the data location process error-prone and user dependent. We be- lieve content-based information retrieval should be the ultimate goal of a P2P system. Regarding to current content-based image retrieval (CBIR) systems, we envisage the potential use of P2P network in both scattering data storage and distributing workload of feature extraction and indexing. Through the realization of CBIR in P2P network, we can support enor- mous image collection without installing high-end equip- ment for the web server and incorporate individual user’s contribution compared to prevalent centralized web-based search engine. Also, we make use of the computation power of peers for image preprocessing and indexing in addition to data storage. By combing CBIR into a P2P environment, we 1) revo- lutionize the way of searching in existing P2P architecture and 2) overcome the bottleneck problem of running CBIR in a centralized fashion. In this paper, we introduce the design and implementation of DISCOVIR (DIStributed COntent-based Visual Information Retrieval) [4, 26], a CBIR software running on Gnutella [7]. In the notion of CBIR in P2P, current searching methodologies in P2P system might not be applicable or efficient as they fo- cus on filename-based search. Therefore, we propose the Peer Clustering at the level of overlaying network topol- ogy, and Firework Query Model (FQM) [20] to facilitate content-based searching in P2P. In the process of integrating CBIR into P2P and de- signing of algorithm for search efficiency improvement, we discover some resemblance between P2P and Neural Net- works. We envision to model the P2P as Neural Network might be a potential to improve P2P system into a more intelligent environment rather than a data sharing net- work. We try to give some preliminary ideas at the end of this paper and set this as the future direction. In the coming section, we will detail the related works in CBIR and P2P, and justify the motivation and ne- cessity of combining them. In Section 3.1, we introduce the architecture of DISCOVIR and the functionality of its components, while in Section 3.2, we detail the Peer Clustering and FQM for improving content-based search in P2P. In Section 4, we present experiment result of sim- ulation of FQM and give some analysis. We address on 1

Upload: lethu

Post on 02-Jul-2018

213 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: P2P Information Retrieval: A Self-Organizing …oak.cs.ucla.edu/~sia/tp3.pdfP2P Information Retrieval: A Self-Organizing Paradigm Ka Cheung Sia Department of Computer Science and Engineering

P2P Information Retrieval: A Self-Organizing Paradigm

Ka Cheung SiaDepartment of Computer Science and Engineering

The Chinese University of Hong KongShatin, N.T., Hong Kong SAR

[email protected]

November 28, 2002

Abstract

In the light of image retrieval evolving from text anno-tation to content-based and from standalone applicationsto web-based search engines to increase popularity, weforesee the need for deploying content-based image re-trieval (CBIR) into Peer-to-Peer (P2P) architecture. Bydoing so, we not only distribute the tasks of feature ex-traction, indexing and storage of image data into peers,we also introduce another aspect of searching in additionto filename-based method in prevalent P2P applications.Through the deployment of DISCOVIR, we introduce amodel to improve query efficiency targeting on content-based search.

As clustering technique is widely used in database andinformation retrieval system for organizing data and im-proving retrieval efficiency. We surmise such functionalityis valuable to a P2P distributed environment. In this pa-per, we introduce the concept of peer clustering at thelevel of overlaying network topology, and content-basedquery routing strategy to improve existing retrieval meth-ods. We design and implement a DIStributed COntent-based Visual Information Retrieval (DISCOVIR) sys-tem with content-based query functionality and improvedquery efficiency. We demonstrate its scalability and effi-ciency through simulation.

1 Introduction

The appearance of Peer-to-Peer (P2P) applications inrecent years have demonstrated the significance of dis-tributed information sharing systems by offering the ad-vantages of scalable storage space. However, since datastorage is decentralized, data location process need to bereformulated in order to achieve efficient lookup. Cur-rent researches focus on improving data access under thefilename-based lookup, i. e. filename is the only term toindex a file. We argue that such lookup process is in-sufficient when the data collection is huge or distributed,especially in the P2P environment. Users might annotatethe same file with different filenames, making the datalocation process error-prone and user dependent. We be-

lieve content-based information retrieval should be theultimate goal of a P2P system.

Regarding to current content-based image retrieval(CBIR) systems, we envisage the potential use of P2Pnetwork in both scattering data storage and distributingworkload of feature extraction and indexing. Through therealization of CBIR in P2P network, we can support enor-mous image collection without installing high-end equip-ment for the web server and incorporate individual user’scontribution compared to prevalent centralized web-basedsearch engine. Also, we make use of the computationpower of peers for image preprocessing and indexing inaddition to data storage.

By combing CBIR into a P2P environment, we 1) revo-lutionize the way of searching in existing P2P architectureand 2) overcome the bottleneck problem of running CBIRin a centralized fashion. In this paper, we introduce thedesign and implementation of DISCOVIR (DIStributedCOntent-based Visual Information Retrieval) [4, 26], aCBIR software running on Gnutella [7]. In the notionof CBIR in P2P, current searching methodologies in P2Psystem might not be applicable or efficient as they fo-cus on filename-based search. Therefore, we propose thePeer Clustering at the level of overlaying network topol-ogy, and Firework Query Model (FQM) [20] to facilitatecontent-based searching in P2P.

In the process of integrating CBIR into P2P and de-signing of algorithm for search efficiency improvement, wediscover some resemblance between P2P and Neural Net-works. We envision to model the P2P as Neural Networkmight be a potential to improve P2P system into a moreintelligent environment rather than a data sharing net-work. We try to give some preliminary ideas at the endof this paper and set this as the future direction.

In the coming section, we will detail the related worksin CBIR and P2P, and justify the motivation and ne-cessity of combining them. In Section 3.1, we introducethe architecture of DISCOVIR and the functionality ofits components, while in Section 3.2, we detail the PeerClustering and FQM for improving content-based searchin P2P. In Section 4, we present experiment result of sim-ulation of FQM and give some analysis. We address on

1

Page 2: P2P Information Retrieval: A Self-Organizing …oak.cs.ucla.edu/~sia/tp3.pdfP2P Information Retrieval: A Self-Organizing Paradigm Ka Cheung Sia Department of Computer Science and Engineering

how to extend current works in Section 5 and lastly, wegive concluding remarks in Section 6.

2 Background

2.1 Content-based Image Retrieval

In early image retrieval system, it requires human an-notation and classification on the image collection, thequery is thus performed using text-based information re-trieval method. However, there are several limitations forsuch implementation, they are Human Intervention,Non-Standard Description and Linguistic Barriers.These problems are due to the manual annotation on im-ages, which is tedious, error-prone and subject to individ-ual user perceptions and understandings.

In order to solve these problems, CBIR is proposedto pass such tedious task to computer. Since early1990’s, many CBIR systems have been proposed and de-veloped, some of them are QBIC [5], WebSEEK [28], SIM-PLIcity [30], MARS [16], Photobook [21], WALRUS [19]and other systems for some domain specific applications[12, 11]. In these systems, images are indexed by low-levelfeature like color, texture and shape information whichcan be extracted easily using computer. These systemsare not designed to be distributed across different com-puters in a network. One of the shortcomings is that thefeature extraction, indexing and also the query processingare all done in a centralized fashion which is computation-ally intensive and difficult to scale up. As indicated byseveral researchers [24, 27], one of the promising futuretrends in CBIR includes the distribution of data collec-tion, data processing and information retrieval. By ex-tending the centralized system model, we not only canincrease the size of image collections easily, but we alsoovercome the scalability bottleneck problem by distribut-ing the process of feature extraction and retrieval.

2.2 Peer-to-Peer

P2P network is a recently evolved paradigm for dis-tributed computing. With the emerging P2P networksand various implementations such as Gnutella [7], Nap-ster [18], LimeWire [13], YouServ [1], Freenet [6], Mor-pheus [17] and KaZaA [10], they focus on the retrievalof data based on filenames or metadata, such as ID3 tagof MP3, and offer the advantages of distributed resource[25], increased reliability [3] and comprehensiveness of in-formation [14].

P2P systems were classified by [15] into three differentcategories, namely: Centralized, Decentralized Unstruc-tured and Decentralized Structured.

Centralized - Napster [18] and SETI [25], to a certainextend, fall into this category, every peer in the P2P net-work keeps informing a central locations the informationof its shared data, while queries are issued to this cen-tral locations. Though Napster was extremely successful,

such scheme only utilize the data storage capability inP2P and subjects to legal problems and single point fail-ure. Due to this reason, Researches proceed to the othertwo paradigms or work on content anti-censorship anduser anonymity to bypass legal responsibilities.

Decentralized Unstructured - Gnutella [7] is an ex-ample of such designs. Figure 1 shows an example of atypical P2P network in this category. In the example, dif-ferent files are shared by different peers, while there is noindex information on where data are stored. When apeer initiates a search for a file, it broadcasts a query re-quest to all its connecting peers. Its peers then propagatethe request to their connected peers and this process con-tinues. Unlike the client-server architecture of the web,the P2P network aims at allowing individual computer,which joins and leaves the network frequently, to shareinformation directly with each other without the help ofdedicated servers. In these networks, a peer can becomea member of the network by establishing a connectionwith one or more peers in the current network. Messagesare sent over multiple hops from one peer to another,known as flooding, while each peer responds to queriesfor information it shares locally. Although this designsare extremely resilient to nodes entering and leaving thenetwork, they are unscalable and generates large loads forthe network participants.

Decentralized Structured - These systems have nocentral directory server, thus are considered decentral-ized. While there is a significant amount of structureimposing on the P2P overlay network in order to improvesearching in a decentralized fashion. Such highly struc-tured P2P designs are quite prevalent in the research liter-ature like CAN [22], Chord [29], Pastry [23] and Tapestry[32], they make use of hash function and scatter the stor-age of index in a controlled manner for efficient lookup ofdata.

Both CAN and Chord use Distributed Hash Table(DHT) to map the filename to a key, and each peer isresponsible for storing certain range of (key, value) pairs.When a peer looks for a file, it hashes the filename to a keyand ask the peers responsible for this key for the actualstorage location of that file. Chord models the key as anm-bits identifier and arranges the peers into a logical ringtopology to determine which peer is responsible for stor-ing which (key, value) pair. CAN models the key as pointon a d-dimension Cartesian coordinate space, while eachpeer is responsible for (key, value) pairs inside its specificregion. Such systems take a balance between the cen-tralized index and totally decentralized index approaches.They speed up and reduce message passing for the pro-cess of key lookup (data location); however, they incur apenalty for redistributing index storage when peers joinand leave the network frequently, especially in a dynamicenvironment like the Internet, thus the applicability in anenvironment of extremely transient population of nodesis still an unknown.

2

Page 3: P2P Information Retrieval: A Self-Organizing …oak.cs.ucla.edu/~sia/tp3.pdfP2P Information Retrieval: A Self-Organizing Paradigm Ka Cheung Sia Department of Computer Science and Engineering

2.3 Motivation

The development from Centralized to Decentralized Un-structured and to Decentralized Structured aims at im-proving data lookup efficiency by:

1. Placement of index - DHT-based algorithms im-prove the Decentralized Unstructured P2P by storingindex distributively for use in lookup process.

2. Replication - Freenet [6] and some other researches[2, 15] propose to replicate data in other peers to im-prove the query efficiency in the Decentralized Un-structured fashion. This method is merely a tradeoffbetween data storage and lookup efficiency.

We then ask: are we able to formulate a P2P model thatoptimizes for the data location process, while introducingcomparatively less penalty when peers join and leave thenetwork? Are we able to perform content-based searchingin this P2P network? Instead of distributing the storageof index into different peers, we allow a peer to index itsown data collection while retaining the original Gnutellanetwork topology but imposing a requirement on connec-tions to serve as the global indexing structure. We namethis architecture as Decentralized Quasi-structuredand use Self Orangization to improve query efficiency.Moreover, we introduce the content-based image searchfunctionality to illustrate the potential richness of queriesin a P2P network.

Figure 1: P2P information retrieval

3 DISCOVIR

3.1 System Architecture

In this section, we will describe the architecture of a DIS-COVIR client and the communication protocol in orderto perform CBIR over a P2P network. Through the DIS-COVIR program, users can share images among peers

around the world. Each peer is responsible for extractingand indexing the feature of the shared images, by doingso, every peers can search for similar images based on im-age content, like color, texture and shape among imagesshared by all DISCOVIR peer in the network.

Figure 2 depicts the key components and their inter-action in the architecture of a DISCOVIR client. AsDISCOVIR is derived from LimeWire [13] open sourceproject, the operations of Connection Manager, PacketRouter and HTTP Agent remain more or less the samewith additional functionality to improve the query mecha-nism used in original Gnutella network. Plug-in Manager,Feature Extractor and Image Indexer are introduced tosupport the CBIR task. The User Interface is modifiedto incorporate the image search panel, Figure 3 shows ascreen capture of DISCOVIR in the image search page.Here are the brief descriptions of each component:

• Connection Manager - It is responsible for set-ting up and managing the TCP connection betweenDISCOVIR clients.

• Packet Router - It masters the routing of mes-sage in DISCOVIR network between components andpeers.

• Plug-in Manager - It coordinates the storage ofdifferent feature extraction modules and their inter-action with Feature Extractor and Image Indexer.

• HTTP Agent - It is a tiny web-server that handlesfile download request from other DISCOVIR peersusing HTTP protocol.

• Feature Extractor - It collaborates with the Plug-in Manager to perform various feature extraction andthumbnail generation of the shared image collection.

• Image Indexer - It indexes the image collection bycontent feature and carry out clustering to speed upthe retrieval of images.

• Bootstrap Server - A server program maintain-ing an update list of currently available DISCOVIRpeers.

Connection Manager

Packet Router

Plug - in Manager

HTTP Agent

Feature Extractor

Image Indexer

DISCOVIR Network Shared Collection

DISCOVIR Core

DISCOVIR User Interface

WWW

Image Manager

Figure 2: Architecture of DISCOVIR

3

Page 4: P2P Information Retrieval: A Self-Organizing …oak.cs.ucla.edu/~sia/tp3.pdfP2P Information Retrieval: A Self-Organizing Paradigm Ka Cheung Sia Department of Computer Science and Engineering

Figure 3: Screen Capture of DISCOVIR

3.1.1 Flow of Operations and Functionality ofDISCOVIR Components

The following is a scenario walk-through to demon-strate the interaction between components. When a userchooses to preprocess his image collection, the FeatureExtractor collaborates with Plug-in Manager to extractcontent feature and generate thumbnails from all imagesin the shared directory. The result is passed to Image In-dexer for indexing and clustering purposes. In case whena new DISCOVIR client wants to join the network, theConnection Manager asks a Bootstrap server for a list ofcurrently available DISCOVIR clients in order to hook upto the network. Once a user initiates a query for similarimages, the Feature Extractor extracts feature from theexample image instantly, Packet Router is responsible forassembling an ImageQuery message and sending out tothe DISCOVIR network. For instance, when an Image-Query message is received from other peers, the PacketRouter checks for any duplication and propagates to otherpeers through DISCOVIR network. Meanwhile, it passesthe message to Image Indexer for searching similar im-ages. Upon similar images are found, an ImageQuery-Hit message is assembled and passed to Packet Routerfor replying the initiating peer. When ImageQueryHitmessages return to the initiating peer, its HTTP Agentdownloads thumbnails or full size images from other peersupon receiving user request from the user interface.

Preprocessing

Plug-in Manager is responsible for contacting web-site ofDISCOVIR to inquire the list of available feature extrac-tion modules, it will download and install selected mod-ules upon user request. Currently, DISCOVIR supportsvarious feature extraction methods in color and texturecategories such as AverageRGB, GlobalColorHistogram,ColorMoment, Co-occurrence matrix, etc [9]. All featureextraction modules strictly follow a predefined interfacein order to realize the polymorphic properties of switchingbetween different plug-ins dynamically, see Fig. 4.

Feature Extractor will extract feature and generatethumbnails for all images in the shared collection by usinga particular feature extraction module. Let I represent araw image data, f be the feature extraction method, thefeature extractor perform the task illustrated in Eq. 1,

f : I × θ → ~I (1)

where θ is the feature extraction parameter and ~I is theextracted feature vector. Image Indexer will then indexthe image collection using the multi-dimensional featurevectors ~I in order to answer an incoming query. It alsoclusters the set of feature vector for the sake of improv-ing query efficiency by acquiring statistical distributioninformation of the local image collection.

Compared with the centralized web-based CBIR ap-proach, sharing the workload of this computational costlytask among peers by allowing them to store and indextheir own image collection helps solving the bottle-neckproblem by utilizing distributed computing resources.

Connection Establishment

For a peer to join the DISCOVIR network, it connectsto the Bootstrap Server using the Connection Manager.The Bootstrap Server is responsible for storing a finite listof IP address of peers currently in the DISCOVIR net-work and randomly picks an IP address to return to thepeer. Once the IP address is received, the peer is able tohook up to the DISCOVIR network by connecting to theselected peer. In order to provide the Bootstrap Serverwith update list of IP address, the Connection Manager ofeach peer sends an alive message to the bootstrap serverperiodically after it has connected to the DISCOVIR net-work.

Query Message Routing

After a peer joins the DISCOVIR network, it may ini-tiate a query for similar images. The Feature Extractorprocesses the query image instantly and assemble an Im-ageQuery message to be sent out through Packet Router.Likewise, when other peers receive the ImageQuery mes-sages, they need to perform two operations, Query Mes-sage Propagation and Local Index Look Up.

• Query Message Propagation - In order to pre-vent query messages from looping forever in the DIS-COVIR network, two mechanisms are inherited fromGnutella, namely, the Gnutella replicated messagechecking rule and Time-To-Live (TTL) of messages.The replicated message checking rule can preventa peer from propagating the same query messageagain. The TTL mechanism constrains the horizonof a query message able to reach. After these twocheckings, the query message will be propagated tolinked peers through Packet Router.

4

Page 5: P2P Information Retrieval: A Self-Organizing …oak.cs.ucla.edu/~sia/tp3.pdfP2P Information Retrieval: A Self-Organizing Paradigm Ka Cheung Sia Department of Computer Science and Engineering

• Local Index Look Up - The peer searches its lo-cal index of shared files for similar images using theImage Indexer. Once similar images are retrieved,the peer delivers an ImageQueryHit message back tothe requester through Packet Router. Since imagesindexing is performed on the peer in preprocessingstage, the searching time can be speeded up.

Query Result Display

When an ImageQueryHit message returns to the re-quester, user will obtain a list detailing the location andsize of matched images. In order to retrieve the queryresult, the HTTP Agent will download thumbnails or fullsize image from the peer using HTTP protocol. On theother hand, HTTP Agent in other peers will serve as aweb server to deliver the requested images. This HTTPAgent is essential for integration to WWW which will bedescribed later in detail.

Figure 4: Downloading plug-in module for DISCOVIR

3.1.2 Gnutella Message Modification

The DISCOVIR system is built compatible to theGnutella network. In order to support the image queryfunctionalities mentioned above, two types of messagesare added. They are:

• ImageQuery - Special type of the Query message.It is to carry name of feature extraction method andfeature vector of query image, see Fig. 5

• ImageQueryHit - Special type of the QueryHitmessage. It is to respond to the ImageQuery mes-sage, it contains the location, filename and size ofsimilar images retrieved, and their similarity mea-sure to the query. Besides, the location informationof corresponding thumbnails are added for the pur-pose of previewing result set in a faster speed, seeFig. 6

Minimum Speed Feature Name 0 Feature Vector 0

Image Query 0x80

0 1 2 …

Figure 5: ImageQuery message format

Number of Hits Port IP Address Speed Result Set

Image Query Hit 0x81

0 1 2

Servant Identifier

3 6 7 10 11 ... n n+16

File Index File Size File Name Thumbnail information, similarity 0

0 3 4

0

7 8 ...

Figure 6: ImageQueryHit message format

3.2 Peer Clustering and Firework QueryModel

Our design goal for DISCOVIR is to improve data lookupefficiency for content-based search in a decentralized andunstructured P2P environment, while keeping a simplenetwork topology and number of message passing to aminimum. The DISCOVIR is built compatible to theGnutella network. There are two types of connections inDISCOVIR, namely random and attractive . Attrac-tive connections are to link peers sharing similar datatogether. We perform peer clustering at the level of over-laying network topology instead of locally shared data,thus content-based query routing is realizable to improvequery efficiency. As a result, it manages to be scalablewhen network grows. We have implemented a prototypeversion of DISCOVIR, built on top of LimeWire [13] withcontent-based image searching capability.

3.3 Peer Clustering

With the inherent nature of DISCOVIR network, we ap-ply notation in graph theory to model it, see Table. 1.For the sake of generality, we try to keep this in highlevel of abstraction. In the actual realization, we choosethe vector model in information retrieval literature as theunderlying data structure for representing data. Here aresome definitions:

Definition 1 We consider information shared by a peercan be represented in a multi-dimension point based onits content, and the similarity among files is based on thedistance measure between data points. Consider

f : cv → ~cv (2)

f : q → ~q (3)

f is the mapping function from file cv to a vector ~cv. Inthe notion of image processing, cv is the raw image data, fis a specific feature extraction method, ~cv is the extractedfeature vector characterizing the image. Likewise, f isalso used to map a query q to a query vector ~q, to be sentout when user makes a query.

5

Page 6: P2P Information Retrieval: A Self-Organizing …oak.cs.ucla.edu/~sia/tp3.pdfP2P Information Retrieval: A Self-Organizing Paradigm Ka Cheung Sia Department of Computer Science and Engineering

Table 1: Definition of TermsSymbol Description

G{V,E} The P2P network, with V denoting theset of peers and E denoting the set ofconnection

E = {Er, Ea} The set of connections, composed ofrandom connections, Er andattractive connections, Ea.

ea = (v,w, The attractive connection between peerssigv , sigw), v, w based on sigv and sigwv,w ∈ V, ea ∈ Ea

|V | Total number of peers.

|E| Total number of connections.

Horizon(v, t) ⊆ V Set of peers reachable from v withint hops

SIGv, v ∈ V Set of signature values characterizing thedata shared by peer v

D(sigv , sigv), Distance measure between specificv,w ∈ V signature values of two peers v and w.

Dq(sigv , q) Distance measure between a query q andsigv ∈ SIGv peer v based on sigv .

C = {Cv : v ∈ V } The collection of data shared in theDISCOVIR network.

Cv The collection of data shared by peer v,which is a subset of C.

REL(cv, q), A function determining relevance of datacv ∈ Cv cv to a query q. 1-relevant, 0-non-relevant

Definition 2 SIGv is the set of signature values repre-senting data characteristic of peer v, with each sigv rep-resenting each specific cluster of data. We define

sigv = (~µ, ~σ), (4)

where ~µ and ~σ are the statistical mean and standard de-viation of the collection of data belonging to a subcluster,C′v , C

′v ⊆ Cv. Thereafter, sigv characterizes certain por-

tion of data shared by peer p.

Definition 3 D(sigv, sigw) is defined as the distancemeasure between sigv and sigw, in other sense, the simi-larity between particular sub-cluster belonging to two dif-ferent peers v and w. It is defined as,

D(sigv, sigw) = || ~µv − ~µw||. (5)

|| ~µv − ~µw|| is the Euclidean distance between centroid oftwo sub-cluster symbolized by sigv, sigw. With this for-mula, we define the data affinity of two peers, we willlater use this to help organizing the network. The ac-tual implementation of this function can be varied, likethe five distance measures described in BIRCH [31], wechoose D0 as the distance measure between two clusters.

Based on the above definitions, we introduce a peerclustering algorithm, to be used in the network setup

stage, in order to help building the DISCOVIR as a self-organized network oriented in content affinity. It consistsof three steps:

1. Signature Value Calculation–Every peer prepro-cess its data collection and calculates a set of signa-ture values SIGv to characterize its data properties.Whenever the shared data collection, Cv , of a peerchanges, the signature value should be updated ac-cordingly. The whole data collection of the peer willbe divided into sub-clusters automatically by a clus-tering algorithm, e.g. k-means, competitive learning,and expectation maximization. The number of signa-ture values is variable and is a trade-off between datacharacteristic resolution and computational cost.

2. Neighborhood Discovery–After a peer joins theDISCOVIR network by connecting to a random peerin the network, it broadcasts a signature querymessage, similar to that of ping-pong messages inGnutella, to reveal the location and data character-istic of its neighborhood, Horizon(v, t). This task isnot only done when a peer first joins the network,it repeats every certain interval in order to maintainthe latest information of other peers.

3. Attractive Connection Establishment–By ac-quiring the signature values of other peers, one canreveal the peer with highest data affinity (similar-ity) to itself, and make an attractive connection tolink them up. When an existing attractive connec-tion breaks, a peer should check its host cache, whichcontains signature values of other peers found in theneighborhood discovery stage, and reestablish the at-tractive connection using peer clustering algorithmagain.

Having all peers joining the DISCOVIR network per-form the three tasks described above, you can envisiona P2P network with self-organizing ability to be con-structed. Peers sharing similar content will be groupedtogether like a Yellow Pages. Based on this content simi-larity based clustering, we will delineate a more complexquery strategy in the next section. The detail steps ofpeer clustering is illustrated in Algorithm 1, and Fig. 7depicts the peer clustering.

Algorithm 1 Algorithm for peer clusteringPeer-Clustering(peer v, integer ttl)for all sigv ∈ SIGv do

for all w ∈ Horizon(v, t) dofor all sigw ∈ SIGw do

Compute D(sigv , sigw)end for

end forEa = Ea ∪ (v,w, sigv , sigw) having min(D(sigv , sigw))

end for

6

Page 7: P2P Information Retrieval: A Self-Organizing …oak.cs.ucla.edu/~sia/tp3.pdfP2P Information Retrieval: A Self-Organizing Paradigm Ka Cheung Sia Department of Computer Science and Engineering

Figure 7: Illustration of peer clustering.

3.4 Firework Query Model

To make use of our clustered P2P network, we proposea content-based query routing strategy called FireworkQuery Model. In this model, a query message is routedselectively according to the content of the query. Once itreaches its designated cluster, the query message is broad-casted by peers through the attractive connections insidethe cluster much like an exploding firework as shown inFig. 8. Our strategy aims to minimize the number of mes-sages passing through the network, reduce the workloadof each computer and maximize the ability of retrievingrelevant data from the peer-to-peer network.

Clustered P2P network

Target Cluster

Query initiating peer

Figure 8: Illustration of firework query.

Here, we introduce the algorithm to determine whenand how a query message is propagated like a firework inAlgorithm 2. When a peer receives the query, it needs tocarry out two steps:

1. Shared File Look Up–The peer looks up its sharedinformation for those matched with the query. Letq be the query, and ~q be its vector representation,REL(cv, q) is the relevance measure between thequery and the information cv shared by peer v, itdepends on a L2 norm defined as,

REL(cv, q) =

{1 ||~cv − ~q|| ≤ T0 ||~cv − ~q|| > T,

where T is a threshold defining the degree of resultsimilarity a user wants. If any shared information

within the matching criteria of the query, the peerwill reply the requester. In addition, we can reducethe number of REL(cv, q) computations by perform-ing local clustering in a peer, thus speeding up theprocess of query response.

2. Route Selection–The peer calculates the distancebetween the query and each signature value of itslocal clusters, sigv , which is represented as,

Dq(sigv, q) =∑

i

qi − µiσi

, sigv = (µ, σ). (6)

If none of the distance measure between its local clusters’signature value and the query, Dq(sigv, q), is smaller thana preset threshold, θ, the peer will propagate the query toits neighbors through random connections. Otherwise, ifone or more Dq(sigv, q) is within the threshold, it impliesthe query has reached its target cluster. Therefore, thequery will be propagated through corresponding attrac-tive connections much like an exploding firework.

In our model, we retain two existing mechanisms inGnutella network for preventing query messages fromlooping forever in the distributed network, namely, theGnutella replicated message checking rule and Time-To-Live (TTL) of messages [7]. There is a modification onDISCOVIR query messages from the original Gnutellamessages. In our model, the TTL value is decrementedby one with a different probability when the message isforwarded through different types of connection. For ran-dom connections, the probability of decreasing TTL valueis 1. For attractive connections, the probability of de-creasing TTL value is an arbitrary value in [0, 1] calledChance-To-Survive (CTS). This strategy can reduce thenumber of messages passing outside the target cluster,while more relevant information can be retrieved insidethe target cluster because the query message has a greaterchance to survive depending on the CTS value. In theexperiment described thereafter, we choose CTS to be0, i. e. TTL of query message is not decremented whenforwarded through attractive link.

Algorithm 2 Algorithm for the Firework Query ModelFirework-query-routing (peer v, query q)for all sigv ∈ SIGv do

if Dq(sigv , q) < θ (threshold) thenif rand() > CTS thenqttl = qttl − 1

end ifif qttl > 0 then

propagate q to all ea(a, b, c, d) where a = v, c = sigv or b =v, d = sigv (attractive link)

end ifend if

end forif Not forwarding to attractive link thenqttl = qttl − 1if qTTL > 0 then

forward q to all er(a, b) where a = v or b = v (random link)end if

end if

7

Page 8: P2P Information Retrieval: A Self-Organizing …oak.cs.ucla.edu/~sia/tp3.pdfP2P Information Retrieval: A Self-Organizing Paradigm Ka Cheung Sia Department of Computer Science and Engineering

4 Implementations and Experi-mental Results

We have built a beta release of DISCOVIR. The client isimplemented in Java on top of the LimeWire open sourceproject. LimeWire is a Java implementation of GnutellaNetwork and its key components are under GNU Pub-lic License. We have released DISCOVIR as a free soft-ware and intend to release the source code in near fu-ture. Figure 9 illustrates the top ten results panel, itmakes use of the generated thumbnails in the preprocess-ing stage to avoid heavy traffic caused by downloading fullsize image directly, while users can preview lower qualityimages, Moreover, drawpad functionality is incorporatedin DISCOVIR. Besides supplying an example image forquery, users may draw their own sketch and search. Fig-ure 4 shows the plug-in module download page of DIS-COVIR. Currently, DISCOVIR supports AverageRGB,GlobalColorHistogram, LocalColorHistogram, ColorMo-ment and ColorCoherenceVector on color-based feature,Co-occurrence matrix, Auto correlation, Edge frequencyand Primitive length on texture-based feature. We areworking on shape-based feature and encourage contribu-tions from the community by implementing plug-ins.

Figure 9: Top 10 preview of images

4.1 Simulation

As it is difficult or even impossible to carry out perfor-mance evaluation by running several thousands of DIS-COVIR client simultaneously, we choose to predict theperformance and scalability by simulation. The two maingoals of our proposed routing algorithm in P2P networkare to: 1) increase the percentage of desired result re-trieved (Recall - R) and 2) decrease the percentage ofpeers visited (Visited - V ) for every query. Therefore,we define the query efficiency as R/V , the higher thisvalue, the higher the efficiency, and we investigate howthis quantity varies with different number of total peers.In the experiments, we generate a certain number of peersand randomly assign two to four classes of data points

Table 2: Parameters of simulationproperties synthetic and imagecardinality of dataset 20000number of classes 200number of peers {500,1000,2000,4000,

8000,10000,140000,20000}

dimensionality 9number of classes perpeer [2..4]number of local clusters {1,3}clustering method Expectation Maximizationavg. # of random links 1avg. # of attractivelinks dependsmax. in-degree of anode 40

to each of them. Then, we simulate the setting up ofDISCOVIR network using the peer clustering algorithm,Later, We initiate a query starting from a randomly se-lected peer and the retrieved data point is treated as rele-vant result if it belongs to the same class as the query datapoint. We simulate the environment using both syntheticdata and real data. For the synthetic data, each class ofdata points follows a Gaussian distribution and we gen-erate 200 classes of data points totally. For the real data,each class of data points is extracted from a category inCorelDraw’s Image Collection using the Color Momentplug-in module in DISCOVIR. Details of experiment pa-rameters are listed in Table 2. Fig. 10 shows the in-degreedistribution of peers under the simulated network. It issimilar to the real world Gnutella graph as indicated in[15] with a two-segment power-law distribution.

4.2 R/V Against Network Size

We measure the query efficiency against the number ofpeers with four different routing methods or data sets: (1)Brute Force Search (BFS), (2) FQM with 1 local clusterper peer using synthetic data ,(3) FQM with 3 local clus-ters per peer using synthetic data and (4) FQM with 1local cluster using real data. We did not simulate BFS be-cause the performance of flooding algorithm is equivalentto random probing [?]. As seen in Fig. 11, FQM outper-forms BFS algorithm and it can perform even better if anappropriate number of signature values per peer is used.As expected, recall and visited peers percentage in BFSare more or less equal because data classes are evenly dis-tributed among peers, the more peers visited, the moredesired data retrieved. The curve of FQM follows a bellshape with a long tail. Query efficiency increases at firstdue to two reasons: 1) the percentage of peers visited isinversely proportional to the number of peers when TTLis fixed and 2) FQM advances the recall percentage whenthe query message reaches the target cluster. When thenumber of peers increases further, a query might not reachits target cluster, so query efficiency starts to drop. Theresult shows that the improvement in FQM is reducedwhen real data is used, yet, it has on average about 20%improvement compared to BFS. This is when the real

8

Page 9: P2P Information Retrieval: A Self-Organizing …oak.cs.ucla.edu/~sia/tp3.pdfP2P Information Retrieval: A Self-Organizing Paradigm Ka Cheung Sia Department of Computer Science and Engineering

data cannot be clustered appropriately, see Fig. 13 andexplanation thereafter.

4.3 R/V Against TTL

Fig. 12 shows the measurement of query efficiency againstTTL of query message. Refer to the previous experiment,3 cases are tested, namely, (1), (2) and (4). The curveof FQM more or less follows bell shape because 1) WhenTTL value is low, the chance for first k walk to hit thetarget cluster is low and 2)further increasing the TTLvalue increases the chance to hit target cluster, thus R/Vreaches a maximum. It tails down because large TTLfloods the whole network with query message, while Vincreases, R/V in turn drops. From this experiment, weobserve that our proposed FQM depends on the choiceof TTL in order to achieve best performance. In thissimulation, we conclude that choosing 6 to be the TTLperform the best when the network consists of 8000 peers.

Fig. 13 might account for the inefficiency of Peer Clus-tering and FQM when using real data. From the Self-organizing map visualization of both synthetic data andreal data, we see that clusters in synthetic dataset aremore sparsely distributed, while clusters in real datasetare closely situated. Referring to Section 3.2, the distancemeasure between two peers is the distance between thecentroids of their clusters. As a result, global clusteringfor real dataset might be worse than synthetic dataset,thus, query efficiency is generally lower.

1 10 100 1000 100001

10

100Random Graph: 10000 nodes

# lin

ks

random graph

Figure 10: Distribution of in-degree of nodes in our sim-ulated network

5 Future Works

So far, we have built CBIR on P2P and proposed a basicframework for facilitating content-based search in P2P.The proposed method is not only applicable to image,but also other media, such as text and audio. Moreover,Gnutella 2 [8] will be coming out soon, we ought to figureout whether our proposed algorithm is still compatible ornot. Comparing with prevalent researches, our proposed

0 0.25 0.5 0.75 1 1.25 1.5 1.75 2

x 104

0

1

2

3

4

5

6

7R/V againt network size in 3 different search strategies

Number of peers

R/V

Synthetic Data 3 clustersSynthetic Data 1 clusterColor Moment Data 1 clusterBFS

Figure 11: R/V against number of peers

4 5 6 7 8 90.8

0.9

1

1.1

1.2

1.3

1.4

1.5

1.6

1.7

1.8R/V againt TTL in 3 different search strategies of size = 8000

TTL of query message

R/V

Synthetic DataColor Moment DataBFS

Figure 12: R/V against TTL of query message

5 10 15 20 25 30 35 40

5

10

15

20

25

30

35

40

45

50SOM visulization of synthetic data

dimension x

dim

ensi

on y

5 10 15 20 25 30 35 40

5

10

15

20

25

30

35

40

45

50SOM visualization of color moment image data

dimension x

dim

ensi

on y

Figure 13: SOM visualization of data used in simulation

algorithm need more in depth mathematical analysis tomodel the observations from simulation, the resemblancebetween P2P and Neural Networks might be an opportu-nity for us to conduct a formal analysis on P2P.

Relevance feedback is commonly used in informationretrieval system, especially for CBIR, while current re-trieval methods in P2P focus on the one-shot approach.We envision the potential of adding relevance feedback

9

Page 10: P2P Information Retrieval: A Self-Organizing …oak.cs.ucla.edu/~sia/tp3.pdfP2P Information Retrieval: A Self-Organizing Paradigm Ka Cheung Sia Department of Computer Science and Engineering

in P2P system using the resemblance between P2P andNeural Networks. Competitive Hebbian Learning in theliterature offers a way to describe organization of datagiven a certain number of input signal (observation ofdata). It shines light on our proposed P2P DecentralizedQuasi-structured architecture for organizing the network.Users’ relevance feedback might be considered as inputsignal, while we use Competitive Hebbian Learning ruleto train the network.

6 Conclusion

In this paper, we outline the design and implementationof DISCOVIR, a CBIR software running on P2P envi-ronment. Moreover, we propose a peer clustering andcontent-based query routing strategy to retrieve informa-tion based on their content efficiently over the P2P net-work. Our algorithm is resilient to nodes entering andleaving the network frequently, thus, applicable in an en-vironment of transient population of nodes like the In-ternet. We verify our proposed strategy by simulationswith different parameters to investigate the performancechanges subject to different network size and TTL ofquery message. We argue that current researches in De-centralized Unstructured P2P cannot be applied in thenotion of CBIR, thus, we show that our FQM outperformsthe BFS method, the only existing algorithm available forcomparison, in both network traffic cost and query effi-ciency measure.

References[1] R. J. J. Bayardo, R. Agrawal, D. Gruhl, and A. Somani. Youserv:

A web-hosting and content sharing tool for the masses. In Pro-ceedings of 11th World Wide Web Conference, May 2002.

[2] E. Cohen and S. Shenker. Replication Strategies in UnstructuredPeer-to-Peer Networks. In Proceedings of SIGCOMM’02, Pitts-burgh, Pennsylvania, USA, August 19-23 2002.

[3] G. Coulouris, J. Dollimore, and T. Kindberg. Distributed SystemsConcepts and Design. Addison-Wesley, third edition, 2001.

[4] DIStirbuted COntent-based Visual Information Retrieval.http://www.cse.cuhk.edu.hk/∼miplab/discovir.

[5] C. Faloutsos, R. Barber, M. Flickner, J. Hafner, N. W.,D. Petkovic, and W. Equitz. Efficient and Effective Querying byImage Content. Journal of Intelligent Information Systems: In-tegrating Artificial Intelligence and Database Technologies, 3(3-4):231–262, 1994.

[6] The Freenet homepage. http://freenet.sourceforge.net.

[7] The Gnutella homepage. http://www.gnutella.com.

[8] Gnutella2 Homepage. http://www.gnutella2.com/.

[9] R. C. Gonzalez and R. E. Woods. Digital Image Processing. Ad-dison Wesley, first edition, 1992.

[10] KaZaA file sharing network. http://www.kazaa.com/.

[11] I. King and Z. Jin. Relevance feedback content-based image re-trieval using query distribution estimation based on maximum en-tropy principle. In Proceedings to the International Conference onNeural Information Processing (ICONIP2001), pages 699–704,November 14-18 2001.

[12] T. K. Lau and I. King. Montage: An image database for the fash-ion, clothing, and textile industry in Hong Kong. In Proceedingsof the Third Asian Conference on Computer Vision (ACCV’98),Lecture Notes in Computer Science, January 4-7, 1998.

[13] The LimeWire homepage. http://www.limewire.com.

[14] Modern Peer-to-Peer File-Sharing over the Internet.http://www.limewire.com/index.jsp/p2p.

[15] Q. Lv, P. Cao, E. Cohen, K. Li, and S. Shenker. Search and Repli-cation in Unstructured Peer-to-Peer Networks. In Proceedings ofthe 16th annual ACM International Conference on supercomput-ing, 2002.

[16] S. Mehrotra, Y. Rui, M. Ortega, and T. Huang. SupportingContent-based Queries over Images in MARS. In Proc. IEEE Int.Conf. Multimedia Computing and Systems, pages 632–633, 1997.

[17] The Morpheus homepage. http://www.musiccity.com.

[18] The Napster homepage. http://www.napster.com.

[19] A. Natsev, R. Rastogi, and K. Shim. WALRUS: A SimilarityRetrieval Algorithm for Image Databases. In Proc. SIGMOD,Philadelphia, PA, 1999.

[20] C. H. Ng and K. C. Sia. Peer Clustering and Firework QueryModel. In Poster Proc. of The 11th International World WideWeb Conference, May 2002.

[21] A. Pentland, R. W. Picard, and S. Sclaroff. Photobook: Tools forContent-based Manipulation of Image Databases. In Proc. SPIE,volume 2185, pages 34–47, February 1994.

[22] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker.A Scalable Content-Addressable Network. In In Proc. ACM SIG-COMM, August 2001.

[23] A. Rowstron and P. Druschel. Pastry: Scalable, decentralized ob-ject location and routing for large-scale peer-to-peer systems. InProc. of the 18th IFIP/ACM International Conference on Dis-tributed Systems Platforms, November 2001.

[24] Y. Rui, T. S. Huang, and S.-F. Chang. Image Retrieval: CurrentTechniques, Promising Directions and Open Issues. Journal of Vi-sual Communication and Image Representation, 10:39–62, April1999.

[25] The Search for Extraterrestrial Intelligence homepage.http://www.setiathome.ssl.berkeley.edu.

[26] K. C. Sia, C. H. Ng, C. H. Chan, and I. King. P2P Content-BasedQuery Routing Using Firework Query Model. In The 2nd Interna-tional Workshop on Peer-to-Peer Systems, submitted, Feb 2003.

[27] A. W. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jai.Content-Based Image Retrieval at the End of the Early Years.IEEE Transaction on Pattern Analysis and Machine Intelli-gence, 22(12):1349–1380, December 2000.

[28] J. R. Smith and S. F. Chang. An Image and Video Search Enginefor the World Wide Web. In Proc. SPIE, volume 3022, pages84–95, 1997.

[29] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrish-nan. Chord: A Scalable Peer-to-peer Lookup Service for InternetApplications. In Proc. of ACM SIGCOMM, pages 149–160, Au-gust 2001.

[30] J. Z. Wang, G. Li, and G. Wiederhold. SIMPLIcity: Semantics-sensitive Integrated Matching for Picture LIbraries. In IEEETrans. on pattern Analysis and Machine Intelligence, volume 23,pages 947–963, 2001.

[31] T. Zhang, R. Ramakrishnan, and M. Livny. Birch: An efficientdata clustering method for very large databases. In In Proceed-ings of the 1996 ACM SIGMOD International Conference onManagement of Data, pages 103–114, 1996.

[32] B. Zhao, J. Kubiatowicz, and A. Joseph. Tapestry: An infrastruc-ture for fault-tolerant wide-area location and routing. Technicalreport, Computer Science Division, U.C. Berkeley, April 2001.

10