Random Walks in Peer-to-Peer Networks



Christos Gkantsidis, Milena Mihail, and Amin Saberi

College of Computing
Georgia Institute of Technology
Atlanta, GA
Email: {gantsich, mihail, saberi}@cc.gatech.edu

Abstract: We quantify the effectiveness of random walks for searching and construction of unstructured peer-to-peer (P2P) networks. For searching, we argue that random walks achieve improvement over flooding in the case of clustered overlay topologies and in the case of re-issuing the same request several times. For construction, we argue that an expander can be maintained dynamically with constant operations per addition. The key technical ingredient of our approach is a deep result of stochastic processes indicating that samples taken from consecutive steps of a random walk can achieve statistical properties similar to independent sampling (if the second eigenvalue of the transition matrix is bounded away from 1, which translates to good expansion of the network; such connectivity is desired, and believed to hold, in every reasonable network and network model). This property has been previously used in complexity theory for construction of pseudorandom number generators. We reveal another facet of this theory and translate savings in random bits to savings in processing overhead.

Keywords: Graph theory, Combinatorics, Statistics, peer-to-peer networks.

I. INTRODUCTION

The simulation of a random walk, or more generally a Markov chain, is a fundamental algorithmic paradigm with highly sophisticated and profound impact in algorithms and complexity theory. Furthermore, it has found a remarkably wide range of applications in such diverse fields as statistics, physics, artificial intelligence (Bayesian inference), vision, population dynamics, epidemiology, and bioinformatics, among others.

Recently, random walks have been proposed as primary algorithmic ingredients in protocols addressing searching and topology maintenance of unstructured P2P networks. In particular:

(a) Following extensive experimentation, [1] report that searching by simulating random walks has superior performance as compared to the standard approach of searching by flooding. They attribute the improved performance of random walks to their adaptivity in termination conditions and hence granularity in coverage of the search space (in flooding, increasing the TTL by 1 may increase the space coverage exponentially).

(b) [2] give a distributed algorithm for constructing and maintaining unstructured topologies with very strong connectivity properties, namely constant degree and constant expansion, with $O(\log n)$ overhead per addition of a peer, where $n$ is the number of peers. At a very high level, when a new peer arrives, their protocol simulates a random walk on the existing overlay topology which, after $O(\log n)$ steps, reaches a nearly uniformly random existing peer to which the new peer attaches.

What are the analytic reasons for the success of the random walk method? Can we isolate one or two comprehensible analytic primitives that explain the power of the method? Most important, can we translate these primitives to heuristics, or rules of thumb, for the use of the method in P2P network applications?

Independent sampling from the uniform distribution is a primordial statistical, and hence algorithmic, primitive. However, it is infeasible in many complex populations, such as the set of nodes of a P2P network, since this set is not maintained anywhere and it is also quite dynamic. In this paper we make the following argument: In every case where uniform sampling from the set of nodes of a P2P network would have been a good algorithmic approach, the random walk method is an excellent candidate (i) to simulate uniform sampling; moreover, (ii) the number of simulation steps required can be as low as the number of samples in independent uniform sampling, which translates to constant network overhead, independent of the size of the network.

In particular, beyond adaptivity to termination conditions and granularity in space coverage discussed in [1], we believe that the power of the random walk method can be pinned down to two kinds of analytic properties:

The first analytic property, corresponding to (i), is that for any population whose members can be connected by links forming a connected graph, performing a random walk which started at any state and using the state of the random walk at time $t$ as a sample point simulates sampling with arbitrary accuracy, for $t$ bounded by well defined parameters of the graph. This is rather intuitive, since we expect that graphs without any “bad” or “bottleneck” cuts will make the walk “lose memory” and hence reach a random state quite fast. In particular, for a wide range of applications, it is possible to construct sparse graphs so that $t = O(\log n)$. Such fast convergence rates translate to practically efficient simulation of uniform sampling.

The second analytic property, related to (ii), is substantially more profound and rather counter-intuitive. It states that starting the random walk at a random state, simulating the walk for $k$ steps, and using each visited node as a sample point, we may achieve the same statistical properties as $k$ independent uniform samples. This is counter-intuitive because of the obvious huge dependencies between successive steps of a random walk, $k$ of which have been used in the place of independent uniform sampling.

In this paper we focus on two central issues of P2P networks, namely searching and overlay topology construction. For both problems we have isolated scenarios where independent uniform sampling would have been a good algorithmic primitive. We then simulate $k$ independent samples with $k$ successive (up to constant factors) states visited during the simulation of a random walk on the P2P network. Finally, we compare the performance of sampling by random walk to the performance of the standard, or previously known, techniques in each application. Our results and conclusions are:

(a) For searching relatively popular items (when uniform sampling is obviously beneficial) we found, experimentally, that searching by random walks performs better than flooding (for the same number of network messages) in two cases: (i) When peers in the topology form clusters (representing, for example, thematic partitioning), so that the whole topology is arranged in 2 tiers, the lower representing peer clustering and the higher connecting representatives from each cluster (e.g. super-nodes) to ensure good global connectivity. (ii) When the same search request is re-issued repeatedly, in hopes of finding new peers, while the entire topology has not changed dramatically (less than 40%). We believe that both scenarios are realistic. However, to the best of our knowledge, they have not been considered in previous studies.

We believe that our results in the context of searching are intuitive. Our primary contribution was to formalize setups of practical interest and translate our analytic intuition to these setups.

(b) For constructing and maintaining a P2P topology with good connectivity properties, we turned to the approach of [2]. As mentioned before, when a new peer arrives, they look for a few nearly uniformly random existing peers to connect the new peer to, by simulating $O(\log n)$ steps of a random walk on the set of existing peers. This causes $O(\log n)$ network overhead per newly arriving peer. We introduce a daemon construction to simulate the second analytic property and connect a newly arriving peer with constant network overhead. In fact, there are strong dependencies between the edges of peers that arrive closely in time. We give analytic evidence that these local dependencies do not affect the global connectivity properties of the network, up to constant factors. We also give strong experimental evidence that our daemon construction simulates the algorithm of [2] (and other related constructions) (a) with overhead a very small constant per arriving peer, (b) for networks up to 5M nodes (which is the currently believed size of Kazaa; larger experiments were stressing the memory limits of our machines), and (c) with truly negligible penalty in the quality of the connectivity of the overall topology.

We believe that our results in the context of P2P network construction are particularly surprising. From a practical point of view, they indicate that a minimal amount of (correctly used) randomization suffices to keep a dynamic network well connected. From a theoretical point of view, we believe that they lead to an exciting new problem and paradigm in the study of the power and necessity of randomness. All of our algorithms in the context of construction were inspired by the second analytic property of random walks.

The isolation of the second property has been one of the most celebrated results in complexity theory [3], [4], [5], [6]. In complexity theory this property has been used as follows. Consider a randomized algorithm that uses $r$ random bits and has probability of success 1/2. By simulating the algorithm $k$ times we may decrease the failure probability to $1/2^k$. This needs $rk$ random bits. Now think of the nodes of a graph labeled by all $2^r$ $r$-bit strings that can be used to simulate a randomized algorithm. Consider a constant degree “expander” graph $G$ imposed on these $2^r$ nodes (there are known deterministic constructions of such “expander” graphs). Start a random walk from a uniformly random point of $G$ and simulate a random walk on $G$ for $k$ steps. This requires $O(r)$ random bits to pick the initial point, and $O(k)$ bits to perform the walk. Then simulate the randomized algorithm on the $k$ $r$-bit strings visited by the random walk. The failure probability is $O(1/2^k)$, and yet we used only $O(r + k)$ random bits to perform the experiment! This observation is the basis of some of the strongest known pseudorandom number generators with provably good performance. In some sense, our work can also be viewed as translating a complexity result, mostly known in the context of savings in random bits, into savings in overhead and improved performance in a practical networking context.
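As a rough illustration of this accounting, the following sketch compares the random-bit budget of $k$ independent runs with that of a $k$-step walk. The function names, the degree parameter `d`, and the assumption that one step on a $d$-regular graph costs $\lceil \log_2 d \rceil$ random bits are ours, chosen only for illustration.

```python
import math


def bits_independent(r: int, k: int) -> int:
    """Random bits for k independent runs, each consuming r fresh bits."""
    return r * k


def bits_expander_walk(r: int, k: int, d: int) -> int:
    """Random bits for a k-step walk on a d-regular graph over 2**r nodes.

    r bits select the uniformly random starting node; each step selects one
    of d neighbors, which costs ceil(log2(d)) bits (assumed encoding).
    """
    bits_per_step = math.ceil(math.log2(d))
    return r + k * bits_per_step


if __name__ == "__main__":
    r, k, d = 100, 50, 8  # hypothetical values, for illustration only
    print(bits_independent(r, k))       # 5000
    print(bits_expander_walk(r, k, d))  # 250
```

For constant `d` the second quantity grows as $r + O(k)$ rather than $rk$, which is the saving described above.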

The balance of the paper is as follows: In Section II, we give the supporting theory, comparing coupon collection and Chernoff bounds to the corresponding statements in random walks. All the technical parts of this section are known; their synthesis and relevance in the context of networking is new. In Section III, we use random walks to perform searching in P2P networks and compare this approach to searching using flooding and uniform sampling. Note that flooding corresponds to breadth-first search, whose statistics are not as well quantified as those of random walks. In Section IV, we describe two algorithms for distributed construction of P2P topologies with good expansion properties. We stress that our experiments in Sections III and IV are on the scale of current P2P networks. Many implementation and other details are suppressed to emphasize the new ideas clearly and for lack of space. We conclude in Section V.

II. STATISTICAL ESTIMATION AND RANDOM WALKS

In this section we focus on the statistical properties of sampling performed (a) by ideal independent draws, and (b) by simulating a random walk. In the case of independent sampling we are interested in the number of samples sufficient to achieve a certain statistical property. In the case of random walks we are interested in the number of simulation steps sufficient to achieve the same statistical property.

For comparison, we consider two common abstractions, namely the coupon collection problem and Chernoff bounds for independent Bernoulli trials; these abstractions refer to sampling by independent draws. For sampling by random walk simulation, we consider the cover time, which is the suitable analogy to coupon collection, and the trajectory sample average, which is the suitable analogy to independent Bernoulli trials.

We observe that, for both abstractions, the overhead of the random walk simulation method is determined only by the second eigenvalue of the probability transition matrix of the random walk. In particular, when the second eigenvalue is constant (independent of the size of the graph) the random walk method achieves the same statistical characteristics as independent sampling, up to $O(\cdot)$ notation.

The reason why the second eigenvalue provides such clean characterizations is that it is intimately related to global connectivity properties of the graph, namely expansion and conductance. Intuitively, expansion and conductance express the worst-case “cuts” of the graph, and it is natural to expect that when the graph does not have “bad cuts” a random walk approaches its stationary distribution very fast, and hence sampling by random walk “mimics” independent sampling well.

A. Coupon Collection and Chernoff Bounds

The coupon collection problem is the following: Suppose that there are $n$ distinct types of coupons. At each step, we draw a coupon whose type is uniformly distributed among all $n$ types. Let $C_n$ be the time by which we have encountered coupons belonging to all $n$ distinct types. It is well known [7] that

$$E[C_n] = \frac{n}{n} + \frac{n}{n-1} + \frac{n}{n-2} + \cdots + \frac{n}{1} = O(n \log n). \qquad (1)$$

Let $\gamma$ be a constant, $0 < \gamma < 1$. Let $C_{\gamma n}$ be the time by which we have encountered coupons belonging to $\gamma n$ distinct types. It is also well known that

$$E[C_{\gamma n}] = \frac{n}{n} + \frac{n}{n-1} + \cdots + \frac{n}{n - \gamma n + 1} \le \frac{n}{1-\gamma} = O(n). \qquad (2)$$

We proceed with an outline of Chernoff bounds [8]. Let $X_1, \ldots, X_k$ be independent Bernoulli trials with $\Pr[X_i = 1] = p$ and $\Pr[X_i = 0] = 1 - p$, $0 \le p \le 1$, $1 \le i \le k$. Let $X = \sum_{i=1}^{k} X_i$, hence $E[X] = kp$. In a searching context, where $p$ denotes the probability that a randomly drawn object has a desired property, we are interested in the probability that the property is found substantially fewer times than its frequency in the search space. This corresponds to the event $X \le (1-\delta)kp$, for $0 < \delta < 1$. For this event, Chernoff bounds are

$$\Pr[X \le (1-\delta)kp] \le e^{-\delta^2 kp/2}. \qquad (3)$$

In a measurement context, where $p$ denotes the fraction of objects satisfying a certain property, we are interested in the quality of the estimator $X/k$ for $p$. Now, for $0 < \delta < 1$, Chernoff bounds are:

$$\Pr\left[\,\left|\frac{X}{k} - p\right| \ge \delta p\,\right] \le 2e^{-\delta^2 kp/3}. \qquad (4)$$
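To make the independent-sampling baselines concrete, the following minimal sketch evaluates the coupon collection sums, for all $n$ types and for a fraction $\gamma$ of them, together with the standard textbook lower-tail Chernoff bound $e^{-\delta^2 kp/2}$. The function names and the example values are our own illustrative assumptions.

```python
import math


def expected_full_collection(n: int) -> float:
    """Expected draws to see all n coupon types: n/n + n/(n-1) + ... + n/1."""
    return sum(n / j for j in range(1, n + 1))


def expected_partial_collection(n: int, gamma: float) -> float:
    """Expected draws to see gamma*n distinct types: n/n + ... + n/(n - gamma*n + 1)."""
    m = round(gamma * n)  # number of distinct types to collect
    return sum(n / j for j in range(n - m + 1, n + 1))


def chernoff_lower_tail(k: int, p: float, delta: float) -> float:
    """Standard bound on Pr[X <= (1 - delta) k p] for k Bernoulli(p) trials."""
    return math.exp(-delta ** 2 * k * p / 2)


if __name__ == "__main__":
    n = 1000
    print(expected_full_collection(n))          # about n ln n + 0.58 n
    print(expected_partial_collection(n, 0.5))  # about n ln 2, well below n / (1 - 0.5)
    print(chernoff_lower_tail(10000, 0.001, 0.5))
```

Collecting half of the types takes a number of draws linear in $n$, whereas collecting all of them costs an extra logarithmic factor; this is the gap that the cover time bounds for random walks are measured against.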

B. Random Walks, Convergence, Cover Time and Trajectory Sample Average

Let $G(V, E)$ be an undirected connected graph, $|V| = n$. Let $d_i$ denote the degree of vertex $i$, $1 \le i \le n$. Let $d_{\min} = \min_{1 \le i \le n} d_i$. Let $A = (a_{ij})$, $1 \le i, j \le n$, be the adjacency matrix of $G$. Let $P$ be the transition matrix of the random walk on $G$, where a particle that is on vertex $i$ at time $t$ moves to a neighbor of $i$ at time $t+1$, chosen uniformly at random among all neighbors of $i$; that is, $p_{ij} = a_{ij}/d_i$. It is well known and easy to verify that the above random walk has a unique stationary distribution $\pi$, in the sense that $\pi P = \pi$, with $\pi_i = d_i/(2|E|)$, $1 \le i \le n$, and let $\pi_{\min} = d_{\min}/(2|E|)$.

Let $f$ be a 0-1 function on $V$, $f : V \to \{0, 1\}$. Let $p$ be the probability mass of vertices that take the value 1 under $f$ under the stationary distribution $\pi$, which is the same as the mean of $f$ under $\pi$:

$$p = \sum_{v : f(v) = 1} \pi_v = \sum_{v \in V} \pi_v f(v). \qquad (5)$$

Of particular interest are the following three metrics: (a) Convergence rate, which is the rate with which the random walk approaches the stationary distribution. (b) Cover time, which is the time when the random walk has visited all vertices at least once. This is analogous to the coupon collection abstraction, and we wish to have bounds comparable to (1) and (2). (c) Trajectory sample average, which is the rate with which the value of $f$, averaged over successive vertices of a trajectory of the random walk, approaches $p$. This is analogous to the Chernoff bound abstraction, and we wish to have bounds comparable to (4). In the next subsection we point out bounds for all the above metrics in terms of the second eigenvalue of $P$.
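The stationary distribution of the simple random walk and the stationary mean of a 0-1 function can be checked exactly on a small example. The sketch below uses exact rational arithmetic; the 4-vertex graph and the function `f` are hypothetical, chosen only so that the degrees differ.

```python
from fractions import Fraction


def transition_matrix(adj):
    """Row-stochastic matrix with entry 1/d_i for each neighbor j of vertex i."""
    n = len(adj)
    P = [[Fraction(0)] * n for _ in range(n)]
    for i, nbrs in enumerate(adj):
        for j in nbrs:
            P[i][j] += Fraction(1, len(nbrs))
    return P


def stationary(adj):
    """Degree-proportional distribution d_i / (2|E|) of an undirected graph."""
    two_m = sum(len(nbrs) for nbrs in adj)  # sum of degrees equals 2|E|
    return [Fraction(len(nbrs), two_m) for nbrs in adj]


def step(dist, P):
    """One step of the walk acting on a distribution: returns dist * P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


# Hypothetical example: triangle 0-1-2 with a pendant vertex 3 attached to 2.
adj = [[1, 2], [0, 2], [0, 1, 3], [2]]
P = transition_matrix(adj)
pi = stationary(adj)             # [1/4, 1/4, 3/8, 1/8]
f = [1, 0, 0, 1]                 # 0-1 function marking vertices 0 and 3
p = sum(pi_v * f_v for pi_v, f_v in zip(pi, f))  # stationary mean of f
```

Here `step(pi, P)` returns `pi` unchanged, and `p` equals 3/8, the stationary mass of the marked vertices.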

C. Bounds in terms of the Second Eigenvalue

In general, a vector $\vec{x}$ is an eigenvector of $P$ with eigenvalue $\lambda$ if and only if $\vec{x} P = \lambda \vec{x}$. Thus, $\pi$ is an eigenvector of $P$ with eigenvalue 1. It is well known that $P$ has $n$ real eigenvectors with corresponding eigenvalues $1 = \lambda_1 > \lambda_2 \ge \cdots \ge \lambda_n \ge -1$ [9], [10]; the strict separation of the first and second eigenvalues follows from the connectivity of $G$. The first eigenvalue $\lambda_1 = 1$, which corresponds to eigenvector $\pi$, characterizes stationarity. We may also assume that $|\lambda_2| > |\lambda_n|$ (large negative eigenvalues concern strong periodicities, like bipartiteness, which we may exclude for the purposes of this paper).

Consider a random walk on $G$ according to the transition matrix $P$, starting from an arbitrary vertex, or an arbitrary distribution on $V$. Let $Z_t$ be the vertex that the random walk visits at time $t$, $1 \le t < \infty$. To bound the convergence rate of the random walk we focus on the so-called variation distance which, at time $t$, is $\Delta(t) = \max_{S \subseteq V} \left| \Pr[Z_t \in S] - \pi(S) \right|$. The following is known [11]:

$$\Delta(t) \le \pi_{\min}^{-1} \lambda_2^{t}.$$

Let $\mathcal{C}_n$ be the time by which the above random walk visits all the vertices of $G$. [12], [13] show:

$$E[\mathcal{C}_n] \le O\!\left( \pi_{\min}^{-1} \frac{\log n}{1 - \lambda_2} \right) = O\!\left( \frac{n \log n}{1 - \lambda_2} \right), \quad \text{for } \pi_{\min} = \Theta(1/n). \qquad (6)$$

Compare (6) to (1) and realize that, for constant $\lambda_2$, they both solve coupon collection in the same order of magnitude.

Let $\mathcal{C}_{\gamma n}$ be the time by which the random walk visits $\gamma n$ distinct vertices of $G$, for some constant $\gamma$, $0 < \gamma < 1$. It is straightforward to derive from [12], [13] (and is folklore among probabilists) that

$$E[\mathcal{C}_{\gamma n}] \le \frac{1}{1-\gamma}\, O\!\left( \pi_{\min}^{-1} \frac{1}{1 - \lambda_2} \right) = \frac{1}{1-\gamma}\, O\!\left( \frac{n}{1 - \lambda_2} \right), \quad \text{for } \pi_{\min} = \Theta(1/n). \qquad (7)$$

Compare (7) to (2) and realize that, for constant $\lambda_2$, they both solve partial coupon collection in the same order of magnitude.

Recall that $Z_t$ denotes the vertex that the random walk visits at time $t$. Let $Y_t = f(Z_t)$. Suppose that we simulate the random walk for

$$t_0 = \frac{\log \pi_{\min}^{-1}}{1 - \lambda_2} \ge \frac{\log \pi_{\min}^{-1}}{\log \lambda_2^{-1}} \qquad (8)$$

steps (very roughly, this guarantees good approach to stationarity according to the bound on $\Delta(t)$) and then use the next $k$ steps as sample points. We thus let $Y = \sum_{t = t_0 + 1}^{t_0 + k} Y_t$. The particularly strong result is that using $Y/k$ as an estimator for $p$ is of the same quality as Chernoff bounds, which refer to totally independent samples. Thus, despite the local dependencies introduced by consecutive steps of the random walk, the overall distribution of the vertices visited by the random walk is well spread across the sample space. In particular, (9) below is to be compared to (4):

$$\Pr\left[\,\left|\frac{Y}{k} - p\right| \ge \delta p\,\right] \le c_1\, e^{-c_2\, \delta^2 p^2 k (1 - \lambda_2)}, \qquad (9)$$

where $c_1$ and $c_2$ are positive absolute constants.

The above result was obtained (in increasingly stronger forms and referring to pseudorandom number simulation) in a sequence of celebrated complexity theory papers [3], [4], [5], [6]. The version that we give above is from [6].
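The comparison between consecutive walk steps and independent draws can be tried out with a small simulation. The sketch below is only an illustration under our own assumptions: the graph (a ring plus a random perfect matching, giving a 3-regular multigraph with uniform stationary distribution), the burn-in of 200 steps (picked by hand, not computed from the eigenvalue), the marked set, and all names are hypothetical.

```python
import random


def ring_with_matching(n, rng):
    """3-regular multigraph on n (even) vertices: a ring plus a random perfect matching."""
    adj = [[(i - 1) % n, (i + 1) % n] for i in range(n)]
    perm = list(range(n))
    rng.shuffle(perm)
    for a, b in zip(perm[0::2], perm[1::2]):
        adj[a].append(b)
        adj[b].append(a)
    return adj


def walk_estimate(adj, f, burn_in, k, rng):
    """Average f over k consecutive walk steps taken after burn_in initial steps."""
    v = rng.randrange(len(adj))
    for _ in range(burn_in):
        v = rng.choice(adj[v])
    total = 0
    for _ in range(k):
        v = rng.choice(adj[v])
        total += f[v]
    return total / k


def independent_estimate(f, k, rng):
    """Average f over k independent uniform draws (the idealized baseline)."""
    return sum(f[rng.randrange(len(f))] for _ in range(k)) / k


rng = random.Random(0)
n, k = 2000, 20000
adj = ring_with_matching(n, rng)
f = [1 if v % 10 == 0 else 0 for v in range(n)]  # marked fraction p = 0.1
est_walk = walk_estimate(adj, f, burn_in=200, k=k, rng=rng)
est_indep = independent_estimate(f, k, rng)
```

On such a graph both `est_walk` and `est_indep` should land close to 0.1, even though the walk pays one message per sample while each independent draw would require global knowledge of the node set.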

D. Second Eigenvalue, Expansion and Conductance

For $S \subset V$, define the cutset of $S$, $\mathcal{C}(S)$, as the set of edges with one endpoint in $S$ and the other endpoint in $\bar{S}$. Define the volume of $S$ as the sum of the degrees of vertices in $S$: $\mathrm{vol}(S) = \sum_{i \in S} d_i$. The expansion and the conductance of $G$ are:

$$\alpha = \min_{\substack{S \subset V \\ |S| \le |V|/2}} \frac{|\mathcal{C}(S)|}{|S|}, \qquad \Phi = \min_{\substack{S \subset V \\ \mathrm{vol}(S) \le \mathrm{vol}(V)/2}} \frac{|\mathcal{C}(S)|}{\mathrm{vol}(S)}. \qquad (10)$$

In addition, the following bound is known [11]:

$$1 - 2\Phi \le \lambda_2 \le 1 - \frac{\Phi^2}{2}. \qquad (11)$$

Finally, realize that in graphs where we have bounds on the minimum and maximum degrees, expansion, conductance and eigenvalues are all easily related. In particular, in regular graphs, if any of these metrics is a constant then all of them are constants.
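For very small graphs, expansion and conductance can be computed by brute force straight from their definitions as minima of cut ratios over vertex subsets. The following sketch does so for a 6-cycle and checks the Cheeger-type inequality $1 - 2\Phi \le \lambda_2 \le 1 - \Phi^2/2$, using the known fact that the random walk on an $n$-cycle has second eigenvalue $\cos(2\pi/n)$. The function names are ours, and the enumeration is exponential in the number of vertices, so this is a sanity check rather than a practical method.

```python
import math
from itertools import combinations


def cut_size(adj, S):
    """Number of edges with exactly one endpoint in the vertex set S."""
    return sum(1 for u in S for v in adj[u] if v not in S)


def expansion_and_conductance(adj):
    """Minimize cut/|S| and cut/vol(S) over all admissible subsets (small graphs only)."""
    n = len(adj)
    vol_total = sum(len(nbrs) for nbrs in adj)
    best_exp = best_cond = float("inf")
    for size in range(1, n):
        for subset in combinations(range(n), size):
            S = set(subset)
            cut = cut_size(adj, S)
            vol = sum(len(adj[u]) for u in S)
            if 2 * size <= n:           # |S| <= |V| / 2
                best_exp = min(best_exp, cut / size)
            if 2 * vol <= vol_total:    # vol(S) <= vol(V) / 2
                best_cond = min(best_cond, cut / vol)
    return best_exp, best_cond


cycle6 = [[(i - 1) % 6, (i + 1) % 6] for i in range(6)]
alpha, phi = expansion_and_conductance(cycle6)  # 2/3 and 1/3
lambda2 = math.cos(2 * math.pi / 6)             # 0.5 for the 6-cycle
```

The minimizing set is an arc of three consecutive vertices, cut by two edges. The cycle is a poor expander, so the gap $1 - \lambda_2$ shrinks as the cycle grows, in contrast to the constant-gap families considered in this paper.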

In summary, for families of graphs where $\lambda_2$ is constant, consecutive states of random walks are excellent candidates to approximate independent uniform sampling. Since $\lambda_2$ constant is equivalent to expansion $\alpha$ constant, and constant expansion is equivalent to good global connectivity, it is reasonable to try the random walk approach in communication networks. Good global connectivity is desired and believed to hold in all reasonable networks and network models [14], [15], [16], [17], [18], [19].

III. SEARCHING

In this section we study the performance of searching using flooding and random walks, and compare the two methods to each other and to a baseline case of independent uniform sampling. We measure the performance in terms of the average number of distinct copies of an item located in the search, the probability of not finding any copy of the item even though there are copies in the network, and the number of messages that the searching algorithm uses. We show experimentally that random walk is better than flooding if at least one of the following conditions holds:

• The user issues multiple search requests for the same item and between two consecutive requests the topology changes relatively slowly (in the sense that two consecutive snapshots of the topology are highly correlated).
• There is peer clustering. That is, there are communities in the topology, with dense connectivity between peers in the same community and sparse connectivity between peers of different communities.

We believe that the two scenarios introduced above are important for the following reasons:

In practice, when a user issues a request, the user (or the system on behalf of the user) re-issues the same request multiple times hoping to locate more sources. Consecutive floodings take advantage of the changes in the topology that happen in between and can discover more sources. If, however, the topology remains mostly unchanged between consecutive requests, then the new floodings will mostly discover sources already known. On the other hand, for the same number of messages, the random walk follows totally different trajectories and has better chances to discover new sources. Note that in previous work, searching has always been modeled as a one-time process. However, we believe that studying searching under multiple requests and a changing topology is realistic and important.

The motivation behind studying peer clustering becomes clear if we consider the process by which the P2P network is formed. Each peer keeps a cache of other peers and picks its neighbors from its cache. The cache is populated by the addresses of peers that answered previous queries [20]. Thus, intuitively, the cache contains addresses of peers that have similar interests. It is therefore reasonable to expect that this process leads to the formation of communities of users. The exact process by which P2P networks are formed is largely unknown and thus peer clustering is at this point only a hypothesis. But we believe that it is a fair hypothesis based both on our practical experience with P2P systems, and on the observation that most networks grown in a decentralized way exhibit strong clustering properties. See also [21], [22], [23], [24] and, for related discussion, [25].

The rest of the section is organized as follows. First, we give the methodology. Then, we discuss the simplest case of a topology without clustering that does not change with time. In this scenario, we find that flooding and random walk behave similarly. Next, we examine topologies with peer clustering. Next, we examine multiple re-issues of a request in topologies that change slowly with time. In the last part we discuss power-law graphs and real topologies. Random walks behave better than flooding in most cases of interest.

A. Methodology

In all our experiments we assume that copies of the item to be discovered populate a fraction $\gamma$ of the peers, where $\gamma$ is a parameter with $0.01\% \le \gamma \le 0.1\%$. Assuming that a single search reaches 10000 distinct nodes, a typical value of the horizon of a user in the Gnutella network [20], the search will result, in expectation, in 1 to 10 distinct copies found for the range of $\gamma$'s we have experimented with (this represents items that are not rare). The performance of each searching technique is measured as the number of distinct copies located when simulating the searching algorithm from a randomly chosen peer of the topology. To make statistically robust conclusions, we repeat the experiment from a set of randomly chosen peers, typically 500 peers, and study the distribution of the number of distinct copies located (hits).
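The arithmetic behind the "1 to 10 distinct copies" figure, and the corresponding failure probability for the uniform sampling baseline, can be sketched as follows. The function names are ours, and the failure probability assumes independent uniform draws, which only approximates a search that reaches distinct peers.

```python
def expected_hits(horizon: int, gamma: float) -> float:
    """Expected number of copies among `horizon` uniformly chosen peers."""
    return horizon * gamma


def failure_probability(horizon: int, gamma: float) -> float:
    """Probability that `horizon` independent uniform draws all miss the item."""
    return (1.0 - gamma) ** horizon


if __name__ == "__main__":
    for gamma in (0.0001, 0.001):  # 0.01% and 0.1% of the peers hold a copy
        print(gamma, expected_hits(10000, gamma), failure_probability(10000, gamma))
```

At the low end of the range a single uniform search of 10000 peers expects one hit yet misses the item with probability close to $1/e$, which is why the failure probability is tracked alongside the mean.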


a) Performance Metrics: A metric that summarizes the distribution of the hits is the mean. Of equal importance, however, are the discrepancy around the mean and the failure probability (probability of no copies discovered). Even when the means of random walks and flooding are the same, it is almost always the case that the discrepancy and failure probability of random walk are substantially better than those of flooding (e.g. see Figure 2). We therefore measure the mean (Mean), the standard deviation (Std), and the failure probability.

b) Cost: We measure the cost of each searching technique as the number of messages or queries performed during the search. When comparing different algorithms, it is always under the assumption of using the same number of messages.

c) Peer-to-peer topologies: Available topologies of current P2P networks are limited in size and of questionable quality due to the collection method (topologies from [26] have only 30-40K nodes, when current P2P networks have hundreds of thousands and perhaps millions of users). We have therefore experimented on synthetic topologies of up to 1 million nodes. Experimenting with extremal synthetic topologies has the additional advantage of facilitating the demonstration of general principles.

We have used the following models to generate syn-thetic P2P topologies:Ö Flat regular expanders. This is a canonical example

of regular graphs with good expansion properties.We use expanders since expansion is desired andbelieved to hold in every reasonable network andnetwork model [17], [2], [18]. We have used 6-regular expanders.Ö Two-tier topologies with clustering. To study theeffects of peer clustering we have started by con-structing a number of isolated regular expanders thatcorrespond to the clusters. Then, from each clusterwe pick a small number of nodes at random andconnect them using another regular expander.Ö Power-law graphs. Many important networks thatarise in a decentralized fashion are known to havepower-laws [15], [23], [27]. Some researches arguethat P2P topologies may also possess heavy tails[28]. We have used the standard model of growthwith preferential connectivity to generate power-lawrandom graphs (this model runs in linear time andhence can efficiently generate graphs of very largesize).Ö Samples of real topologies. We have used partialviews of the Gnutella topology made available in[26]. These topologies are limited in the number ofpeers (around 35K) and of questionable accuracy,

since the topology evolves during the topologydiscovery process, some peers are uncooperativeand for other practical reasons. Because of theirvery limited size, our results are inconclusive.d) Dynamic Topologies: The dynamic nature of the

P2P topologies is a crucial parameter of these systems[29]. However, very few things are known about the waythese topologies evolve over time. To model the dynamicnature of the P2P topologies we have used the followingheuristic: We perform a number of “rewirings”; for eachrewiring we pick two edges uniformly at random andexchange their end points. The number of rewirings isa parameter that is related to the speed by which thetopology is changing. In our experiments the measure-ments are happening before and after the rewirings andnot during the process of changing the topology. The sizeof the topology remains unchanged during the changes,since we want to capture only effects of changes inthe connectivity of the topology and not the effects ofchanges in the number of peers.

In our experiments we measure the speed at which the topology is changing as the ratio of the number of links changed by the rewirings over the total number of links. We have experimented with ratios in the range from 2% to 40%. The rate of change in current P2P networks is not known and is difficult to estimate. However, from our practical experience, we observe that consecutive searches issued every 10-20 min do not result in the differences in discovered hosts that would be expected if a large fraction of the network had changed.

e) Content placement: The straightforward approach is to pick the nodes that will host the copies uniformly at random from the entire population. This is what we have used in our experiments. We have also experimented with cases where the nodes that host the item are close to each other in the topology (content clustering). In our experiments, we have observed that content clustering affects the performance of searching by flooding or random walk much less than peer clustering or re-issuing of the same request. We therefore do not present this case.

B. Flat Topologies with Uniformly Distributed Content

We start by examining a scenario in which the performance of flooding and random walks is similar; see Table I. We study the performance of issuing a request only once in a flat regular topology of 500K peers. After simulating the flooding algorithm with a time to live (TTL) of 5 and counting the number of messages, we ran the random walk algorithm and configured it to use the


TABLE I
PERFORMANCE OF SEARCHING IN A STATIC TOPOLOGY WITHOUT PEER CLUSTERING.

Attribute       Flooding    RW        Uniform
Mean            8.712       8.796     10.990
Std             3.01        2.93      3.22
Min             1           2         3
Messages        22331       22331     22331
Unique peers    17235       17431     21839

Note: 500K peers, popularity 0.05%. Min is the minimum number of hits over all searching requests. Unique peers is the number of distinct peers discovered during the search. Observe that flooding and RW have very similar performance, while uniform sampling is better.

[Figure 1: number of hits (y-axis) for each of 500 search requests in sorted order (x-axis); curves: Flooding, Random Walk, Uniform.]

Fig. 1. Sorted number of hits when searching from 500 randomly chosen peers. Topology of 500K peers, popularity 0.05%. Observe that flooding and random walk have very similar performance.

same number of messages. Observe that the mean and minimum numbers of hits, and the standard deviation of the hits distribution, are roughly the same for flooding and random walk, while independent uniform sampling is better. Moreover, the entire distribution of hits, given in Figure 1, is similar for both random walk and flooding.

We have experimented with topologies of various sizes and for various popularities of files (Tables II and III cover popularities from 0.01% to 0.10%), and found that the performance of flooding and random walk is always similar when both are allowed to use the same number of messages.

Observe that, compared to the study of [1], we use only one walker. The results were similar when using a larger number of walkers, assuming that the total number of messages stays the same. The use of more walkers increases the user-perceived delay, which is a parameter we do not study in this paper.
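The equal-budget comparison above can be sketched on a scaled-down topology. The sketch rests on several assumptions that are not taken from the experiments: the helper names are hypothetical, the 6-regular topology is built as a union of three random Hamilton cycles, every peer that receives the query for the first time forwards it to all of its neighbors, and each forwarded copy counts as one message.

```python
import random

def hamilton_union(n, d, rng):
    """Adjacency lists of the union of d random Hamilton cycles (degree 2d)."""
    adj = [[] for _ in range(n)]
    for _ in range(d):
        order = list(range(n))
        rng.shuffle(order)
        for k in range(n):
            u, v = order[k], order[(k + 1) % n]
            adj[u].append(v)
            adj[v].append(u)
    return adj

def flood(adj, source, ttl):
    """TTL-limited flooding; returns (peers reached, messages sent)."""
    visited, frontier, messages = {source}, [source], 0
    for _ in range(ttl):
        nxt = []
        for u in frontier:
            for v in adj[u]:
                messages += 1            # one message per forwarded copy
                if v not in visited:
                    visited.add(v)
                    nxt.append(v)
        frontier = nxt
    return visited, messages

def random_walk(adj, source, steps, rng):
    """Single walker; each step costs one message."""
    visited, u = {source}, source
    for _ in range(steps):
        u = rng.choice(adj[u])
        visited.add(u)
    return visited

rng = random.Random(1)
adj = hamilton_union(2000, 3, rng)
holders = set(rng.sample(range(2000), 100))   # peers holding a copy of the item
flooded, budget = flood(adj, 0, ttl=3)
walked = random_walk(adj, 0, budget, rng)     # same message budget as flooding
flood_hits = len(flooded & holders)
walk_hits = len(walked & holders)
```

Averaging `flood_hits` and `walk_hits` over many source peers would give the kind of mean, standard deviation and minimum reported in Table I; a single run such as this one only illustrates the bookkeeping.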

C. Topologies with Peer Clustering

We now examine a topology with well separated communities of peers and show that random walk has better performance than flooding. The example topology

[Figure 2: number of hits (y-axis) for each of 500 search requests in sorted order (x-axis); curves: Flooding, Random Walk, Uniform.]

Fig. 2. Sorted number of hits in a topology with peer clustering. The distribution of random walk is more concentrated around the mean. Topology of 200K peers, popularity 0.05%.

is constructed as follows. We generate five flat regular graphs, each of size 40K. From each topology we pick 1000 nodes at random (for a total of 5K nodes) and construct another flat regular graph on the selected nodes. The final P2P network is the composition of all topologies.
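A scaled-down version of this two-tier construction can be sketched as follows. The function names and the sizes are hypothetical, and each "flat regular graph" is assumed here to be a union of random Hamilton cycles, which is one convenient way to obtain a regular graph and not necessarily the generator used for the experiments.

```python
import random

def add_hamilton_cycles(adj, nodes, d, rng):
    """Add d random Hamilton cycles over the given nodes (adds 2d to each degree)."""
    nodes = list(nodes)
    for _ in range(d):
        rng.shuffle(nodes)
        for k in range(len(nodes)):
            u, v = nodes[k], nodes[(k + 1) % len(nodes)]
            adj[u].append(v)
            adj[v].append(u)

def two_tier(num_clusters, cluster_size, gateways_per_cluster, d, rng):
    """Isolated regular clusters, plus a regular graph on a few nodes per cluster."""
    adj = [[] for _ in range(num_clusters * cluster_size)]
    gateways = []
    for c in range(num_clusters):
        members = range(c * cluster_size, (c + 1) * cluster_size)
        add_hamilton_cycles(adj, members, d, rng)                 # intra-cluster links
        gateways += rng.sample(members, gateways_per_cluster)
    add_hamilton_cycles(adj, gateways, d, rng)                    # inter-cluster links
    return adj, gateways

# The text uses 5 clusters of 40K peers with 1000 selected nodes each;
# the sizes below are reduced so that the sketch runs quickly.
adj, gateways = two_tier(5, 400, 10, 3, random.Random(0))
```

Only the selected nodes carry inter-cluster links, so a flood that starts far from them spends most of its messages inside one cluster, which is the effect examined in this section.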

The performance of each searching algorithm on the example topology, for different content popularities, is given in Table II. First, observe that the average number of hits using random walks was slightly better than using flooding, but, given the large standard deviations, the differences are not statistically significant. Even though the two schemes behave similarly with respect to the average number of hits, they behave very differently when it comes to the failure rate and the minimum number of copies found. For popularity 0.01%, flooding failed in 28.8% of the cases and random walk in only 10.8%, an improvement by a factor of about 2.7.

The fundamental difference between the two searching algorithms is that the number of hits in the case of random walks appears more concentrated around the mean (observe also the entire distribution of hits in Figure 2). Indeed, the standard deviation in the case of random walks is much smaller than the standard deviation of flooding. The best concentration around the mean in Table II is achieved by uniform sampling. The basic strength of random walks, following Section II, is precisely that they resemble uniform sampling in a quantifiable way.

D. Re-issuing the Same Query

In this section, we study the performance of flooding and random walks in flat regular topologies under the assumption that users issue the same query multiple times while the topology is changing. Note that a dynamic topology favors only flooding, while random walks will


TABLE II
PERFORMANCE OF SEARCHING IN A TOPOLOGY WITH PEER CLUSTERING.

            Popularity 0.01%              Popularity 0.05%              Popularity 0.10%
Method      Mean    Std    Min  Failure   Mean    Std    Min  Failure   Mean    Std    Min  Failure
Flooding    1.754   1.91   0    28.8%     9.308   6.44   1    0.0%      18.192  11.04  5    0.0%
RW          2.124   1.38   0    10.8%     10.860  3.21   2    0.0%      21.764  4.54   10   0.0%
Uniform     2.676   1.52   0    6.6%      13.496  3.26   5    0.0%      27.274  4.87   15   0.0%

Note: Topology of five clusters for a total of 200K peers. ë n äCå�æ ì$ì+é�í .

be largely unaffected. In fact, in the worst case, where the topology is not changing, re-issuing a flooding query any number of times does not find new copies. On the other hand, according to (7) and (9), increasing the length of the random walk by the same factor has substantial impact.

The performance of the different algorithms for topologies of various sizes and for various content popularities is given in Table III. Our experimental methodology is as follows: Each peer initiates a searching request and waits for the results. Then, we change the topology by performing rewiring operations on 2% of the links. Then, each peer initiates a new searching request, and this process continues 4 times. In the end, we count the number of distinct items found for each peer.

Table III indicates that random walks have better performance than flooding with respect to both the average number of hits and the probability of failure. The average number of hits for random walks was roughly three times the corresponding number for flooding. Also, the failure probability dropped substantially. This large performance improvement was expected since, even though we change a certain number of links, the overall topology remains relatively stable and successive flooding searches do not result in many new items found. On the other hand, a prolonged random walk, or successive random walks from the same peer, follow totally different sampling paths and have better chances of locating new copies of the requested item.

The performance of successive searches depends on the number of topology changes that take place between the consecutive searches. We have studied this effect and report the results in Table IV. We observe that the performance of flooding improves as the rate of topological changes increases. For very fast rates of change (around 40% of the links at each step in our experiment), the performance of flooding becomes comparable to that of random walks, since effectively the neighborhood of each node changes almost completely between consecutive searches. On the other hand, the performance of random walk remains relatively unaffected by the changes in the topology.

TABLE IV
PERFORMANCE OF SEARCHING IN DYNAMIC TOPOLOGIES AS A FUNCTION OF THE RATE OF CHANGES.

Links       Flooding                   Random Walk
Changed     Mean    Std    Failure     Mean    Std    Failure
2%          0.488   0.67   60.6%       1.398   1.14   24.6%
4%          0.644   0.82   53.0%       1.382   1.11   23.6%
10%         0.888   0.86   38.0%       1.450   1.11   21.4%
20%         1.162   0.99   27.6%       1.456   1.12   20.8%
40%         1.460   1.12   20.0%       1.378   1.13   23.8%

E. Real Topologies and Topologies with Power-Law Statistics

In the previous analysis we have experimented on flat regular graphs. Similar results hold for topologies with heavy-tailed statistics as well as for real topologies. In Figure 3, we show the distribution of hits for a topology of 500K nodes generated with the model of growth with preferential connectivity [30], and for a real Gnutella topology taken from [26]. Again, observe that the distribution of hits in the case of random walk is more concentrated around the mean compared to flooding. Indeed, in the real topology, the mean number of hits, the standard deviation and the failure rate were 0.514, 2.15 and 81%, respectively, for flooding, and 0.538, 0.73 and 59% for random walk. Similar results apply for the graph grown with preferential connectivity.

Observe that the very small TTL used for flooding in the case of the real topology was due to the small size of the topology. Increasing the TTL to 3 would have resulted in reaching almost half of the nodes with flooding and would have skewed the statistics. It is more realistic to expect that searching visits only a small portion of the graph.

In conclusion, we expect that our results apply to all graphs with good expansion properties, as suggested by the theoretical section. In addition, we expect that typical networks, like P2P networks, have good expansion properties; otherwise, they would not have scaled easily from a few tens of thousands to a few million nodes.


TABLE III
PERFORMANCE OF SEARCHING IN DYNAMIC TOPOLOGIES.

A. Flooding
Size    Popularity 0.01%              Popularity 0.05%              Popularity 0.10%              λ2
(K)     Mean    Std    Min  Failure   Mean    Std    Min  Failure   Mean    Std    Min  Failure
100     0.488   0.67   0    60.6%     2.294   1.45   0    8.6%      4.566   2.08   0    1.4%      0.79
300     0.412   0.64   0    66.2%     2.350   1.57   0    9.4%      4.918   2.26   0    0.4%      0.79
500     0.550   0.75   0    58.6%     2.562   1.68   0    8.8%      4.992   2.27   0    0.6%      0.79
1000    0.494   0.62   0    60.0%     2.684   1.72   0    8.7%      5.032   2.31   0    0.2%      0.80

B. Random Walk
Size    Popularity 0.01%              Popularity 0.05%              Popularity 0.10%              λ2
(K)     Mean    Std    Min  Failure   Mean    Std    Min  Failure   Mean    Std    Min  Failure
100     1.398   1.14   0    24.6%     7.058   2.40   1    0.0%      14.076  3.30   5    0.0%      0.79
300     1.436   1.12   0    20.2%     7.396   2.62   1    0.0%      14.894  3.87   5    0.0%      0.79
500     1.562   1.19   0    19.8%     7.634   2.78   1    0.0%      15.152  3.88   7    0.0%      0.79
1000    1.518   1.20   0    22.4%     7.544   2.71   1    0.0%      14.982  3.91   4    0.0%      0.80

C. Uniform Sampling
Size    Popularity 0.01%              Popularity 0.05%              Popularity 0.10%              λ2
(K)     Mean    Std    Min  Failure   Mean    Std    Min  Failure   Mean    Std    Min  Failure
100     1.734   1.18   0    13.8%     8.634   2.83   1    0.0%      17.210  3.75   7    0.0%      0.79
300     1.864   1.23   0    12.6%     9.212   2.90   1    0.0%      18.496  4.07   9    0.0%      0.79
500     1.872   1.38   0    15.2%     9.486   2.98   2    0.0%      18.924  4.21   10   0.0%      0.79
1000    1.876   1.34   0    14.2%     9.482   3.14   0    0.2%      18.850  4.34   8    0.0%      0.80

[Figure 3, panel A: Growth with preferential connectivity model with 500K nodes (TTL=4). Panel B: Sample from the Gnutella network with 36K nodes (TTL=2). Both panels plot the number of hits (y-axis) for each of 500 search requests in sorted order (x-axis); curves: Flooding, Random Walk, Uniform.]

Fig. 3. Performance of searching in A) a network with heavy-tailed statistics, and B) a real topology. The small TTL in the real topology is due to the fact that using a larger TTL would have resulted in reaching almost half of the nodes with flooding, which is unrealistic.

IV. CONSTRUCTION

In this section we turn our attention to the construction and maintenance of well connected P2P topologies. Following the spirit of the previous sections, as well as the work of [16], [2], [17], [19], we translate good connectivity to good expansion, conductance, and separation of $\lambda_2$ from 1. We further translate the fact that the construction concerns a P2P network to the following conditions:

(i) Peers arrive and leave the network dynamically.
(ii) The algorithm must be strongly or weakly decentralized. By strong decentralization we mean that there is no central server. In weak decentralization there is a constant number of central servers. However, the computational resources of each central server are of the same order of magnitude as those of an average peer. In other words, a typical peer can simulate the behavior of a central server.
(iii) We wish to achieve low network overhead, in terms of messages, per addition or deletion. In the rest of this section we deal only with additions. Deletions can be handled as in [2].

In paragraph IV-A we shall revisit the known randomized algorithms for constructing sparse graphs with good expansion properties, without being concerned about conditions (i) and (ii). We call these baseline constructions. This is helpful because all known schemes for constructing expander graphs under conditions (i) and (ii) explicitly simulate a baseline construction.

All known baseline algorithms construct an expander essentially by choosing the edges incident to a vertex uniformly at random, and independently for each vertex. This facilitates the probabilistic arguments that are subsequently used to establish expansion. We review the baseline probabilistic argument in Theorem 4.1.

Thus, when constructing an expander on $n$ vertices, a baseline construction uses $O(\log n)$ random bits per edge. In a distributed setting, when vertices arrive dynamically and a new vertex needs to extend an edge to an existing vertex, one may use $O(\log n)$ steps of a random walk on the existing graph to find a random existing vertex, thus simulating the baseline construction with $O(\log n)$ message overhead on the network. This is the approach of [2]. ([16] use the fact that deletions happen in a random way, and they use the "randomness" of deletions to connect new vertices.)

In paragraph IV-B, Theorem 4.2, we show, analytically, that a certain baseline construction achieves non-trivial expansion properties using a constant number of random bits per new edge. When translated to overhead in network resources, this gives a heuristic to construct an expander with constant message overhead per new vertex. In particular, instead of taking $O(\log n)$ steps of a random walk per newly arriving vertex, we keep a constant number of daemons which continuously simulate a random walk on the existing network, and we use the vertices visited by the daemons every $c$ steps as sample points, where $c$ is a constant, without waiting $O(\log n)$ steps until the daemon randomizes. In paragraph IV-C we report that, in experiment, the method achieves constant separation of $\lambda_2$ from 1 for $c$ as small as 1 (sampling consecutive vertices visited by the daemons). The eigenvalue gap depends on $c$ and on the average degree of the constructed graph, but does not depend on the size of the constructed graph, for $n$ as large as 5M vertices. Note that the current size of Kazaa is thought to be 2M to 4M. (Performing experiments for more than 5M vertices was stressing the memory limitations of our machines.)

A. Baseline Construction of Expander Graphs

$G_{\mathrm{rand}}$ is the following construction: On input $n$, the number of vertices, and $d$, the number of edges chosen by every vertex, each vertex independently picks $d$ vertices independently and uniformly at random among the set of all vertices, and connects with an (undirected) edge to each one of these vertices. Thus the total number of edges is $nd$ and the expected degree of a vertex is $2d$. (It can be easily seen, using (4), that all vertices will have degree at most $O(\log n)$, almost surely.)

When we insist on all vertices having the same degree, we simulate $G_{\mathrm{rand}}$ by picking random perfect matchings or random Hamilton cycles. In particular, $G_M$ is the following construction: Pick $d$ perfect matchings on $n$ vertices independently and uniformly at random (assume w.l.o.g. that $n$ is even), and consider the union of these perfect matchings. Finally, $G_H$ is the following construction: Pick $d$ Hamilton cycles on $n$ vertices independently and uniformly at random, and consider the union of these Hamilton cycles.
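The first baseline construction, in which every vertex independently picks a fixed number of uniformly random vertices, can be sketched in a few lines. The function name `g_rand` and the parameter names `n` (number of vertices) and `d` (picks per vertex) are labels chosen for this illustration; self-loops and parallel edges are kept, since the description above does not exclude them.

```python
import random

def g_rand(n, d, rng):
    """Each vertex picks d vertices uniformly at random and links to each of them."""
    adj = [[] for _ in range(n)]
    for u in range(n):
        for _ in range(d):
            v = rng.randrange(n)     # uniform over all vertices, independent picks
            adj[u].append(v)
            adj[v].append(u)
    return adj

adj = g_rand(1000, 3, random.Random(0))
total_edges = sum(len(a) for a in adj) // 2
avg_degree = sum(len(a) for a in adj) / len(adj)
max_degree = max(len(a) for a in adj)
```

Every pick adds one edge, so the edge count is exactly the number of vertices times the number of picks, and the average degree is twice the number of picks; individual degrees fluctuate, which is why the matching and Hamilton-cycle variants are used when a regular graph is required.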

Theorem 4.1 (Folklore.): Let $G(V,E)$ be the graph constructed by $G_{\mathrm{rand}}$. $G(V,E)$ is an expander with high probability. In particular, there is a positive constant $\alpha < 1$ such that

$$\Pr\left[ \min_{S \subset V,\ |S| \le |V|/2} \frac{|\partial(S)|}{|S|} \ge \alpha \right] \ge 1 - o(1). \tag{12}$$

Proof: For a positive constant $\alpha$, we say that a set of vertices $S$ with $|S| \le |V|/2$ is Bad if and only if $|\partial(S)| \le \alpha |S|$. We will show that there exists a positive constant $\alpha$ such that

$$\Pr[\exists\ \text{Bad}\ S] \le o(1). \tag{13}$$

The left hand side of (13) is at most

$$\sum_{k=1}^{n/2} \Pr[\exists\ \text{Bad}\ S,\ |S| = k]. \tag{14}$$

Let us fix $k$ in the above range. There are at most $\binom{n}{k}$ sets of vertices of cardinality $k$. We hence need to bound

$$\sum_{k=1}^{n/2} \binom{n}{k} \Pr[\text{a fixed set } S,\ |S| = k, \text{ is Bad}]. \tag{15}$$

We may now assume that the set $S$ is fixed. Let $T \subseteq S$ be the set of vertices in $S$ that chose to connect to vertices in $\bar{S}$. In order for $S$ to be Bad, the cardinality $|T|$ must be at most $\alpha k$. For each cardinality in the range 0 to $\alpha k$, there are at most $\binom{k}{\alpha k}$ possibilities for $T$, and there are at most $\alpha k$ possibilities for the cardinality of $T$. We may now assume that the set $T$ is also fixed. Finally, for fixed $S$ and $T$, the probability that an edge picked by a vertex in $S \setminus T$ connects to a vertex in $S$ is at most $k/n$, and there are $dk(1-\alpha)$ such edges. We may now write

$$\Pr[\text{fixed } S,\ |S| = k, \text{ is Bad}] \le \alpha k \binom{k}{\alpha k} \left(\frac{k}{n}\right)^{dk(1-\alpha)}. \tag{16}$$

Combining (14), (15) and (16), and using $\left(\frac{n}{k}\right)^k \le \binom{n}{k} \le \left(\frac{ne}{k}\right)^k$, (13) gives:

$$\begin{aligned}
\Pr[\exists\ \text{Bad}\ S] &\le \sum_{k=1}^{n/2} \Pr[\exists\ \text{Bad}\ S,\ |S| = k] \\
&\le \sum_{k=1}^{n/2} \binom{n}{k}\, \alpha k \binom{k}{\alpha k} \left(\frac{k}{n}\right)^{dk(1-\alpha)} \\
&\le \sum_{k=1}^{n/2} \left(\frac{ne}{k}\right)^{k} \alpha k \left(\frac{ke}{\alpha k}\right)^{\alpha k} \left(\frac{k}{n}\right)^{dk(1-\alpha)} \\
&\le \sum_{k=1}^{n/2} c'^{\,k} \left(\frac{k}{n}\right)^{k(d(1-\alpha)-1)}
\end{aligned} \tag{17}$$

where $c' = c'(\alpha)$ is a constant. Now the last line of (17) is bounded by $o(1)$ if every term is bounded by $o(n^{-1})$, which is true for any sufficiently large constant multiple $d$ of $(1-\alpha)^{-1}$, since $\alpha < 1$.

B. Baseline Construction of Expanders with Constant Overhead in Random Bits

Procedure $G_{\mathrm{rand}}$ assumes that when a vertex chooses $d$ other vertices to connect to, these choices are independent and uniformly distributed in the integers 1 to $n$. Now consider the following pseudorandom number generator: A constant degree expander graph $X$, with second eigenvalue $\lambda_2$, is imposed over a set of $n$ points labeled with the numbers 1 to $n$. We start a random walk on $X$ from a point chosen uniformly at random, and, whenever the algorithm $G_{\mathrm{rand}}$ needs a random number, we feed it with the current point of the random walk on $X$. We call this algorithm $\tilde{G}_{\mathrm{rand}}$. Note that the random choices of $\tilde{G}_{\mathrm{rand}}$ are highly correlated, since they result from consecutive vertices visited by the random walk on $X$. Nevertheless, we are able to establish a non-trivial expansion property for $\tilde{G}_{\mathrm{rand}}$, which essentially describes constant expansion of relatively large subsets of vertices. Our proof follows the same probabilistic argument as Theorem 4.1, except that we use the bound (9) to bound the probability of correlated bad events:

Theorem 4.2: Let $G(V,E)$ be a graph constructed by $\tilde{G}_{\mathrm{rand}}$. There are positive constants $\alpha$ and $\varepsilon$, $0 < \varepsilon < 1/2$, such that any subset $S$ of at least $\varepsilon |V|$ and at most $|V|/2$ vertices has cutset expansion $\alpha$, almost surely. In particular,

$$\Pr\left[ \min_{S \subset V,\ \varepsilon |V| \le |S| \le |V|/2} \frac{|\partial(S)|}{|S|} \ge \alpha \right] \ge 1 - o(1). \tag{18}$$

Proof: The reasoning is identical to the proof ofTheorem 4.1, up to (16). At this point we use (9): Whatis the probability that |0�G�'� 3¨@t subsequent points of therandom walk on

"corresponded to ending up inside½ , while the probability of falling outside ½ is at least�#�Õ3h�% S�2�\u4����� ? We apply (9) with do�&� and W)�X�����

and get

T>U . fixed ½ Q s ½ s$�)� , is Bad/ Z5@ï� � �@��#� Çi j�$ l ÈqÉ$Ë&% Í'#w ¢ O j)( n £ 8(19)

Now the final calculations becomeT�U .�þKÿ*�&� ½ /Z _ � û ¬� a ! � � � � � @ï�+� �D � � Çi j�$ l ÈÊÉ�Ë&% Í'#w ¢ O j)( n £Z _ � û ¬� a ! � �,� �� � � @ï� À-� �D � Á D � Çi j $ l ÈÊÉ�Ë&% Í'#w ¢ O j)( n £Z _ � û ¬� a ! � i j � � i $ l ÈÊÉ�Ë&% Í'#w ¢ O j)( n £(20)

where $c' = c'(\alpha, \varepsilon)$ is a constant, and the last line of (20) is bounded by $o(1)$ for some constant $d = d(\alpha, \varepsilon)$. Note that $k \ge \varepsilon n$ is crucial to bound $ne/k$ by $c'$.
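The pseudorandom generator of paragraph IV-B can be sketched as follows. This is an illustration under stated assumptions, not the construction analyzed in Theorem 4.2: the function names are hypothetical, points are labeled from 0 instead of 1, and the constant degree expander imposed on the labels is assumed to be a union of three random Hamilton cycles.

```python
import random

def hamilton_union(n, d, rng):
    """Stand-in for the constant degree expander on the labels 0..n-1."""
    adj = [[] for _ in range(n)]
    for _ in range(d):
        order = list(range(n))
        rng.shuffle(order)
        for k in range(n):
            u, v = order[k], order[(k + 1) % n]
            adj[u].append(v)
            adj[v].append(u)
    return adj

def walk_samples(x_adj, rng):
    """Yield consecutive points of a random walk; each step needs O(1) random bits."""
    u = rng.randrange(len(x_adj))        # uniform starting point
    while True:
        yield u
        u = rng.choice(x_adj[u])         # one step along the expander

def g_rand_from_walk(n, d, samples):
    """Baseline construction, with every random pick replaced by the next walk point."""
    adj = [[] for _ in range(n)]
    for u in range(n):
        for _ in range(d):
            v = next(samples)
            adj[u].append(v)
            adj[v].append(u)
    return adj

rng = random.Random(0)
n = 500
x_adj = hamilton_union(n, 3, rng)
samples = walk_samples(x_adj, rng)
first = [next(samples) for _ in range(5)]    # consecutive, hence correlated, samples
adj = g_rand_from_walk(n, 3, samples)
```

A fresh uniform pick among `n` labels costs about `log2(n)` random bits, whereas one walk step chooses among a constant number of neighbors; this is the saving in random bits that the construction section translates into a saving in messages.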

C. Distributed Construction of Expanders with Constant Overhead on Network Resources

In this paragraph, we study how the concept of paragraph IV-B can be used to speed up the approach of [2]. We examine two algorithms.

1) $\tilde{G}_H$: This is an extension of the scheme proposed in [2]. The authors propose a scheme to implement the $G_H$ construction in a distributed, decentralized environment. To ensure random placement of each arriving node

TABLE V
$\lambda_2$ OF $\tilde{G}_H$ AS A FUNCTION OF SIZE AND NUMBER OF RANDOM WALK STEPS.

Size                Number of random walk steps c
(K)      Random     log n     5         3         1         0
10       0.7455     0.7445    0.7452    0.7506    0.7906    0.9999
50       0.7453     0.7457    0.7459    0.7508    0.7924    0.9999
100      0.7452     0.7453    0.7460    0.7504    0.7938    0.9999
500      0.7453     0.7453    0.7463    0.7504    0.7944    0.9999
1000     0.7453     0.7452    0.7462    0.7503    0.7956    0.9999
5000     0.7453     0.7454    0.7462    0.7504    0.8023    0.9999

[Figure 4: connectivity matrix, nodes (x-axis) versus nodes (y-axis).]

Fig. 4. The connectivity matrix of a topology constructed using $\tilde{G}_H$ for $c = 1$. The strong dependencies are reflected in the concentration along the diagonal. However, there are many points away from the diagonal and the picture appears random.

in each of the $d$ cycles, they propose techniques to estimate the size of the network $n$ and then perform $d$ random walks of length $O(\log n)$; thus their overhead is $O(\log n)$.

Instead, we keep $d$ daemons, one for each Hamilton cycle. These daemons move freely in the topology. When a new node arrives, it contacts the daemon associated with the $i$-th Hamilton cycle, for $1 \le i \le d$, and inserts itself between the peer that currently hosts daemon $i$ and one of its two neighbors in cycle $i$. We require that the daemons perform $c$ steps before allowing a new peer to join the topology. Observe that in [2] $c$ is $O(\log n)$. In fact, when $c$ is $O(\log n)$, the daemons are allowed to start from any node of the topology, and this makes the algorithm of [2] fully decentralized.
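The daemon-based insertion into Hamilton cycles can be sketched as follows. Everything named here is an assumption made for illustration: `d` is the number of cycles, `c` the number of daemon steps between arrivals, the network starts from three seed peers, a new peer is always inserted between the host and its successor in the cycle, and all daemons move before the new peer is linked in.

```python
import random

def grow_cycles(n_final, d, c, rng):
    """Grow d Hamilton cycles; daemon i picks the insertion point in cycle i."""
    succ = [{0: 1, 1: 2, 2: 0} for _ in range(d)]   # successor maps, one per cycle
    pred = [{1: 0, 2: 1, 0: 2} for _ in range(d)]   # predecessor maps
    daemons = [0] * d                                # current host of each daemon
    for new in range(3, n_final):
        # Each daemon takes c random-walk steps on the whole topology.
        for i in range(d):
            for _ in range(c):
                u = daemons[i]
                nbrs = [s[u] for s in succ] + [p[u] for p in pred]
                daemons[i] = rng.choice(nbrs)
        # The new peer joins every cycle next to the peer hosting that cycle's daemon.
        for i in range(d):
            host = daemons[i]
            nxt = succ[i][host]
            succ[i][host], succ[i][new] = new, nxt
            pred[i][new], pred[i][nxt] = host, new
    return succ, pred

def cycle_length(s):
    """Number of peers met when following successors from peer 0 back to peer 0."""
    u, k = s[0], 1
    while u != 0:
        u, k = s[u], k + 1
    return k

succ, pred = grow_cycles(200, 3, 1, random.Random(0))
```

The cost per arrival is `d * c` daemon steps plus a constant number of link updates per cycle, independent of the network size, in contrast to the logarithmic walk length of [2].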

We measure the goodness of the constructed topology by the second eigenvalue of the corresponding transition matrix; see Table V and Figure 4. Table V shows that $\lambda_2$ remains essentially constant as the topology scales from 10K to 5M nodes. On the other hand, $\lambda_2$ depends on $c$. However, notice that for $c = 1$, $\lambda_2$ is larger only by about 0.05 compared to $c = \log n$. (As a sanity test, when $c$ is 0, that is, without randomization, there is no expansion.) $\lambda_2$ also depends on $d$. In Table V we give the case of $d = 3$, that is, degree 6. The trends are identical for larger values of $d$.
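For small regular graphs, the second eigenvalue of the random-walk transition matrix can be estimated with plain power iteration. The sketch below is an assumption-laden illustration and not the numerical method used to produce Table V: it requires a regular graph given as adjacency lists, takes "second eigenvalue" to mean the second largest eigenvalue in the algebraic sense, and uses the lazy walk only as a device to make power iteration converge.

```python
import math
import random

def second_eigenvalue(adj, iters=500, seed=0):
    """Estimate the second largest eigenvalue of P = A / degree for a regular graph."""
    n, deg = len(adj), len(adj[0])
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    mu = 0.0
    for _ in range(iters):
        mean = sum(x) / n
        x = [xi - mean for xi in x]               # project out the all-ones eigenvector
        norm = math.sqrt(sum(xi * xi for xi in x))
        x = [xi / norm for xi in x]
        # Lazy walk (I + P) / 2 has eigenvalues in [0, 1], so its largest remaining
        # eigenvalue dominates the iteration.
        y = [0.5 * x[u] + 0.5 * sum(x[v] for v in adj[u]) / deg for u in range(n)]
        mu = sum(a * b for a, b in zip(x, y))     # Rayleigh quotient of the lazy walk
        x = y
    return 2.0 * mu - 1.0                          # map back from lazy walk to P

cycle5 = [[(u - 1) % 5, (u + 1) % 5] for u in range(5)]
k4 = [[v for v in range(4) if v != u] for u in range(4)]
lam_cycle = second_eigenvalue(cycle5)   # exact value is cos(2*pi/5)
lam_k4 = second_eigenvalue(k4)          # exact value is -1/3
```

For graphs with millions of nodes, as in Table V, a sparse eigenvalue solver would be the practical choice; the sketch is only meant to make the measured quantity concrete.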

2) $\tilde{G}_M$: The existence of Hamilton cycles in $G_H$ and $\tilde{G}_H$ is a good property since it guarantees connectivity.


TABLE VI
$\lambda_2$ OF $\tilde{G}_M$ AS A FUNCTION OF SIZE, DEGREE d AND NUMBER OF RANDOM WALK STEPS c.

Size     d=4       d=4       d=4       d=4       d=6       d=6       d=6
(K)      c=1       c=2       c=5       c=10      c=1       c=5       c=10
1        0.9754    0.8982    0.8711    0.8600    0.7782    0.7385    0.7392
10       0.9893    0.9131    0.8732    0.8654    0.7854    0.7468    0.7443
50       0.9939    0.9144    0.8777    0.8670    0.8015    0.7471    0.7450
100      0.9929    0.9312    0.8925    0.8673    0.8273    0.7470    0.7456
500      0.9969    0.9482    0.8833    0.8679    0.8332    0.7472    0.7454
1000     0.9995    0.9421    0.8861    0.8679    0.8287    0.7476    0.7455
5000     0.9996    0.9504    0.8846    0.8677    0.8348    0.7473    0.7454

Maintaining Hamilton cycles, however, is difficult. In this paragraph, we consider a distributed implementation of the $G_M$ algorithm which does not require the existence of a special structure in the topology and thus is easier to implement. On the other hand, the price paid is an increased second eigenvalue. Also, the second eigenvalue does not remain relatively constant with the size of the network, but it increases slightly as we increase the number of peers.

Our algorithm works as follows. Again, we maintain $d$ daemons. We model the arrival of a new node as the arrival of two nodes $u$ and $v$, each with degree $d$. Upon arrival, $u$ and $v$ contact the central server to discover the location of the $d$ daemons. Assume that daemon $i$ is located at node $x$, and that the $i$-th neighbor of $x$ is $y$. The connection between $x$ and $y$ is torn down; the new $i$-th neighbor of $x$ is $u$ and the new $i$-th neighbor of $y$ is $v$. Between consecutive arrivals, the daemons move $c$ steps.
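The matching-based arrival procedure can be sketched as follows. The names are hypothetical: `d` is the number of matchings (and hence the degree), `c` the number of daemon steps between arrivals, the `daemons` list plays the role of the central server that knows where each daemon sits, and the network is assumed to start from two peers matched to each other in every matching.

```python
import random

def grow_matchings(n_final, d, c, rng):
    """Grow d perfect matchings; peers arrive in pairs (u, v)."""
    match = [{0: 1, 1: 0} for _ in range(d)]   # match[i][p] is the i-th neighbor of p
    daemons = [0] * d                            # locations known to the central server
    for u in range(2, n_final, 2):
        v = u + 1
        # Between arrivals every daemon moves c steps along randomly chosen links.
        for i in range(d):
            for _ in range(c):
                daemons[i] = match[rng.randrange(d)][daemons[i]]
        # Daemon i sits at x, whose i-th neighbor is y: replace link x-y by x-u and y-v.
        for i in range(d):
            x = daemons[i]
            y = match[i][x]
            match[i][x], match[i][u] = u, x
            match[i][y], match[i][v] = v, y
    return match

match = grow_matchings(200, 4, 2, random.Random(0))
```

Each arrival rewrites one link per matching, so every peer keeps exactly one neighbor per matching and no cycle structure has to be maintained.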

The performance of the topology constructed by our algorithm, measured in terms of $\lambda_2$, for various sizes of the topology and various values of $d$ and $c$, is given in Table VI. Observe that the second eigenvalue in this construction is a function of the degree, the number of steps and the size of the topology. However, for degree 6 and for $c = 1$, the second eigenvalue for 5M nodes is 0.8348, which is comparable to the corresponding entry in Table V, which is 0.8023. The interpretation is that for current sizes of the network and for degree at least 6, both methods achieve comparably good results.

V. SUMMARY

In this paper, we focus on the power of sampling using random walks in peer-to-peer networks. We provide theoretical justification in support of random walks as a primitive operation for P2P networks.

We argue that, in the context of searching, random walks are superior to flooding in two cases of practical interest. Related work was previously done in [1].

For construction of dynamic P2P topologies, we use random walks to add new peers with constant overhead. This improves the algorithm of [2].

Open problems: 1) The construction algorithm is weakly decentralized; we wish to make it strongly decentralized. We have reduced deletions to [2]; can we handle them more effectively? Theorem 4.2 shows expansion of large sets; can we extend it to small sets? 2) [29] gives a model for P2P networks with several parameters (capacities, etc.). How do these parameters affect the performance of random walks? 3) We have suppressed many implementation details. In practice, we expect adaptations of the random walk methods and, in general, a hybrid between random walks and other methods.

REFERENCES

[1] Q. Lv, P. Cao, E. Cohen, K. Li, and S. Shenker, "Search and replication in unstructured peer-to-peer networks," in International Conference on Supercomputing (ICS'02). ACM, 2002.
[2] C. Law and K.-Y. Siu, "Distributed construction of random expander networks," in Proc. Infocom. IEEE, 2003.
[3] M. Ajtai, J. Komlos, and E. Szemeredi, "Deterministic simulation in logspace," Proc. 19th ACM Symp. on Theory of Computing (STOC), 1987.
[4] D. Aldous, "On the Markov chain simulation method for uniform combinatorial distributions and simulated annealing," Probab. Engrg. Inform. Sci., 1, pp. 33-46, 1987.
[5] R. Impagliazzo and D. Zuckerman, "How to recycle random bits," Proc. 30th IEEE Symp. on Foundations of Computer Science (FOCS), 1989.
[6] D. Gillman, "A Chernoff bound for random walks on expander graphs," Proc. 34th IEEE Symp. on Foundations of Computer Science (FOCS93); SIAM J. Comp., vol. 27, no. 4, 1998.
[7] W. Feller, An Introduction to Probability Theory and Its Applications, vol. I, John Wiley & Sons, 3rd edition, 1968.
[8] N. Alon and J. Spencer, The Probabilistic Method, John Wiley & Sons, 2000.
[9] J.H. Wilkinson, The Algebraic Eigenvalue Problem, Oxford Science Publications, 1965.
[10] G.H. Golub and C.F. Van Loan, Matrix Computations, Johns Hopkins University Press, 1989.
[11] A. Sinclair, Algorithms for Random Generation and Counting: A Markov Chain Approach, Springer-Verlag, 1997.
[12] A. Broder and A. Karlin, "Bounds on the cover time," J. Theoretical Probab., 2:101-120, 1989.
[13] D. Aldous and J. Fill, "Reversible Markov chains and random walks on graphs," Monograph, http://stat-www.berkeley.edu/users/aldous/book.html, 2002.
[14] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener, "Graph structure in the web," 9th International World Wide Web Conference (WWW9)/Computer Networks, vol. 33, no. 1-6, pp. 309-320, 2000.
[15] S. Dill, R. Kumar, K. McCurley, S. Rajagopalan, D. Sivakumar, and A. Tomkins, "Self-similarity in the web," in International Conference on Very Large Data Bases, Rome, 2001, pp. 69-78.
[16] G. Pandurangan, P. Raghavan, and E. Upfal, "Building low-diameter P2P networks," in IEEE Symposium on Foundations of Computer Science, 2001, pp. 492-499.
[17] C. Gkantsidis, M. Mihail, and A. Saberi, "Conductance and congestion in power law graphs," in Proc. SigMetrics. ACM, 2003.
[18] F. Chung, L. Lu, and V. Vu, "The spectra of random graphs with given expected degrees," Proceedings of National Academy of Sciences, vol. 100, no. 11, pp. 6313-6318, 2003.
[19] M. Mihail, C.H. Papadimitriou, and A. Saberi, "On certain connectivity properties of internet graphs," FOCS, 2003, to appear.
[20] A. Oram, Ed., Peer-to-Peer: Harnessing the Power of Disruptive Technologies, O'Reilly, first edition, 2001.
[21] D. Gibson, J.M. Kleinberg, and P. Raghavan, "Inferring web communities from link topology," UK Conference on Hypertext, 1998.
[22] C. Gkantsidis, M. Mihail, and E. Zegura, "Spectral analysis of internet topologies," in Proc. Infocom. IEEE, 2003.
[23] R. Kumar, S. Rajagopalan, D. Sivakumar, and A. Tomkins, "Trawling the web for emerging cyber-communities," in WWW8/Computer Networks, 1999, vol. 31, pp. 1481-1493.
[24] T. Bu and D. Towsley, "On distinguishing between internet power law topology generators," in Proc. Infocom. IEEE, 2002.
[25] N. Leibowitz, M. Ripeanu, and A. Wierzbicki, "Deconstructing the Kazaa network," in Proc. WIAPP. IEEE, June 2003, pp. 112-120.
[26] Limewire.org, "Snapshots of the Gnutella network," http://crawler.limewire.org/data.html, 2002.
[27] M. Faloutsos, P. Faloutsos, and C. Faloutsos, "On power-law relationships of the internet topology," in Proc. SigComm. ACM, 1999.
[28] M. Jovanovic, F. Annexstein, and K. Berman, "Modeling peer-to-peer network topologies through small-world models and power laws," in IX Telecommunications Forum, 2001, http://www.ececs.uc.edu/˜mjovanov/Research/telfor.pdf.
[29] Y. Chawathe, S. Ratnasamy, L. Breslau, N. Lanham, and S. Shenker, "Making Gnutella-like P2P systems scalable," in Proc. SIGCOMM. ACM, 2003, to appear.
[30] A.-L. Barabasi and R. Albert, "Emergence of scaling in random networks," Science, vol. 286, pp. 509-512, 1999.