topology formation and replication strategies for gnutella ...pabitra/facad/06cs6019t.pdfa brief...
TRANSCRIPT
Topology Formation andReplication Strategies for
Gnutella Network
Santosh Kumar Shaw
Topology Formation andReplication Strategies for
Gnutella Network
Thesis submitted in partial fulfillment of the requirementsfor the degree of
MASTER OF TECHNOLOGY
by
Santosh Kumar Shaw
Under the supervision of
Prof. Niloy Ganguly
Department of Computer Science and EngineeringIndian Institute of Technology, Kharagpur
West Bengal, India 721302May 2008
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur
Kharagpur, India 721302.
Certificate
This is to certify that the thesis entitled Topology Formation and Replication
Strategies for Gnutella Network, submitted by Santosh Kumar Shaw, to
the Department of Computer Science and Engineering, Indian Institute of Tech-
nology, Kharagpur, India, in partial fulfillment for the award of the Master of
Technology, is a record of an original research work carried out by his under
my supervision and guidance. The thesis fulfills all requirements as per the reg-
ulations of this Institute and in our opinion has reached the standard needed for
submission. Neither this thesis nor any part of it has been submitted for any
degree or academic award elsewhere.
Prof. Niloy Ganguly
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur
India 721302.I.I.T. Kharagpur
5th May, 2008
Acknowledgment
I would like to express my deep gratitude to Prof. Niloy Ganguly for his invaluable
guidance, help and encouragement throughout my work. I would like to take this
opportunity to my sincere and profound gratitute to him for the pains he has
taken for the project. I have learned many many things in life from him.
I am grateful to all the faculty members of the CSE Department for their teaching,
support and encouragement. I am particularly indebted to Prof. Arijit Bishnu,
Prof. Indranil Sengupta and Prof. Arobinda Gupta for their guidance and moti-
vation. I wish to thank the Software Laboratory staff and all the secretarial staff
of the CSE Department for their sympathetic co-operation. I am grateful to all
my friends in the Department of CSE for their help and encouragement.
I give my heartfelt thanks to my parents and my in-laws for their constant support.
Without their help and blessings this venture would never have been possible.
Finally, I thank my wife, Pallavi, whose unending help, encouragement, inspiration
and love made me pursue my dreams. Without her, this would simply not have
happened. I dedicate this thesis to her.
Santosh Kumar Shaw
Computer Science and Engineering
Indian Institute of Technology –Kharagpur INDIA
Abstract
Gnutella-like networks generate enormous amount of redundant messages in flood-
based search due to their topology structure and therefore lose scalability.
In this thesis, we propose a completely distributed handshake protocol (named
HPC5) to form an efficient topology for Gnutella-like networks. The protocol
directs each peer to select neighbors in such a way that any cyclic path present in
the overlay network will have a minimum length of 5.
We show that our approach can be deployed into the existing Gnutella network
without disturbing any of its parameters. Simulation results signify that HPC5 is
very effective for Gnutella’s dynamic query search over limited flooding. Structural
analysis indicates that the proposed network is as robust as existing Gnutella
network.
Limited flood-based search with lower TTL decreases the chance of getting
proper results, consequently decreases the number of results. To increase the num-
ber of results (query hits) we evaluate some of the index table replica placement
strategies and compared our replica placement technique against two existing ap-
proaches. Simulation shows that our placement technique is very useful in terms
of query hits, latency and search cost in our topological structure.
Hit ratio is one of the main obstacle in implementing our approach. Thus a
hit ratio based design approach is given to satisfy some network’s requirements.
Contents
List of Figures v
List of Tables vi
1 Introduction 1
2 Related Works 5
3 System Model 7
3.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.2 Gnutella 0.6 Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.3 Simulated Gnutella Network . . . . . . . . . . . . . . . . . . . . . . 11
4 Handshake Protocol for Cycle-5 Networks 12
4.1 Handshake Protocol: HPC5 . . . . . . . . . . . . . . . . . . . . . . 12
4.2 Hurdles in Implementing the Scheme . . . . . . . . . . . . . . . . . 14
4.2.1 Compatibility with the Current Population of Gnutella . . . 15
4.2.2 Hit Ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.2.3 Consistency Problem . . . . . . . . . . . . . . . . . . . . . . 19
4.3 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 23
4.4 Network Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.5 Comparison with Zhu’s work . . . . . . . . . . . . . . . . . . . . . . 28
CONTENTS iv
5 Index Table Replication 30
5.1 Replication in Gnutella Network . . . . . . . . . . . . . . . . . . . . 31
5.2 Replica Placement Techniques . . . . . . . . . . . . . . . . . . . . . 31
5.3 Search Technique (Flooding) in Cycle-5 Networks with Replication . 32
5.4 Replica Update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.5 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.5.1 RPT Selection: . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.5.2 Number of Replica Selection . . . . . . . . . . . . . . . . . . 37
6 Design Issues 38
7 Conclusion and Future work 41
7.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
7.2 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
List of Figures
1.1 Effect of topology structure and replication on limited flood based
search. The number inside the circle represents the TTL value
required to reach that node from start node S in flooding. . . . . . 3
4.1 Selection of neighbor by an ultrapeer-1 after making peer-2 as a
neighbor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.2 A part of an Ultra-peer layer, where a node represents all nodes
that are present in that level. Like, Q represents all 1st neighbors
of P . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.3 Theoretical and empirical hit ratio of a peer against number of peers
in the network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.4 A part of cycle-5 network, representing parallel update inconsistency 20
4.5 Comparison on Network Coverage and Message Complexity . . . . . 25
4.6 Distribution of Higher order Clustering Coefficient . . . . . . . . . . 28
4.7 Comparison of Robustness between cycle-3 and cycle-5 Networks . . 29
5.1 Query Hit probability . . . . . . . . . . . . . . . . . . . . . . . . . . 36
6.1 Effects of rx in the network . . . . . . . . . . . . . . . . . . . . . . 40
List of Tables
3.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.2 Properties of Gnutella and Simulated Gnutella Network . . . . . . 11
4.1 Simulation Results for Search Performance . . . . . . . . . . . . . . 26
C H A P T E R 1
Introduction
A peer-to-peer (P2P) system is an overlay network that relies on the computing
resources available from all the participants in the network instead of concentrat-
ing on a small number of centralized servers. These networks are useful for many
purposes like file-sharing, streaming media, internet telephony, distributed com-
putation etc. Depending upon the topology formation, P2P networks are broadly
classified as unstructured and structured. An unstructured P2P network is formed
when the overlay links are established arbitrarily. Structured P2P networks ensure
that any peer can efficiently route a search to some peer that has the desired files,
even if the file is extremely rare. In these networks, the placement of resources
are very controlled and defined, which arise the necessity of a structured pattern
in overlay links.
Decentralized (fully distributed control), unstructured P2P networks (Gnutella,
FastTrack etc) are the most popular file-sharing overlay networks. The absence of
a structure and central control makes such systems much more robust and highly
self-healing compared to structured systems [9, 15]. But the main problem of these
2
kinds of networks is scalability due to generation of large number of redundant
messages during query search. Consequently as these networks are becoming more
popular the quality of service is degrading rapidly [6, 10].
To make the network scalable, Gnutella [1, 2, 3] is continuously upgrading it’s
features and introducing new concepts. All these improvements can be catego-
rized into two broad areas: improvements of search techniques and modification
of the topological structure of the overlay network to enhance search efficiency.
In enhanced search techniques, several improvements like Time-To-Live (TTL),
Dynamic query, Query-caching and Query Routing Protocol (QRP) have been in-
troduced. One of the most significant topological modification in unstructured
network was done by inducing the concept of super-peer (ultra peer) with a two-
tier network topology.
The basic search mechanism adhered by Gnutella is limited flooding [1, 2, 3].
In flooding, a peer, which searches for a file, issues a query and sends it to all of its
neighbor peers. The peer which receives the query forwards it to all its neighbors
except the neighbor from which it is received. By this way, a query is propagated
up to a predefined number of hops (TTL) from the source peer. The TTL followed
by Gnutella is generally 1 or 2 for popular searches. However, TTL(3) (numeric
value inside parenthesis represents the number of hops to search with. Henceforth,
we follow this notation) is initiated for rare searches.
Objective and Contributions
The main goal of this dissertation is to improve the scalability of the Gnutella
network by reducing redundant messages and increase the number of query hits
(results). One of the ways to achieve this is to modify the overlay network, so that
small size loops get eliminated from the overlay topology. The rationale behind
the proposition is explained through Fig. 1.1. In this figure, all three networks
have the same number of connections. With a TTL(2) flooding, the network in
3
Fig. 1.1(a) discovers 4 peers at the expense of 7 messages, whereas the network in
Fig. 1.1(b) discovers 6 peers without any redundant messages. This happens due
to the absence of any 3-length cycle in the network of Fig. 1.1(b). On generalizing,
we can say that for a TTL(r) flooding, networks devoid of cycles of length less than
(2r + 1) do not generate any redundant messages. In Fig. 1.1(c), the placement
of replica of index table (hash table of all the files the peer is sharing) is shown
using the dotted arrow lines. Like all node 2’s replicas are stored in node S. So
node S can access all node 2 using replicas in TTL(0) and S can access all other
nodes either directly or using replicas in TTL(1). As a result, whole network is
explored with TTL(1) flooding which reduces query latency.
S
1 1
22
3 3
4 4(a)
S
1 1
2 2
2 2
3
(b)
3
S
2 2
1 1
2 2
3 3
(c)
Figure 1.1: Effect of topology structure and replication on limited flood basedsearch. The number inside the circle represents the TTL value required to reachthat node from start node S in flooding.
We make five contributions in this thesis: (1) We propose a completely distrib-
uted handshake protocols (HPC5) which generates a cycle-5 network (a network
which does not have any cycles up to length (r − 1) is referred as cycle-r net-
work) and extended the algorithm for cycle-r (r > 3) networks. (2) We show
that our approach can be deployed into the existing Gnutella network without
disturbing any of its parameters. (3) Through simulation results, we show that
4
cycle-r networks are very effective for Gnutella’s dynamic query search over lim-
ited flooding. Structural analysis indicates that the cycle-5 network is as robust as
existing Gnutella network. (4) We propose and evaluate a new index table replica
placement technique for cycle-5 networks to improve query hits and reduce both
latency and search cost. (5) We analyze some design issues related to hit ratio of
the network.
Organization of the Thesis:
The rest of the thesis is organized as follows.
A brief overview of previous works done on topology formation and replication
strategies of unstructured P2P networks is presented in Chapter 2.
A model of our network environment is presented in Chapter 3. Mainly topol-
ogy, searching technique, replication technique and basic handshake protocol of
Gnutella network are described in this chapter.
Chapter 4 propose a completely distributed handshake protocol (HPC5) for
cycle-5 networks. We also extend the protocol for cycle-r networks. The problems
and compatibility of implementing our protocol in Gnutella network are described
in this chapter. Through simulations we evaluate the search performance and
structural property of the evolved networks to validate our approach.
Chapter 5 discuss some of the replication technique and propose a new repli-
cation techniques for cycle-5 networks.
In Chapter 6 we discuss hit ratio based design of cycle-5 networks and its
effects on performance.
Finally, Chapter 7 concludes the thesis and proposes some directions for future
work.
C H A P T E R 2
Related Works
Many algorithms exist in the literature which modify the topology in unstructured
P2P networks to solve the excessive traffic problem and improve query hits by
using replication techniques.
The structural mismatching between the overlay and underlying network topol-
ogy is alleviated by using location aware topology matching algorithms [7, 8].
A class of overlay topology based on distance between a node and its neighbors
in the physical network structure is presented in [11].
Papadakis et al. presented an algorithm to monitor the ratio of duplicated
message through each network connection and the node does not forward any
query through that connection whose ratio exceeds certain threshold [14].
Zhu et al. very recently presented a distributed algorithm in [18] to improve
the scalability of Gnutella like networks by reducing redundant messages. They
have pointed the same concept of elimination of 3 and 4-length cycles. However
this is demand driven and involves a lot of control overhead. Also it is not clear
how the algorithm will perform in the face of heavy traffic. The algorithm also
6
does not take care in preserving the Gnutella parameters (like degree distribution,
average peer distance, diameter, etc), hence robustness of the evolved network is
not maintained. In our work we take into considerations all the above aspects
and propose a holistic approach to topology formation. The algorithm initiates as
soon as a peer enters in the network rather than having it demand driven.
Qin et al. discussed on various alternatives of data replication, search tech-
niques and network topology for Gnutella network in [10]. The authors described
some replication techniques (uniform, proportional and square-root) depending
upon the query distribution. They also shown that random replication is better
than owner or path replication in Gnutella-like networks.
C H A P T E R 3
System Model and Notations
In this chapter we discuss about the basic Gnutella protocols and our simulated
Gnutella network. We give a list of notations in Section 3.1 for ready reference.
In Section 3.2 we describe the basic Gnutella 0.6 protocol which we have used as
our system model. Then we describe about our simulated Gnutella network in
Section 3.3.
3.1 Notations
Before describing the system model, we first list the main notations in table 3.1
that are going to be used in this thesis report for additional clarity and to avoid
any kind of confusion.
3.1 Notations 8
Table 3.1: Notations
TTL(r) Query search with TTL = rcycle-r A cycle of minimum length r
cycle-r network A network which does not have any cycle up to length (r − 1)cycle-3 network Gnutella network
N Total number of peers in the networkU Total number of ultra-peers in the networkL Total number of leaf-peers in the network
duu Avg. no. of ultra-neighbors of an ultra-peerdul Avg. no. of leaf-neighbors of an ultra-peerdlu Avg. no. of ultra-neighbors of a leaf-peerHk Hit ratio to select kth ultra-neighbor〈H〉 Average hit ratio of a peer〈Hev〉 Average evolved hit ratio of a peer
rth neighbor A peer at a distance of r hops. All immediate neighbors are1st neighbors, all neighbors of 1st neighbors are 2nd neighborsand so on.
3.2 Gnutella 0.6 Protocol 9
3.2 Gnutella 0.6 Protocol
The basic Gnutella [1, 2, 3] consists of a large collection of nodes that are assigned
unique identifiers and which communicate through message exchanges.
Topology: Gnutella 0.6 is a two-tier overlay network, consisting of two types
of nodes : ultra-peer and leaf-peer. An ultra-peer is connected with a limited
number of other ultra-peers and leaf-peers. A leaf-peer is also connected with
some ultra-peers. However, there is no direct connection between any two leaf-
peers in the overlay network. Yet another type of peer is called legacy-peers, which
are present in ultra-peer level and do not accept any leaves. In our model we are
not considering legacy-peers.
Search Technique: The network follows limited flood based query search. A
query of an ultra-peer is forwarded to its leaf-peers with TTL(0) and to all its
ultra-neighbors with one less TTL only when (TTL > 0). A leaf-peer does not
forward any received query. On the other hand ultra-peers perform query searching
on behalf of their leaf peers. The query of a leaf-peer is initially sent to its
connected ultra-peers. All the connected ultra-peers simultaneously forward the
query to their neighbor ultra-peers up to a limited number of hops. Since multiple
ultra-peers are initiating flooding, a leaf-peer’s query will produce more redundant
messages if the distance between any two ultra-neighbors are not enough. Gnutella
0.6 incorporates dynamic querying over limited flooding as query search technique.
In dynamic querying, an ultra-peer incrementally forwards a query in 3 steps
(TTL(1), TTL(2), TTL(3) respectively) through each connection while measuring
the responsiveness to that query. The ultra-peer can stop forwarding query at any
step if it gets sufficient number of query hits. Consequently dynamic querying
uses TTL(3) only for rare searches.
Query Routing Protocol (QRP): An ultra-peer forwards query only to it’s
3.2 Gnutella 0.6 Protocol 10
leaf neighbors which likely to have a result, the technique is called QRP. A leaf-
peer creates a hash table of all the files it is sharing and sends to all the connected
ultra-peer/s. As a result, when a query reaches to an ultra-peer it is forwarded
to only those connected leaf-peers which would have query hits. So searching is
performed only at the ultra-peer layer, since ultra-peers contain the indices of their
children.
Basic Handshake Protocol: Many softwares (clients) are used to access the
Gnutella network (like Limewire, Bearshare, Gtk-gnutella). The most popular
client software, Limewire’s handshake protocol is used in our simulation as a base
handshake protocol. Through handshaking, a peer establishes connection with
any other ultra-peer. To start handshake protocol a peer first collects the address
of an online ultra-peer from a pool of online ultra-peers. A peer can collect the
list of online peers from hardcoded address/es and/or from GwebCache systems
[4] and/or through pong-caching and/or from its own hard-disk which has obtain
list of online ultra-peers in the previous run [6]. The handshake protocol is used
to make new connections. A handshake consists of 3 groups of headers [1, 2]. The
steps of handshaking is elaborated next:
1. The program (peer) that initiates the connection sends the first group of
headers, which tells the remote program about its features and the status to
imply the type of neighbor (leaf or ultra) it wants to be.
2. The program that receives the connection responds with a second group
of headers which essentially conveys the message whether it agrees to the
initiator’s proposal or not.
3. Finally, the initiator sends a third group of header to confirm and establish
the connection.
3.3 Simulated Gnutella Network 11
This protocol is modified in this thesis to overcome the problem of message over-
head.
3.3 Simulated Gnutella Network
To generate existing Gnutella network, we have simulated a strip down version of
Gnutella 0.6 protocols which follows Limewire [1]. Our implementation is in C-
language and in Linux platform. When a new node wants to join in the network,
a separate thread is allotted for bootstrapping and neighbors selection process for
that peer. We have followed the basic handshake protocol described in chapter 3
for bootstrapping process. The similar way we have generated cycle-5 networks
by following HPC5 protocol. Our simulated Gnutella network exhibits all fea-
tures (like degree distribution, diameter, average path length between two peers,
proportion of ultra-peers, clustering coefficient, etc.) that are related to Gnutella
network which are obtained from the snapshots collected by crawlers [3, 16, 17].
The properties of the Gnutella and simulated Gnutella networks are given in the
table 3.2 [1, 17].
Table 3.2: Properties of Gnutella and Simulated Gnutella Network
Property Gnutella Simulated Gnutella
No. of Peers 2000k 100kUltra-peer ratio 15-16% of total peers 15% of total peersAvg. Diam. of ultra-layer 6-7 4-5Maximum connections ultra-ultra 32 ultra-ultra 32
ultra-leaf 30 ultra-leaf 30leaf-ultra 3 leaf-ultra 3(Applicable to Limewire)
Average Connections duu: 25-26 duu: 22-23dul: 20-22 dul: 17-18dlu: 3-4 dlu: 3
Clustering Coeff. 0.02 0.08-0.10
C H A P T E R 4
Handshake Protocol for Cycle-5
Networks (HPC5)
In this chapter, we first discuss our handshake protocol for cycle-5 networks in
Section 4.1. In Section 4.2 we discuss about the hurdles to implement the pro-
tocol (HPC5) in the real-life networks (Gnutella). Then we evaluate the search
performance and robustness of HPC5 protocol generated networks (cycle-5 and
cycle-6 networks) in Section 4.3. Finally in Section 4.5, we compare our approach
with the DCMP protocol [18] which is very similar to our work.
4.1 Handshake Protocol: HPC5
We modify the basic handshake protocol to generate cycle-5 networks. Fig. 4.1
illustrates the proposed HPC5 graphically. In Fig. 4.1, peer-1 requests other
online ultra-peers to be its neighbor, given that, peer-2 is already a neighbor of
peer-1. In Fig. 4.1(a) and 4.1(b), the possibility of the formation of triangle and
4.1 Handshake Protocol: HPC5 13
quadrilateral arises if a 1st or 2nd neighbor of peer-2 is selected. However, this
possibility is discarded in Fig. 4.1(c) and a cycle of length 5 is formed.
1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
(c)(a) (b)
Figure 4.1: Selection of neighbor by an ultrapeer-1 after making peer-2 as a neigh-bor.
Each peer maintains a list of its 1st and 2nd neighbors, which contains only
ultra-peers (because a peer only sends request to an ultra-peer to make neigh-
bor). The 2nd ultra-neighbors of a leaf-peer represents the collection of 1st ultra-
neighbors of the connected ultra-peers. To keep updated knowledge, each ultra-
peer exchanges its list of 1st neighbors periodically with its neighbor ultra-peers
and sends the list of 1st neighbors to its leaf-peers. To do this with minimal
overhead, piggyback technique can be used in which an ultra-peer can append
its neighbor list to the messages passing through it. The three steps of modified
handshake protocol (HPC5) is described below.
1. The initiator peer first sends a request to a remote ultra-peer which is not
in its 1st or 2nd neighbor set. The request header contains the type of the
initiator peer. The presence of remote peer in 2nd neighbor set implies the
possibility of 3-length cycle. In Fig. 4.1, peer-1 cannot send request to peer
2 or 3, on the other hand peer 3 & 4 are eligible remote ultra-peers.
2. The recipient replies back with its list of 1st neighbors and the neighbor-hood
acceptance/rejection message. If the remote peer discards the connection in
this step, the initiator closes the connection and keeps the record of neighbors
of the remote peer for future handshaking process. On acceptance of the
invitation by the remote-peer, the initiator peer performs the following tasks.
4.2 Hurdles in Implementing the Scheme 14
3. The initiator peer checks at least one common peer between its 2nd neighbor
set (say, A) and the 1st neighbor set of the remote peer (say, B). A common
ultra-peer between sets A and B indicates the possibility of 4-length cycle.
If no common peer is present between sets A and B then the initiator sends
accept connection to remote peer.
Otherwise the initiator sends reject connection to remote peer.
HPC5 prevents the possibility of forming a cycle of length 3 or 4 and generates a
cycle-5 network.
The generalized version of the protocol HPCr (r > 3) is as follows. To generate
cycle-r networks, each ultra-peer keeps up to (r − 3)th neighbors lists in its data-
structure and exchanges up to (r − 4)th neighbors lists with its immediate ultra-
neighbors. Obviously, the bandwidth requirement will be more as r increases. For
example, to generate cycle-6 networks each ultra-peer exchanges its list of both
1st and 2nd neighbors periodically with its neighbor ultra-peers and sends the list
of both 1st and 2nd neighbors to its leaf-peers. In result, each peer contains the
information up to its 3rd neighbor ultra-peers.
4.2 Hurdles in Implementing the Scheme
In this section we mainly discuss the hurdles of implementing cycle-5 networks.
Before embedding HPC5 in Gnutella network, several important questions have to
be taken into consideration to assess the viability of HPC5. The most important
of them are listed below.
1. Is this scheme compatible with the current populations of Gnutella network?
2. On average, how many trials are required to get an ultra-neighbor?
4.2 Hurdles in Implementing the Scheme 15
3. Is there any possibility of inconsistency and if so, how can such inconsistency
be removed?
Each of the questions are discussed one by one.
4.2.1 Compatibility with the Current Population of Gnutella
From ultra-peer point of view, the total number of ultra-leaf connections is U · dul
and from leaf-peer point of view it is L · dlu. By equating both, we get
U · dul = L · dlu
U · dul = (N − U) · dlu
N = U · (dul + dlu)
dlu
(4.1)
Fig. 4.2 represents a part of an ultra-peer layer where P has immediate neighbors
at level Q. Suppose, P is already connected with (duu − 1) number of ultra-peers
at Q level and wants to get dthuu ultra-neighbor. According to HPC5, P should not
connect to any ultra-peer from R or S level as its next neighbor. However, T can
be a neighbor of P.
P T S R Q
Figure 4.2: A part of an Ultra-peer layer, where a node represents all nodes thatare present in that level. Like, Q represents all 1st neighbors of P
Thus, we can say that if P wants to make a new ultra-neighbor then P has to
exclude at most (duu−1), (duu−1)2 and (duu−1)3 number of ultra-peers from Q, R
4.2 Hurdles in Implementing the Scheme 16
and S level respectively. So, total [(duu−1)+(duu−1)2 +(duu−1)3] ≈ d3uu number
of peers cannot be considered as next neighbor(s) of P. Therefore the number of
ultra-peers in the network needs to be at least
U ≈ d3uu (4.2)
From equations 4.1 and 4.2 we get
N ≈ (dul + dlu) · d3uu
dlu
(4.3)
Presently Gnutella network is having the population of almost 2000k of peers
at any time [1]. From equation 4.2 it can be seen that for the present values
of duu, dul & dlu (table 3.2), 120-130k peers are sufficient to implement HPC5
protocol. However, to form cycle-6 networks (HPC6) the number of peers
N ≈ (dul + dlu) · d4uu
dlu
required is more than 2000k. Hence the current population will not be able to
support any such attempts.
4.2.2 Hit Ratio
Hit ratio is defined as the inverse of the number of trials required to get a valid
ultra-peer neighbor. As our protocol puts some constraints on neighbor selection,
a contacted agreeing remote ultra-peer may not be selected as neighbor. Math-
ematically, on an average if a peer (say, P) is looking for its kth ultra-neighbor
and the mthk contacted ultra-peer satisfies the constraints and becomes kth ultra-
neighbor of P, then the hit ratio for kth neighbor will be Hk = 1mk
. We first make
a static analysis of hit ratio, then fine tune it considering that the network is
4.2 Hurdles in Implementing the Scheme 17
evolving.
At the time of kth ultra-neighbor selection in HPC5, a peer (say, P) does not
consider its 1st ((k − 1) ultra-peers) and 2nd ((k − 1)(duu − 1) ultra-peers) ultra-
neighbors as a potential neighbor and this exclusion is locally done by checking
the 1st & 2nd neighbors lists of P. The number of ultra-peers excluded (U ′) is
[(k−1)+(k−1)(duu−1)]. According to step-3 of HPC5, P cannot make neighbor
from any ultra-peer of level S (3rd ultra-neighbors of P) of Fig. 4.2 as its neighbor
which are U ′′ = [(k− 1)(duu− 1)2] in number. So, total [U ′+U ′′] number of ultra-
peers are excluded. We assume that the probability of getting any ultra-peer is
uniform. So hit ratio can be given as
Hk =U − (U ′ + U ′′)
U − U ′
Assuming U ′ ¿ U, U ′ ¿ U ′′ and U ′′ ≈ d2uu · (k − 1)
Therefore Hk becomes
Hk ≈ U − d2uu · (k − 1)
U(4.4)
The upper bound of k and consequently average ultra-degree differs in leaf-peer
and ultra-peer. To generalize further calculations, let m be the average ultra-
degree of a peer. So, average hit ratio is
〈H〉 =1
m·
m∑
k=1
Hk
=1
m·
m∑
k=1
[1− d2uu · (k − 1)
U]
= 1− d2uu · (m− 1)
2 · U (4.5)
4.2 Hurdles in Implementing the Scheme 18
The equation 4.5 shows the average hit ratio of peer joining the network when the
population of ultra-peers in the network is U . It also reflects that (1 − 〈H〉) is
inversely proportional to the number of ultra-peers (U) in the complete network.
Now as each node joins, the network grows. As a result the average hit ratio
changes with the network growth. Therefore evolved hit ratio is the average value
of all average hit ratios which are calculated at each growing stages of the network.
Let U0 and Un be the number of ultra-peers in the initial and final networks. So,
evolved hit ratio is
〈Hev〉 =1
Un − U0
·Un∑
Ui=U0
〈H〉
= 1− d2uu · (m− 1)
2 · (Un − U0)·
Un∑Ui=U0
1
Ui
(4.6)
Now,
n∑
k=1
k =
(1
1
)+
(1
2+
1
3
)+
(1
4+
1
5+
1
6+
1
7
)...
≤ 1 + 1 + 1 + ... log n terms
≤ log n
Therefore,
n∑
k=m
k ≤ log n− log m
≤ log n/m
and the equation 4.6 becomes
〈Hev〉 ≈ 1− d2uu · (m− 1)
2 · (Un − U0)· log (Un/U0) (4.7)
4.2 Hurdles in Implementing the Scheme 19
From equations 4.5 and 4.7 we get,
〈Hev〉 ≤ 〈H〉
As d and the maximum value of m are bounded, the value of 〈Hev〉 increases with
U . Again we have tested this phenomenon through our simulation and plotted the
evolved hit-ratio against the network size of 200k-1000k in Fig 4.3 and observed
the similarity between them. The similarity is not pronounced in the beginning
as the approximations made to develop equations 4.4 and 4.7 play major role in
smaller networks.
0
0.2
0.4
0.6
0.8
1
400000 600000 800000 1e+06
Hit
Rat
io
Number of Peers
Theo: Evolved Hit RatioEmp: Evolved Hit Ratio
Figure 4.3: Theoretical and empirical hit ratio of a peer against number of peersin the network
4.2.3 Consistency Problem
Periodically exchanging the list of neighbors facilitates the peers to get up-to-date
information about their neighbors. In between two successive updates, a peer
may possibly have erroneous knowledge about it’s neighbors. As a result, this
4.2 Hurdles in Implementing the Scheme 20
inconsistency of the network leads to the presence of 3-length or 4-length cycles.
Parallel update is possible when many peers enter simultaneously or there is a
huge failure/attack in the network whereby many nodes have lost their neighbors
and would now like to gain some.
1. Parallel update: In parallel update, due to inconsistency, smaller length
cycles are formed while multiple peers from the same cycle and any other
peer are handshaked in parallel to become each other’s neighbor. In Fig.
4.4(a), peer-1 and peer-5 execute the following actions according to steps of
HPC5 and form smaller length cycle.
(a) Both peer-1 and peer-5 find that peer-P is a valid remote peer to contact
and both send request to P.
(b) Peer-P gets their request more-or-less at the same time and sends back
the neighbor-hood status to them.
(c) As peer-1 and peer-5 do not know each other’s activity or updated
status, they make P as their new neighbor, therefore a cycle-3 is formed
due to this inconsistency.
Similarly smaller cycles may be created when multiple peers contact each
other as a directed cycle (as in Fig. 4.4(b)) within the period of two succes-
sive updates.
P
1 2
3
45
3
3
2
1
(a) (b)
Figure 4.4: A part of cycle-5 network, representing parallel update inconsistency
4.2 Hurdles in Implementing the Scheme 21
2. Inconsistency arising in the face of failure/attack: Here we discuss about
the topological status of the network when x fraction (where x ¿ 1) of
nodes are left/removed from the network. We assume that the nodes have
left uniformly from the different parts of the network. So each peer loses a
fraction of its neighbors and in effect the average degree of a peer in the net-
work becomes less. To maintain the degree distribution of the network, each
peer contacts other remote ultra-peers to fulfil neighbor deficiency. During
this process, 3-length and 4-length cycles are created temporarily due to
inconsistency between two successive updates. We calculate the effects of
inconsistency due to peers removal from an ultra-peer point-of-view, so in
this section the term peer will be used to represent an ultra-peer.
• 3-length cycle: A 3-length cycle is created if two neighbor peers (say, peer-
1 and peer-5) and another remote peer (say, P) get involved in HPC5 as in
Fig. 4.4. The initiation of handshake protocol in different combinations
among peer-1, peer-5 and P may create triangle. Here in calculation we
are following the combination shown in Fig. 4.4(a). After removal process
Urem = (U − xU) number of ultra-peers remain in the network. According
to HPC5, a peer cannot make any ultra-peers at level Q, R or S in Fig. 4.2
as its neighbor. So the probability of selecting P as neighbor by peer-1 is
P0 =Urem − [duu(1− x)]3
Urem
The probability of choosing the same ultra-peer P as neighbor by any neigh-
bor of peer-1 (here peer-5) is
P1 ≈ 1
Urem − [duu(1− x)]3
4.2 Hurdles in Implementing the Scheme 22
So the probability of forming a 3-length cycle is
Pt = P0 · P1 ≈ 1
Urem
Therefore, the average number of 3-length cycles are created around an ultra-
peer is [(duu + dul)(1− x)/Urem] and total number of 3-length cycles are
formed in the network is
L3 = O((duu + dul)(1− x)
Urem
· Urem)
= O((duu + dul)(1− x)) = O(duu)
(O(f) represents big-oh(f)).
• 4-length cycle: A 4-length cycle is created if P and one of its 2nd ultra-
neighbor or any two 1st neighbors (leaf or ultra) of P contact T and become
neighbors. Similar to 3-length cycles calculation, the average number of
4-length cycles are created around an ultra-peer is
[duu(1− x)]2 + duudul(1− x)2
Urem
and total number of 4-length cycles are formed in the network is
L4 = O([duu(1− x)]2 + duudul(1− x)2) = O(duu2)
So the total number of smaller length cycles are created due to nodes removal
is
L = L3 + L4 = O(duu2)
which is very less compared to network size.
4.3 Performance Evaluation 23
In the same way we can prove the effect of inconsistency from leaf-peer point-
of-view is O(duu · dlu).
The inconsistency is removed in the following manner. After getting next up-
to-date information from neighbors, a peer removes the neighbor/s responsible for
creating smaller cycles. The elimination of smaller length cycles will be acceler-
ated if the gap between two successive updates is minimized. This is a message
overhead which the design engineers have to decide. However the presence of a
small percentage of smaller length cycles in the network is tolerable, as they do
not affect much on the performance.
4.3 Performance Evaluation
To validate our approach, we have performed numerous experiments. We have
taken different sizes (up to 800k nodes) of networks and performed experiments
on those networks several times to obtain the average behavior. Through these
experiments, we have shown that our approach is better than the existing proto-
cols. In this section we have analyzed search performance of cycle-5 and cycle-6
networks through simulation. By these analyses we will get an idea of study the
performance of cycle-r (r > 6) networks.
Metrics: The efficiency of search algorithms can be measured using various met-
rics such as success rate, average number of hops required to get results (response
time), message complexity, network coverage (#nodes discovered), percentage of
message duplication, etc [10]. In our simulations, we have used message complexity
and network coverage as performance metrics.
Definition 4.1 Message complexity is defined as the average number of mes-
sages required to discover a peer in the overlay network.
4.3 Performance Evaluation 24
Definition 4.2 Network coverage implies the number of unique peers explored
during query propagation in limited flooding.
Performance Metrics: We have plotted the network coverage and message
complexity (y-axis) with TTL(2) and TTL(3) flooding respectively against the
size of the network (x-axis). We have shown the performance separately for the
query originated from leaf-peers and ultra-peers. To get the overall performance
of the network, we have chosen the number of ultra-peers and leaf-peers for query
flooding in the same ratio. The performance of the network is greatly influenced
by the value of TTL used in search and thus we have discussed the performance
metrics based on TTL(2) and TTL(3) separately. The search performance (spe-
cially message complexity) also depends on the implementation of QRP technique.
With QRP technique, searching is performed only at the ultra-peer layer, since
ultra-peers contain the indices of their children [1, 2]. So, the measurement of
message complexity at the ultra-peer layer is more appropriate to compare results
with Gnutella networks.
It is clear from Fig. 4.5 that cycle-5 and cycle-6 networks are better than
cycle-3 networks in terms of both message complexity and network coverage. In
cycle-5 & cycle-6 networks, the network coverage is approximately 15-20% more
than that of cycle-3 networks in both TTL(2) and TTL(3) searches which is shown
in Fig. 4.5(a) and 4.5(b). Network coverage of cycle-6 networks is little more than
that of cycle-5 networks. The ultra-peer layer message complexity is shown in
Fig. 4.5(c). In TTL(2) message complexity of both cycle-5 & cycle-6 networks
is very close to 1, whereas in cycle-3 network it is almost 10-15% more. Message
complexity of cycle-3 networks with TTL(3) is almost 6-7% more than that of
cycle-5 networks and almost 20% more than that of cycle-6 networks. Cycle-6
networks show slightly better performance than cycle-5 networks in TTL(3).
Relative search performance of cycle-5 and cycle-6 networks is shown in the
4.3 Performance Evaluation 25
10000
20000
30000
40000
50000
200000 400000 600000 800000
Net
wor
k C
over
age
Number of Peers
TTL-2: Network Coverage
Cycle-3Cycle-5Cycle-6
(a) Network Coverage with TTL(2)
100000
200000
300000
400000
200000 400000 600000 800000
Net
wor
k C
over
age
Number of Peers
TTL-3: Network Coverage
Cycle-3Cycle-5Cycle-6
(b) Network Coverage with TTL(3)
1
2
3
4
200000 400000 600000 800000
Mes
sage
Com
plex
ity
Number of Peers
TTL(2):Cycle-3TTL(3):Cycle-3TTL(2):Cycle-5TTL(3):Cycle-5TTL(2):Cycle-6TTL(3):Cycle-6
(c) Message Complexity
Figure 4.5: Comparison on Network Coverage and Message Complexity
4.4 Network Analysis 26
table 4.1 which reflects that any cycle-r (r > 3) network is better than that of cycle-
3 networks in terms of search performance and larger (r) gives higher performance
with larger TTL search. Gnutella network uses dynamic querying over limited
flooding and uses TTL(3) rarely for rare searches only. Also to maintain cycle-6
networks (HPC6) required more bandwidth. So cycle-5 networks are good enough
in performance to save bandwidth. For that in the next sections we mainly focus
on cycle-5 networks.
Table 4.1: Simulation Results for Search Performance
Cycle-5 Cycle-6
Search Perf.
Network Coverage (TTL-2) 15-20% more 15-20% moreNetwork Coverage (TTL-3) 15-20% more 15-20% moreMessage Complexity (TTL-2) 10-15% less 10-15% lessMessage Complexity (TTL-3) 6-7% less almost 20% less
4.4 Network Analysis
In this section we analyze the evolved networks. We compare the structural prop-
erties of the evolved networks with that of the base network and try to understand
various performance related issues.
Clustering Coefficient: The density of cycles with different sizes are rep-
resented by a more generalized term, called Clustering Coefficient (CC) of the
network. A high CC implies a large number of redundant messages in flood-based
querying [17]. The standard definition of CC gives the probability that two near-
est neighbors of the same node are also mutual neighbors. There is also a general
definition for higher order CC. The CC of order x for the node i is the probability
that there is a path of length x between two neighbors of the node i [5]. If the
4.4 Network Analysis 27
number of x distances is Ei(x), then x order CC of the node i is
Ci(x) =2Ei(x)
ki(ki − 1)
and the CC of the whole network C(x) is average of all Ci(x)′s [5, 12, 13].
We have plotted the distribution of CCs of different order for different type
of networks with 60k peers as in Fig. 4.6. The distribution looks like Gaussian
curve (bell shaped curve) [5]. In case of cycle-3 networks, 2nd order CC is very
high. On the other hand in cycle-5 networks, both 1st and 2nd order CCs become
zero but 3rd and 4th order CCs are very high. From the definition, the value of
rth order CC is related to the density of (r + 2) length cycles. Also during query
propagation, the number of generated redundant message with TTL(r) (r > 1) is
influenced by the CCs of order 1st to (2r−2), which are related to cycles of length
3 to 2r. As a result in cycle-3 networks, query starts to generate lots of redundant
messages from TTL(2) onwards and hence reduces the network coverage. In cycle-
5 network, it shows very good performance with TTL(2) but message complexity
becomes high from TTL(3) onwards. We could have got better performance with
TTL(3) if our algorithm incorporated the capability of tuning distribution of CCs.
Robustness of the topology: In order to test the robustness of the network
evolved through HPC5, we have considered networks with 50k nodes and plotted
the percentage of peers belonging to the largest component against the percentage
of peers removal. We have removed peers in two ways: (i) random removal and (ii)
pathologically removing the highest-degree nodes first [17]. The upper (right) two
curves in Fig. 4.7 represents the effect of random removal, whereas lower (left)
two curves represent pathological removal. Both cycle-3 and cycle-5 networks
are extremely resilient to random removal and the largest connected component
still contains 80% of existing peers even after removing almost 75-80% of nodes.
4.5 Comparison with Zhu’s work 28
1
10
100
1000
10000
100000
1e+06
0 1 2 3 4 5 6 7 8 9
CC
*106 in
logs
cale
Order of CC
Clustering Coefficient Distribution
Cycle-3Cycle-4Cycle-5
Figure 4.6: Distribution of Higher order Clustering Coefficient
The networks gets fragmented only after 85% peers removal. But in pathological
removal, both cycle-3 and cycle-5 networks gets fragmented only after removing
6-7% of total nodes. However, through our empirical study, we can say that
robustness is similar for both cycle-3 and cycle-5 networks.
4.5 Comparison with Zhu’s work
Zhu et al. very recently presented a distributed algorithm DCMP in [18] to elim-
inate a fraction of smaller length cycles from the network. The main differences
between DCMP and HPC5 are described below.
• DCMP is activated by a peer only after getting a duplicate message and it
generates some control messages to detect a suitable edge to delete from the
cycle for which duplicate message was produced. So, the network always
contains some smaller length cycles. On the other hand HPC5 initiated
whenever a new neighbor is required and ensures that the topology is always
free from cycle-3 and cycle-4.
4.5 Comparison with Zhu’s work 29
0
10
20
30
40
50
60
70
80
90
100
1 10 100
Size
of
the
Lar
gest
Com
pone
nt(%
)
Percentage of nodes removed
Robustness on Random and Pathological Removal
Random in Cycle-3
Random in Cycle-5
Patho. in Cycle-3
Patho. in Cycle-5
Figure 4.7: Comparison of Robustness between cycle-3 and cycle-5 Networks
• DCMP is a general approach to eliminate cycles of length up to (2 ∗ TTL).
But they have shown that only removing cycle-3 and cycle-4 provides best
performance which is supporting our cycle-5 networks.
• Due to the elimination of edges in DCMP the average distance between two
peers increases and diameter of the network too. In simulation with 10k
peers they have shown that to cover almost 100% of the network coverage
require TTL(8) flooding. In our simulation we have seen that the diameter
for 10k nodes is 4-5 and to cover whole network TTL(4− 5) is sufficient.
• In DCMP, the networks are not as robust as Gnutella network.
• In DCMP, after elimination no rewiring is followed which decreses the aver-
age degree of peers, as well as the degree distribution of the network.
So, HPC5 generated networks gives best performance in DCMP without disturbing
the parameters and robustness of Gnutella networks.
C H A P T E R 5
Index Table Replication
Data availability in the unstructured P2P networks can be improved in many ways
like better search technique, higher TTL search, replication, etc. If multiple copies
of data or index table of the data files exist in multiple peers, then the chance of
at least one copy being accessible is increased. Our aim in this chapter to increase
query hit probability (QHP) with less latency and search cost. To achieve this we
use here index table replication technique.
Section 5.1 discuss on the replica placement technique (RPT) that Gnutella
network follows. We describe some of the replica placement techniques in Section
5.2 and propose a search mechanism for the replication implemented cycle-5 net-
works in Section 5.3. Section 5.4 estimate the time duration to update replica to
keep consistency. We select an appropriate RPT and number of replica through
simulation in Section 5.5.
5.1 Replication in Gnutella Network 31
5.1 Replication in Gnutella Network
Objects or files are not replicated in the Gnutella network, only peers that request
an object make copy of the object. Gnutella supports only index table replication.
An ultra-peer creates a master index table (MIT) using own index table and the
index tables collected from its connected leaf-peers through QRP protocol. So a
query is forwarded to only those leaf-peers whose index table matches the query.
The replication is performed only at the ultra-peer layer where each ultra-peer in
the Gnutella network exchanges their MIT with immediate neighbor ultra-peers
which saves one TTL search [1].
5.2 Replica Placement Techniques
In each RPT two questions have to be answered,
1. How many replicas of a MIT is required to fulfill the requirements ?
2. where should the new replicas be created ?
It is obvious that in any scheme more number of replicas increases the number of
query hits. But maintaining more replica requires more storage and bandwidth.
In the sequel, different RPTs are being described.
1. Random RPT: In this scheme each ultra-peer replicates its MIT to a con-
stant number of randomly selected ultra-peers. In literature [10] it has been
found that this scheme is very effective in Gnutella like networks.
2. 1st Neighbor set RPT: In this scheme each ultra-peer replicates its MIT to
a subset of its 1st ultra-neighbor set. Current Gnutella protocol follows this
scheme.
5.3 Search Technique (Flooding) in Cycle-5 Networks with Replication32
3. 2nd Neighbor set RPT: In this scheme each ultra-peer replicates its MIT to
a subset of 2nd ultra-neighbors set. This saves two TTL search.
We prefer 2nd-neighbor RPT for our cycle-5 networks and the reasons are described
in section 5.5. So the search technique and replica update procedure are described
with respect to 2nd-neighbor RPT networks.
5.3 Search Technique (Flooding) in Cycle-5 Net-
works with Replication
The replication is performed only at the ultra-peer layer, where each ultra-peer
replicates own MIT to the constant number of other ultra-peers. The search
technique is similar to normal search of cycle-5 networks (dynamic query search
over limited flooding). The goals in this proposed search technique are 1) high
QHP, 2) low latency and 3) low search cost. With replication, search technique
(with TTL(2)) is performed in 5 steps,
1. TTL(2) : A leaf-peer forwards the query to its connected ultra-peers. An
ultra-peer searches in its MIT and forwards query to the selected leaf-peers
(whose index table matches the query) and collects results (say, R1) from
them. Ultra-peer also searches in the replicas for suitable matching of the
query and collects the matched ultra-peer’s addresses (say, Um1). The ultra-
peer sends back the combined results (R1+Um1) to the initiator and forwards
the query to the connected ultra-neighbors with one less TTL .
2. TTL(1) : In this step each ultra-peer has same behavior as in step(1), collects
results (R2 + Um2) and sends back to the initiator through reverse overlay
path. The query is forwarded to the next level of ultra-peers with TTL(0).
5.4 Replica Update 33
3. TTL(0) : The ultra-peer which receives a query with TTL(0) performs the
same activity as step(1), collects and sends back the results (R3 + Um3) to
the initiator through reverse overlay path. No forwarding of query to the
next level ultra-peers is performed in this step.
4. Contact Peers: The initiator continuously collects and checks the results.
Search results present in Ri is confirmed, whereas results in Umi are probable
(because these results are returned by matching the hash table only, which
does not guarantee for exact desired result). That is why the initiator peer
first count the number of results present in (R1 + R2 + R3). If number
of results are sufficient then the query is stopped. Otherwise it starts to
send query with TTL(−1) to the ultra-peers in (Um1 + Um2 + Um3) using
the underlying networks (not overlay networks) in multiple iterations and
collects results. The iteration will go on until sufficient number of results
are collected or all the peers in Umi are contacted.
5. TTL(−1) : The ultra-peer which receives a query with TTL(−1) only searches
in its master index file and forwards query to the selected leaf-peers and col-
lects results (R4) from them and sends them back directly to the initiator
using underlying networks. No replica is sent in this step.
5.4 Replica Update
Periodically each ultra-peer sends its MIT to its 2nd-neighbors to make MIT up-
to-date. Piggyback technique can be used to do this with minimal overhead, in
which an ultra-peer appends its MIT to the results when a query comes from
its 2nd-neighbor. Now we want to calculate the period of update for 2nd-neighbor
RPT to keep the traffic as it is. Gnutella’s TTL(3) search approximately generates
d3uu number of messages for each query. Let, n be the average number of query
5.5 Simulation 34
issued in the network in one unit time. So on average total (n · d3uu) number of
messages are generated in the network at each unit time. On the other hand,
each ultra-peer updates their MITs with d2uu number of ultra-peers in the cycle-5
network with 2nd-neighbor RPT. So in each period of update (say, T unit time)
it expenses (U · d2uu) number of replica-messages and (T · n · d2
uu) number of query
messages (neglecting the number of messages send with TTL(−1)). Therefore, to
keep number of messages as it is
[(U · d2uu) + (T · n · d2
uu)] ≤ (T · n · d3uu)
T ≥ U
n · (duu − 1)(5.1)
The equation 5.1 indicates that T is inversely proportional to the query rate (n).
5.5 Simulation
We simulate different RPTs to find appropriate RPT. We also investigate the
effects of number of replicas in the search performance. For experimentation we
have taken several cycle-5 networks of size 100k-1000k.
5.5.1 RPT Selection:
QHP with TTL(2) of different RPTs with a constant number of replicas (in our
simulation it is 20) and without replication has been shown in the Fig. 5.1(a).
QHP with replication is 10-20 times higher than without replication. In the figure
both random and 2nd-neighbor RPTs are almost same. The QHP of 1st-neighbor
RPT is almost 10% lesser than that of 2nd-neighbor RPT. So in this respect
random and 2nd-neighbor RPTs are slightly better than that of 1st-neighbor RPT.
On the other hand we cannot increase the number of replicas for each MIT above
5.5 Simulation 35
duu in case of 1st-neighbor RPT. So even we have sufficient bandwidth for more
replicas or efficient utilization (using piggyback technique to exchange replicas) of
bandwidth, we can not reach to the desired QHP through 1st-neighbor RPT with
lower TTL (Fig. 5.1(b)).
In the Fig. 5.1(b) we have shown the QHP without and with different RPTs.
In simulation of random RPT, each ultra-peer replicates its MIT to 500 number of
other ultra-peers which are selected randomly. As described in paper [10] random
RPT performs outstanding. The QHP in random RPT with TTL(2) for smaller
size networks is almost 1, even for larger networks (1000k peers) it is more than
80%. In 2nd neighbor RPT each ultra-peer replicates its MIT to all of its 2nd ultra-
neighbors. So average number of replicas of each MIT is (d2uu). In this technique,
the QHP (with TTL(2)) for smaller number of replicas is almost similar to random
RPT. However, for large number of replicas QHP of 2nd neighbor RPT is little
less (in our simulation it is almost 10%) than that of random RPT. The QHP
with TTL(1) in 2nd neighbor RPT is equivalent to QHP of TTL(3) search without
replication implementation.
Although in simulation, random RPT is better than 2nd-neighbor RPT, we
prefer to follow later one because of the limitation in uniform random ultra-peer
selection in Gnutella network without having global knowledge. The ultra-peer
selection depends on how a peer discovers other online peers. The ultra-peer se-
lection is not uniform even if a peer collects list of ultra-peers from hardcoded
addresses and/or from GwebCache systems. So in real networks the apparent ran-
dom selection is biased and the randomness occurs within a small set of ultra-peers
due to unavailability of global knowledge. On the other hand 2nd-neighbor RPT
is deterministic and maintaining replicas are also easy as availability of updated
2nd-neighbors in period basis.
To summarize, the benefits from 2nd-neighbor RPT are as follows. First, the
5.5 Simulation 36
0
0.1
0.2
0.3
200000 400000 600000 800000
Que
ry H
it Pr
obab
ility
Number of Peers
Random RPT2nd Neighbor RPT1st Neighbor RPT
Without Replication
(a) Comparison of RPTs for 20 Replicas with TTL(2)search
0
0.2
0.4
0.6
0.8
1
200000 400000 600000 800000 1e+06
Que
ry H
it Pr
obab
ility
Number of Peers
Without Replication: TTL(3)Random Replication: TTL(2)
2nd-Neighbor Replication: TTL(2)2nd-Neighbor Replication: TTL(1)1st-Neighbor Replication: TTL(2)1st-Neighbor Replication: TTL(3)
(b) Comparison of RPTs
0
0.2
0.4
0.6
0.8
1
100 200 300 400
Que
ry H
it Pr
obab
ility
Number of Replica
2nd-Neighbor ReplicationRandom Replication
(c) QIP Based on No. of Replicas with TTL(2)
Figure 5.1: Query Hit probability
5.5 Simulation 37
QHP increases by almost 4-6 times with respect to normal TTL(3) search. Second,
the search latency is minimized. It is actually doing search two TTLs ahead. As
a result the popular searches complete in TTL(1) and rare searches in TTL(2).
Third, search cost is minimal. The search is performed with maximum of TTL(2)
and in cycle-5 networks the message complexity at TTL(2) is nearly 1. On the
other side, it uses physical network to contact the peers in Umi. As a result
the message complexity of the whole search becomes close to 1. Fourth, the
search returns only the required number of result (not much more results than
necessary, overshooting problem). In the TTL(−1) step, the search is performed
incrementally and stops after getting required number of results. This benefit
conflicts with the latency. Due to use of multiple iterations in this step search
latency may be high. So the design engineers decide the number of contacts at
each iteration in TTL(−1) step to balance the latency.
5.5.2 Number of Replica Selection
In Fig. 5.1(c) we have plotted QHP based on number of replicas of each MIT for
both random and 2nd-neighbor RPT. For this experimentation we take a cycle-5
network of 500k peers and in each iteration the number of replica increases by 10.
In the figure QHP increases very rapidly at the initial stages and reaches almost at
0.8 in random RPT (0.7 in 2nd-neighbor RPT) with 200 number (which is almost
0.5∗d2uu) of replicas. Hereafter both the curves rise slowly as the number of replicas
increases. So depending upon the requirement designers select optimum number
of replicas.
C H A P T E R 6
Hit Ratio Based Design
Hit ratio is one of the main hurdle in the implementation of our approach. In the
equation 4.7, the average evolved hit ratio inversely indicates the cost of neighbor
selection. In some applications/protocols higher hit ratio is required or in some
protocol the cost of collecting information about online ultra-peers is large. The
problem of hit ratio can be solved by providing some level of flexibility in the
neighbor selection. A peer may be allowed to select some neighbors in spite of
creating smaller length cycles in the network. In other words, some neighbors of
a peer are selected using HPC5 and rest of the neighbors are randomly selected,
which may form smaller length cycles.
Let m be the average ultra-degree of a peer and at most x fraction of ultra-
neighbors follow HPC5 to get desired hit ratio. So, mx number of neighbor selec-
tion is done with hit ratio 〈Hev〉 and rest of the neighbors with hit ratio 1 (as they
are selected in random). Therefore average number of iterations required to get a
valid neighbor is
Tx <1
m· [mx · 1
〈Hev〉 + m(1− x) · 1]
39
Let 〈Hd〉 be the desired hit ratio. Therefore,
1
〈Hd〉 ≤ Tx
1
〈Hd〉 <1
m· [mx · 1
〈Hev〉 + m(1− x) · 1]
x >
[1
〈Hd〉 − 1
]/[1
〈Hev〉 − 1
](6.1)
From the equation 6.1 we can claim that if at most
rx =
[1
〈Hd〉 − 1
]/[1
〈Hev〉 − 1
]
fraction of ultra-neighbors of each peer follows HPC5 then the resultant hit ratio
will be at least 〈Hd〉.We have plotted desired (〈Hd〉) and experimental hit ratio with respect to rx
in Fig. 6.1(a). For this experiment we take a cycle-5 network of 500k peers and
calculate the minimum hit ratio (which is our initial 〈Hd〉) using the equation
4.7. Then after in each iteration we increase 〈Hd〉 by 0.05, calculate rx and find
experimental hit ratio. In the Fig. 6.1(a) experimental results are always higher
than desired hit ratio (〈Hd〉). The gap between two curves initiates the question
of fine-tuning of value rx. In our future work we want to include this.
In the Fig. 6.1(b) and 6.1(c) we have shown the relative performance of the
networks with different rx with respect to the performance of cycle-5 networks
(in which all neighbors of each peer follow HPC5). According to the expectation
the performance of the network in both network coverage and message complexity
degrades with rx decreases. In TTL(3) the degradation rate is less than TTL(2).
So there is a trade-off between performance and average hit ratio of the net-
work. A system designer sets the hit ratio depending on the requirement of the
networks/applications.
40
0
0.2
0.4
0.6
0.8
1
0 0.2 0.4 0.6 0.8 1
Hit
Rat
io
Fraction of Neighbors follows HPC5
Desired Hit RatioHit Ratio Obtained
(a) Hit Ratio
0.7
0.8
0.9
1
0 0.2 0.4 0.6 0.8 1
Rat
io o
f N
odes
Dis
cove
ry
Fraction of Neighbors follows HPC5
Nodes Disc with TTL-2Nodes Disc with TTL-3
(b) Network Coverage
1
1.1
1.2
0 0.2 0.4 0.6 0.8 1
Rat
io o
f M
esg.
Com
plex
ity
Fraction of Neighbors follows HPC5
Mesg Comp with TTL-2Mesg Comp with TTL-3
(c) Message Complexity
Figure 6.1: Effects of rx in the network
C H A P T E R 7
Conclusion and Future work
7.1 Conclusion
In my dissertation, we have presented a handshake protocol which is compatible
with Gnutella like unstructured two-tier overlay topology. We have shown that
the protocol is far more efficient than existing protocols. A relation among TTL,
minimum cycle length in the topology and network performance has been observed
and proposed. Our proposed replica placement technique is useful for popular and
rare searches with less latency and search cost.
A major fraction of internet bandwidth is occupied by Gnutella-like unstruc-
tured popular networks. P2P implementation of the 2nd generation web applica-
tions requires a huge internet bandwidth which initiates the optimum utilization
of bandwidth. In this regard our protocol can be instrumental in improving the
scalability of P2P networks.
7.2 Future Works 42
7.2 Future Works
In simulation we have shown the distribution of higher order CCs but our algorithm
does not have any control parameter to tune it to a desired distribution. In our
future work we want to develop such an algorithm. We also want to model our
approach theoretically as our future work.
7.2 Future Works 43
References
[1] Gnutella and limewire: www.limewire.org.
[2] The gnutella protocol specification 0.6: http://rfc-gnutella.sourceforge.net.
[3] Gnutella: www.gnutellaforums.com.
[4] Gwebcache system: www.gnucleus.com.
[5] Fronczak, A., Holyst, J. A., Jedynak, M., and Sienkiewicz, J. Higher order
clustering coefficients in barabasi-albert networks. Physica A 316 (December 2002), 688–
694.
[6] Karbhari, P., Ammar, M. H., Dhamdhere, A., Raj, H., Riley, G. F., and Zegura,
E. W. Bootstrapping in gnutella: A measurement study. In PAM (2004), vol. 3015 of
Lecture Notes in Computer Science, Springer, pp. 22–32.
[7] Liu, Liu, Xiao, Ni, and Zhang. Location-aware topology matching in P2P systems. In
INFOCOM: The Conference on Computer Communications, joint conference of the IEEE
Computer and Communications Societies (2004).
[8] Liu, Xiao, Liu, Ni, and Zhang. Location awareness in unstructured peer-to-peer systems.
IEEETPDS: IEEE Transactions on Parallel and Distributed Systems 16 (2005).
[9] Lua, K., Crowcroft, J., Pias, M., Sharma, R., and Lim, S. A survey and compari-
son of peer-to-peer overlay network schemes. Communications Surveys & Tutorials, IEEE
(2005), 72–93.
[10] Lv, Q., Cao, P., Cohen, E., Li, K., and Shenker, S. Search and replication in
unstructured peer-to-peer networks. In Proceedings of the 2002 International Conference
on Supercomputing (16th ICS’02) (June 2002), ACM, pp. 84–95.
[11] Merugu, S., Srinivasan, S., and Zegura, E. W. Adding structure to unstructured
peer-to-peer networks: The role of overlay topology. In NGC 2003, and ICQT 2003, Pro-
ceedings (2003), vol. 2816 of Lecture Notes in Computer Science, Springer, pp. 83–94.
[12] Newman. The structure and function of complex networks. SIREV: SIAM Review 45
(2003).
[13] Newman, M. E. J. Models of the small world: A review, May 09 2000.
7.2 Future Works 44
[14] Papadakis, C., Fragopoulou, P., Athanasopoulos, E., Dikaiakos, M. D.,
Labrinidis, A., and Markatos, E. A feedback-based approach to reduce duplicate
messages in unstructured peer-to-peer networks. In Integrated Research in GRID Comput-
ing. February 2007.
[15] Saroiu, S., Gummadi, P. K., and Gribble, S. D. A measurement study of peer-to-peer
file sharing systems. Tech. rep., July 23 2002.
[16] Stutzbach, D., and Rejaie, R. Capturing accurate snapshots of the gnutella network.
IEEE, pp. 2825–2830.
[17] Stutzbach, D., Rejaie, R., and Sen, S. Characterizing unstructured overlay topologies
in modern p2p file-sharing systems. In Internet Measurment Conference (2005), USENIX
Association, pp. 49–62.
[18] Zhenzhou, Z., Panos, K., and Spiridon, B. Dcmp: A distributed cycle minimization
protocol for peer-to-peer networks. In Parallel and Distributed Systems, IEEE Transactions
(2008), vol. 19, IEEE, pp. 363–367.