
1p2p, Fall 05

Topics in Database Systems: Data Management in Peer-to-Peer Systems

Search & Replication in Unstructured P2p

2p2p, Fall 05

Overview

Centralized

Constantly-updated directory hosted at central locations (do not scale well, updates, single points of failure)

Decentralized but structured

The overlay topology is highly controlled and files (or metadata/index) are not placed at random nodes but at specified locations

Decentralized and Unstructured

Peers connect in an ad-hoc fashion

The location of document/metadata is not controlled by the system

No guarantee that a search succeeds

No bounds on search time

3p2p, Fall 05

Overview

Blind Search and Variations

No information about the location of items

Informed Search

Maintain (localized) index information

Local and Routing Indexes

Trade-off: cost of maintaining the indexes (when joining/leaving/updating) vs cost for search

4p2p, Fall 05

Blind Search

Flood-based

Each node contacts its neighbors, which in turn contact their neighbors, until the item is located

Exponential search time

No guarantees

5p2p, Fall 05

Blind Search: Issues

BFS vs DFS: BFS gives better response time but generates more messages

Iterative vs Recursive (return path)

TTL (time-to-live) parameter

Cycles (duplicate messages)

Connectivity

Power-Law Topologies: the i-th most connected node has $\omega / i^{a}$ neighbors

6p2p, Fall 05

Gnutella: Summary

• Completely decentralized

• Hit rates are high

• High fault tolerance

• Adapts well and dynamically to changing peer populations

• Protocol causes high network traffic (e.g., 3.5 Mbps). For example:

– 4 connections C per peer, TTL = 7

– 1 ping packet can cause $2 \sum_{i=0}^{TTL} C (C-1)^{i} = 26{,}240$ packets

• No estimates on the duration of queries can be given

• No probability for successful queries can be given

• Topology is unknown, so algorithms cannot exploit it

• Free riding is a problem

• Reputation of peers is not addressed

• Simple and robust
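The packet count quoted on this slide can be reproduced with a few lines of Python. This is a minimal sketch that reads the slide's formula as 2 · Σ_{i=0}^{TTL} C · (C − 1)^i; the function name `flood_packets` is an illustrative assumption, not part of any Gnutella implementation.

```python
def flood_packets(c: int, ttl: int) -> int:
    """Packets caused by one ping: 2 * sum_{i=0}^{ttl} c * (c - 1)**i.

    c   -- connections per peer (C on the slide)
    ttl -- time-to-live of the ping
    """
    return 2 * sum(c * (c - 1) ** i for i in range(ttl + 1))


if __name__ == "__main__":
    # Slide example: 4 connections per peer, TTL = 7.
    print(flood_packets(4, 7))  # 26240
```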

7p2p, Fall 05

Summary and Comparison of Approaches

| System | Paradigm | Search Type | Search Cost (messages) | Autonomy |
|---|---|---|---|---|
| Gnutella | Breadth-first search on graph | String comparison | $2 \sum_{i=0}^{TTL} C (C-1)^{i}$ | very high |
| FreeNet | Depth-first search on graph | String comparison | O(log n) ? | very high |
| Chord | Implicit binary search trees | Equality | O(log n) | restricted |
| CAN | d-dimensional space | Equality | O(d n^(1/d)) | high |
| P-Grid | Binary prefix trees | Prefix | O(log n) | high |

8p2p, Fall 05

More on Search

Search Options

– Query Expressiveness (type of queries)

– Comprehensiveness (all or just the first (or k) results)

– Topology

– Data Placement

– Message Routing

9p2p, Fall 05

Comparison

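| | Gnutella | CAN | Others? |
|---|---|---|---|
| Expressiveness | | | |
| Comprehensiveness | | | |
| Autonomy | | | |
| Efficiency | | | |
| Robustness | | | |
| Topology | power law | | |
| Data Placement | arbitrary | | |
| Message Routing | flooding | | |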

10p2p, Fall 05

Comparison

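| | Gnutella | CAN | Others? |
|---|---|---|---|
| Expressiveness | | | |
| Comprehensiveness | | | |
| Autonomy | | | |
| Efficiency | | | |
| Robustness | | | |
| Topology | power law | grid | |
| Data Placement | arbitrary | hashing | |
| Message Routing | flooding | directed | |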

11p2p, Fall 05

• Client-Server performs well

– But not always feasible

• Ideal performance is often not the key issue!

• Things that flood-based systems do well

– Scaling

– Decentralization of visibility and liability

– Finding popular stuff (e.g., caching)

– Fancy local queries

• Things that flood-based systems do poorly

– Finding unpopular stuff

– Fancy distributed queries

– Guarantees about anything (answer quality, privacy, etc.)

12p2p, Fall 05

Blind Search Variations

Expanding Ring or Iterative Deepening:

Start BFS with a small TTL and repeat the BFS at increasing depths if the first BFS fails

Works well when there is some stop condition and a “small” flood will satisfy the query

Otherwise, it generates even bigger loads than standard flooding

Appropriate when hot objects are replicated more widely than cold objects

Modified-BFS:

Choose only a ratio of the neighbors (some random subset)
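The expanding-ring idea can be sketched as repeated depth-limited floods. The code below is a simplified illustration under stated assumptions: the overlay is an adjacency dict `graph`, `has_item` is a hypothetical predicate saying whether a node holds the object, the TTL schedule `(1, 2, 4, 8)` is arbitrary, and a flood stops as soon as the first holder is reached (a real flood keeps propagating elsewhere).

```python
from collections import deque


def bfs_within(graph, source, ttl, has_item):
    """Flood from source up to ttl hops.

    Returns (node holding the item or None, messages sent).
    Every forwarded copy counts as a message, duplicates included.
    """
    seen = {source}
    frontier = deque([(source, 0)])
    messages = 0
    while frontier:
        node, depth = frontier.popleft()
        if has_item(node):
            return node, messages
        if depth == ttl:
            continue  # TTL exhausted on this branch
        for nb in graph[node]:
            messages += 1
            if nb not in seen:  # duplicates are detected and not forwarded again
                seen.add(nb)
                frontier.append((nb, depth + 1))
    return None, messages


def expanding_ring(graph, source, has_item, ttls=(1, 2, 4, 8)):
    """Repeat the flood with growing TTLs until the item is found.

    Returns (found node, TTL that succeeded, total messages over all rounds).
    """
    total = 0
    for ttl in ttls:
        found, msgs = bfs_within(graph, source, ttl, has_item)
        total += msgs
        if found is not None:
            return found, ttl, total
    return None, None, total
```

On a path graph 0-1-2-3 with the object at node 3, the first two rounds fail and their messages are wasted, which illustrates why the scheme only pays off when a small flood usually satisfies the query.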

13p2p, Fall 05

Blind Search Methods

Random Walks:

The node that poses the query sends out k query messages to an equal number of randomly chosen neighbors

Each message follows its own path, at each step randomly choosing one neighbor to forward it to

Each path – a walker

Two methods to terminate each walker: TTL-based or the checking method (the walkers periodically check with the query source whether the stop condition has been met)

It reduces the number of messages to k × TTL in the worst case

Some kind of local load-balancing
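A TTL-terminated k-walker search can be sketched as follows. This is an illustration only: `graph` (adjacency lists), `has_item`, and the defaults for `k` and `ttl` are assumptions, and the checking method and state keeping are left out.

```python
import random


def k_random_walk(graph, source, has_item, k=4, ttl=16, rng=None):
    """Launch up to k walkers from source; each takes at most ttl steps.

    Returns (node holding the item or None, messages sent).
    The message count never exceeds k * ttl.
    """
    rng = rng or random.Random(0)
    neighbors = graph[source]
    # Each walker starts at a distinct, randomly chosen neighbor.
    walkers = rng.sample(neighbors, min(k, len(neighbors)))
    messages = 0
    for _ in range(ttl):
        messages += len(walkers)  # one message per walker per step
        for pos in walkers:
            if has_item(pos):
                return pos, messages
        # Each walker forwards the query to one random neighbor.
        walkers = [rng.choice(graph[pos]) for pos in walkers]
    return None, messages
```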

14p2p, Fall 05

Blind Search Methods

Random Walks

In addition, the protocol biases its walks towards high-degree nodes (choose the highest-degree neighbor)

15p2p, Fall 05

Topics in Database Systems: Data Management in Peer-to-Peer Systems

Q. Lv et al, “Search and Replication in Unstructured Peer-to-Peer Networks”, ICS’02

16p2p, Fall 05

Search and Replication in Unstructured Peer-to-Peer Networks

Type of replication depends on the search strategy used

(i) A number of blind-search variations of flooding

(ii) A number of (metadata) replication strategies

Evaluation Method: Study how they work for a number of different topologies and query distributions

17p2p, Fall 05

Methodology

Performance of search depends on

Network topology: graph formed by the p2p overlay network

Query distribution: the distribution of query frequencies for individual files

Replication: number of nodes that have a particular file

Assumption: fixed network topology and fixed query distribution

Results still hold, if one assumes that the time to complete a search is short compared to the time of change in network topology and in query distribution

18p2p, Fall 05

Network Topology

(1) Power-Law Random Graph

A 9239-node random graph

Node degrees follow a power law distribution

when ranked from the most connected to the least, the i-th ranked node has degree $\omega / i^{a}$, where ω is a constant

Once the node degrees are chosen, the nodes are connected randomly

19p2p, Fall 05

Network Topology

(2) Normal Random Graph

A 9836-node random graph

20p2p, Fall 05

Network Topology

(3) Gnutella Graph (Gnutella)

A 4736-node graph obtained in Oct 2000

Node degrees roughly follow a two-segment power law distribution

21p2p, Fall 05

Network Topology

(4) Two-Dimensional Grid (Grid)

A two dimensional 100x100 grid

22p2p, Fall 05

Query Distribution

Assume m objects

Let $q_i$ be the relative popularity of the i-th object (in terms of queries issued for it)

Values are normalized: $\sum_{i=1}^{m} q_i = 1$

(1) Uniform: All objects are equally popular

$q_i = 1/m$

(2) Zipf-like

$q_i \propto 1/i^{\alpha}$
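Both query distributions can be generated with one helper, since the uniform case is the Zipf-like case with exponent 0. A minimal sketch; the function name `zipf_like` is an illustrative assumption.

```python
def zipf_like(m: int, alpha: float) -> list:
    """Return q_1..q_m with q_i proportional to 1 / i**alpha, normalized to sum to 1."""
    weights = [1.0 / (i ** alpha) for i in range(1, m + 1)]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    q = zipf_like(5, 1.2)
    print([round(x, 3) for x in q])
```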

23p2p, Fall 05

Query Distribution & Replication

When the replication is uniform, the query distribution is irrelevant (since all objects are replicated by the same amount, search times are equivalent for both hot and cold items)

When the query distribution is uniform, all three replication distributions are equivalent (uniform!)

Thus, three relevant combinations query-distribution/replication

(1) Uniform/Uniform

(2) Zipf-like/Proportional

(3) Zipf-like/Square-root

24p2p, Fall 05

Metrics

Pr(success): probability of finding the queried object before the search terminates

#hops: delay in finding an object as measured in number of hops

25p2p, Fall 05

Metrics

#msgs per node: Overhead of an algorithm as measured in average number of search messages each node in the p2p has to process

#nodes visited

Percentage of message duplication

Peak #msgs: the number of messages that the busiest node has to process (to identify hot spots)

These are per-query measures

Aggregated performance measure: each per-query measure is weighted by the probability of that query

26p2p, Fall 05

Limitation of Flooding

There are many duplicate messages (due to cycles) particularly in high connectivity graphs

Multiple copies of a query are sent to a node by multiple neighbors

Avoiding cycles decreases the number of links

Duplicated messages can be detected and not forwarded - BUT, the number of duplicate messages can still be excessive and worsens as TTL increases

Choice of TTL

Too low: the node may not find the object even if it exists. Too high: the query burdens the network unnecessarily

27p2p, Fall 05

Limitation of Flooding: Comparison of the topologies

Power-law and Gnutella-style graphs particularly bad with flooding

Highly connected nodes mean more duplicate messages, because many nodes’ neighbors overlap

Random graph is best, because in a truly random graph the duplication ratio (the likelihood that the next node already received the query) is the same as the fraction of nodes visited so far, as long as that fraction is small

Random graph also gives better load distribution among nodes

28p2p, Fall 05

Random Walks

Experiments show that

16 to 64 walkers give good results

Checking once every 4th step strikes a good balance between the overhead of the checking messages and the benefits of checking

Keeping state (when the same query reaches a node, the node chooses randomly a different neighbor to forward it)

Improves Random and Grid by reducing up to 30% the message overhead and up to 30% the number of hops

Small improvements for Gnutella and PLRG

29p2p, Fall 05

Random Walks

When compared to flooding:

The 32-walker random walk reduces message overhead by roughly two orders of magnitude for all queries across all network topologies at the expense of a slight increase in the number of hops (increasing from 2-6 to 4-15)

When compared to expanding ring,

The 32-walkers random walk outperforms expanding ring as well, particularly in PLRG and Gnutella graphs

30p2p, Fall 05

Principles of Search

Adaptive termination is very important

Expanding ring or the checking method

Message duplication should be minimized

Preferably, each query should visit a node just once

Granularity of the coverage should be small

Each additional step should not significantly increase the number of nodes visited

31p2p, Fall 05

Replication

32p2p, Fall 05

Types of Replication

Two types of replication

Metadata/Index: replicate index entries

Data/Document replication: replicate the actual data (e.g., music files)

33p2p, Fall 05

Types of Replication

Caching vs Replication

Cache: Store data retrieved from a previous request (client-initiated)

Replication: More proactive, a copy of a data item may be stored at a node even if the node has not requested it

34p2p, Fall 05

Reasons for Replication

Reasons for replication

Performance

load balancing

locality: place copies close to the requestor

geographic locality (more choices for the next step in search)

reduce number of hops

Availability

In case of failures

Peer departures

Besides storage, there is a cost associated with replication: Consistency Maintenance

Make reads faster at the expense of slower writes

35p2p, Fall 05

• No proactive replication (Gnutella)

– Hosts store and serve only what they requested

– A copy can be found only by probing a host with a copy

• Proactive replication of “keys” (= metadata + pointer) for search efficiency (FastTrack, DHTs)

• Proactive replication of “copies” for search and download efficiency, anonymity (Freenet)

36p2p, Fall 05

Issues

Which items (data/metadata) to replicate

Based on popularity

In traditional distributed systems, also rate of read/write

cost benefit:

the ratio: read-savings/write-increase

Where to replicate (allocation schema)

More Later

37p2p, Fall 05

Issues

How/When to update

Both data items and metadata

38p2p, Fall 05

“Database-Flavored” Replication Control Protocols

Let's assume the existence of a data item x with copies x1, x2, …, xn

x: logical data item

xi’s: physical data items

A replication control protocol is responsible for mapping each read/write on a logical data item (R(x)/W(x)) to a set of read/writes on a (possibly) proper subset of the physical data item copies of x

39p2p, Fall 05

One Copy Serializability

Correctness

A DBMS for a replicated database should behave like a DBMS managing a one-copy (i.e., non-replicated) database insofar as users can tell

One-copy serializable (1SR)

the schedule of transactions on a replicated database must be equivalent to a serial execution of those transactions on a one-copy database

One-copy schedule: replace operations on data copies with operations on data items

40p2p, Fall 05

ROWA

Read One/Write All (ROWA)

A replication control protocol that maps each read to only one copy of the item and each write to a set of writes on all physical data item copies.

If even one of the copies is unavailable, an update transaction cannot terminate

41p2p, Fall 05

Write-All-Available

Write-all-available

A replication control protocol that maps each read to only one copy of the item and each write to a set of writes on all available physical data item copies.

42p2p, Fall 05

Quorum-Based Voting

Read quorum Vr and a write quorum Vw to read or write a data item

If a given data item has a total of V votes, the quorums have to obey the following rules:

1. Vr + Vw > V

2. Vw > V/2

Rule 1 ensures that a data item is not read and written by two transactions concurrently (R/W)

Rule 2 ensures that two write operations from two transactions cannot occur concurrently on the same data item (W/W)
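The two quorum rules are easy to check mechanically. The sketch below assumes one vote per copy unless stated otherwise; the function names are illustrative. ROWA shows up as the special case Vr = 1, Vw = V.

```python
def valid_quorums(v_total: int, v_read: int, v_write: int) -> bool:
    """Check the two voting rules: Vr + Vw > V and Vw > V/2."""
    rule1 = v_read + v_write > v_total   # read and write quorums always intersect
    rule2 = 2 * v_write > v_total        # any two write quorums intersect
    return rule1 and rule2


def all_valid_quorums(v_total: int) -> list:
    """Enumerate every (Vr, Vw) pair that satisfies both rules."""
    return [
        (vr, vw)
        for vr in range(1, v_total + 1)
        for vw in range(1, v_total + 1)
        if valid_quorums(v_total, vr, vw)
    ]
```

Smaller read quorums make reads cheaper but force larger write quorums, which is the same read/write trade-off noted for replication in general.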

43p2p, Fall 05

Distributing Writes

Immediate writes

Deferred writes

Access only one copy of the data item; the distribution of writes to other sites is delayed until the transaction has terminated and is ready to commit.

It maintains an intention list of deferred updates

After the transaction terminates, it sends the appropriate portion of the intention list to each site that contains replicated copies

Optimizations – aborts cost less – may delay commitment – delays the detection of conflicts

Primary or master copy

Updates at a single copy per item

44p2p, Fall 05

Eager vs Lazy Replication

Eager replication: keeps all replicas synchronized by updating all replicas in a single transaction

Lazy replication: asynchronously propagate replica updates to other nodes after the replicating transaction commits

In p2p, lazy replication (or soft state)

45p2p, Fall 05

Update Propagation

Who initiates the update:

Push by the server item (copy) that changes

Pull by the client holding the copy

When

Periodic

Immediate

Lazy: when an inconsistency is detected

Threshold-based: Freshness (e.g., number of updates or actual time)

Value

Expiration-Time: Items expire (become invalid) after that time (most often used in p2p)

Stateless or Stateful (the “item-owners” know which nodes hold copies of the item)

46p2p, Fall 05

Replication & Structured P2P

47p2p, Fall 05

CHORD

Invariant to guarantee correctness of lookups:

Keep successor nodes up to date

Method: Each node maintains a successor list of its “r” nearest successors on the Chord ring

Why? Availability

How to keep it consistent: Lazily, through periodic stabilization

Metadata replication or redundancy

48p2p, Fall 05

CHORD

Method: Replicate data associated with a key at the k nodes succeeding the key

Why? Availability

Data replication

49p2p, Fall 05

CAN

Multiple realities

With r realities, each node is assigned r coordinate zones, one in every reality, and holds r independent neighbor sets

Replicate the hash table at each reality

Availability: a lookup fails only if the responsible nodes in all r realities fail

Performance: Better search, choose to forward the query to the neighbor with coordinates closest to the destination

Metadata replication

50p2p, Fall 05

CAN

Overloading coordinate zones

Multiple nodes may share a zone

The hash table may be replicated among zones

Higher availability

Performance: choices in the number of neighbors, can select nodes closer in latency

Cost for Consistency


51p2p, Fall 05

CAN

Multiple Hash Functions

Use k different hash functions to map a single key onto k points in the coordinate space

Availability: fail only if all k replicas are unavailable

Performance: choose to send the query to the closest node in the coordinate space, or send it to all k nodes in parallel (k parallel searches)

Cost for Consistency

Query traffic (if parallel searches)


52p2p, Fall 05

CAN

Hot-spot Replication

A node that finds it is being overloaded by requests for a particular data key can replicate this key at each of its neighboring nodes

Then with a certain probability can choose to either satisfy the request or forward it

Performance: load balancing


53p2p, Fall 05

CAN

Caching

Each node maintains a cache of the data keys it recently accessed

Before forwarding a request, it first checks whether the requested key is in its cache, and if so, it can satisfy the request without forwarding it any further

Number of cache entries per key grows in direct proportion to its popularity


54p2p, Fall 05

Replication Theory: Replica Allocation

Policies

55p2p, Fall 05

Question: how to use replication to improve search efficiency in unstructured networks with a proactive replication mechanism?

Replication: Allocation Scheme

How many copies of each object so that the search overhead for the object is minimized, assuming that the total amount of storage for objects in the network is fixed

56p2p, Fall 05

Replication Theory

Assume m objects and n nodes. Each object i is replicated on $r_i$ distinct nodes, and the total number of object copies stored is R, that is

$\sum_{i=1}^{m} r_i = R$

Also, $p_i = r_i / R$

Assume that object i is requested with relative rate $q_i$, normalized so that

$\sum_{i=1}^{m} q_i = 1$

For convenience, assume $1 \ll r_i \le n$ and that $q_1 \ge q_2 \ge \dots \ge q_m$

57p2p, Fall 05

Replication Theory

Assume that searches go on until a copy is found

Searches consist of

randomly probing sites until the desired object is found: search at each step draws a node uniformly at random and asks for a copy

58p2p, Fall 05

Search Example

2 probes 4 probes

59p2p, Fall 05

Replication Theory

The probability Pr(k) that the object is found at the k-th probe is given by

$\Pr(k) = \Pr(\text{not found in the previous } k-1 \text{ probes}) \cdot \Pr(\text{found in the } k\text{-th probe}) = (1 - r_i/n)^{k-1} \cdot r_i/n$

k (search size: the step at which the item is found) is a random variable with a geometric distribution with parameter $\theta = r_i/n$, so its expectation is $1/\theta = n/r_i$
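The expectation n/r_i can be checked numerically by summing k · Pr(k) directly. A small sketch; the function names and the truncation point `max_k` are illustrative assumptions (the tail beyond it is negligible for the parameters used here).

```python
def pr_found_at(k: int, r: int, n: int) -> float:
    """Probability that the object is first found at the k-th random probe."""
    theta = r / n                        # fraction of nodes holding a replica
    return (1 - theta) ** (k - 1) * theta


def expected_search_size(r: int, n: int, max_k: int = 10_000) -> float:
    """Truncated sum of k * Pr(k); approaches n / r as max_k grows."""
    return sum(k * pr_found_at(k, r, n) for k in range(1, max_k + 1))


if __name__ == "__main__":
    # 10 replicas among 100 nodes: the expectation should be close to 100 / 10.
    print(expected_search_size(10, 100))
```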

60p2p, Fall 05

Replication Theory

$A_i$: Expectation (average search size) for object i is the inverse of the fraction of sites that have replicas of the object

$A_i = n/r_i$

The average search size A over all objects (average number of nodes probed per object query)

$A = \sum_i q_i A_i = n \sum_i q_i / r_i$

Minimize: $A = n \sum_i q_i / r_i$

61p2p, Fall 05

Replication Theory

If we have no limit on ri, replicate everything everywhere

Then, the average search size

$A_i = n/r_i = 1$

Search becomes trivial

How to allocate these R replicas among the m objects: how many replicas per object

Assume a limit on R and that the average number of replicas per site ρ = R/n is fixed

62p2p, Fall 05

Uniform Replication

Create the same number of replicas for each object

$r_i = R/m$

Average search size for uniform replication

$A_i = n/r_i = m/\rho$

$A_{uniform} = \sum_i q_i \, m/\rho = m/\rho \;(= m n / R)$

Which is independent of the query distribution

It makes sense to allocate more copies to objects that are frequently queried, this should reduce the search size for the more popular objects

63p2p, Fall 05

Proportional Replication

Create a number of replicas for each object proportional to the query rate

$r_i = R q_i$

64p2p, Fall 05

Uniform and Proportional Replication

Summary:

• Uniform Allocation: $p_i = 1/m$

– Simple, resources are divided equally

• Proportional Allocation: $p_i = q_i$

– “Fair”, resources per item proportional to demand

– Reflects current P2P practices

Example: 3 items, $q_1 = 1/2$, $q_2 = 1/3$, $q_3 = 1/6$

[Figure: Uniform vs. Proportional allocation for the example]

65p2p, Fall 05

Proportional Replication

Number of replicas for each object:

$r_i = R q_i$

Average search size for proportional replication

$A_i = n/r_i = n/(R q_i)$

$A_{proportional} = \sum_i q_i \, n/(R q_i) = m n / R = m/\rho = A_{uniform}$

again independent of the query distribution

Why? Objects whose query rates are greater than average (> 1/m) do better with proportional, and the others do better with uniform

The weighted average balances out to be the same

So what is the optimal way to allocate replicas so that A is minimized?
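The balancing effect can be seen on the three-item example (q = 1/2, 1/3, 1/6). The sketch below uses exact fractions; the values n = 100 and R = 60 and the function names are illustrative assumptions, not figures from the slides.

```python
from fractions import Fraction


def per_object_search_size(r, n):
    """A_i = n / r_i for every object."""
    return [Fraction(n) / ri for ri in r]


def avg_search_size(q, r, n):
    """A = sum_i q_i * A_i = n * sum_i q_i / r_i."""
    return sum(qi * ai for qi, ai in zip(q, per_object_search_size(r, n)))


n, R = 100, 60                                   # assumed sizes for illustration
q = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
m = len(q)

uniform = [Fraction(R, m)] * m                   # r_i = R / m
proportional = [R * qi for qi in q]              # r_i = R * q_i

if __name__ == "__main__":
    print(per_object_search_size(uniform, n))        # same A_i for every object
    print(per_object_search_size(proportional, n))   # popular objects found faster
    print(avg_search_size(q, uniform, n), avg_search_size(q, proportional, n))
```

With these numbers the most popular object is found faster under proportional allocation and the least popular one slower, while the query-weighted averages coincide at m/ρ.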

66p2p, Fall 05

Space of Possible Allocations

$q_{i+1}/q_i$ ? $p_{i+1}/p_i$

As the query rate decreases, how does the ratio of allocated replicas behave?

Reasonable: $p_{i+1}/p_i \le 1$

($= 1$ for uniform)

67p2p, Fall 05

Space of Possible Allocations

Definition: Allocation $p_1, p_2, p_3, \dots, p_m$ is “in-between” Uniform and Proportional if

for $1 < i < m$, $q_{i+1}/q_i < p_{i+1}/p_i < 1$

($p_{i+1}/p_i = 1$ for Uniform, $= q_{i+1}/q_i$ for Proportional; we want to favor popular objects, but not too much)

Theorem 1: All (strictly) in-between strategies are (strictly) better than Uniform and Proportional

Theorem 2: p is worse than Uniform/Proportional if for all i, $p_{i+1}/p_i > 1$ (more popular gets less) OR

for all i, $q_{i+1}/q_i > p_{i+1}/p_i$ (less popular gets less than “fair share”)

Proportional and Uniform are the worst “reasonable” strategies

68p2p, Fall 05

[Figure: space of allocations on 2 items, with axes $q_2/q_1$ and $p_2/p_1$. Regions: worse than prop/uni (more popular item gets less); worse than prop/uni (more popular item gets more than its proportional share); better than prop/uni. Marked allocations: Uniform, Proportional, SR.]

69p2p, Fall 05

So, what is the best strategy?

70p2p, Fall 05

Square-Root Replication

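Find $r_i$ that minimizes A,

$A = \sum_i q_i A_i = n \sum_i q_i / r_i$

This is achieved for $r_i = \lambda \sqrt{q_i}$ where $\lambda = R / \sum_i \sqrt{q_i}$

Then the average search size is

$A_{optimal} = \frac{1}{\rho} \left( \sum_i \sqrt{q_i} \right)^2$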
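One way to arrive at the square-root allocation is a Lagrange-multiplier argument. The derivation below is a sketch under simplifying assumptions: the $r_i$ are treated as continuous, the bound $r_i \le n$ is ignored, and the multiplier $\mu$ is a symbol introduced here for illustration.

```latex
\begin{aligned}
\mathcal{L}(r_1,\dots,r_m,\mu) &= n \sum_{i} \frac{q_i}{r_i} + \mu \Big( \sum_{i} r_i - R \Big) \\
\frac{\partial \mathcal{L}}{\partial r_i} &= -\,n\,\frac{q_i}{r_i^{2}} + \mu = 0
  \;\Longrightarrow\; r_i = \sqrt{n q_i / \mu} \;\propto\; \sqrt{q_i} \\
\sum_i r_i = R &\;\Longrightarrow\; r_i = \lambda \sqrt{q_i}, \qquad \lambda = \frac{R}{\sum_j \sqrt{q_j}} \\
A &= n \sum_i \frac{q_i}{\lambda \sqrt{q_i}} = \frac{n}{\lambda} \sum_i \sqrt{q_i}
   = \frac{n}{R} \Big( \sum_i \sqrt{q_i} \Big)^{2} = \frac{1}{\rho} \Big( \sum_i \sqrt{q_i} \Big)^{2}
\end{aligned}
```

Since each term $q_i / r_i$ is convex in $r_i$, the stationary point is a minimum.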

71p2p, Fall 05

How much can we gain by using SR ?

[Figure: $A_{uniform}/A_{SR}$ for Zipf-like query rates $q_i \propto i^{-w}$]

72p2p, Fall 05

Other Metrics: Discussion

Utilization rate, the rate of requests that a replica of an object i receives

$U_i = R q_i / r_i$

For uniform replication, all objects have the same average search size, but replicas have utilization rates proportional to their query rates

Proportional replication achieves perfect load balancing with all replicas having the same utilization rate, but average search sizes vary with more popular objects having smaller average search sizes than less popular ones

73p2p, Fall 05

Replication: Summary

74p2p, Fall 05

Pareto Distribution (for the queries)

Pareto principle: 80-20 rule

80% of the wealth owned by 20% of the population

Zipf: what is the size of the rth ranked

Pareto: how many have size > r

75p2p, Fall 05

Replication (summary)

Each object i is replicated on $r_i$ nodes and the total number of object copies stored is R, that is

$\sum_{i=1}^{m} r_i = R$

(1) Uniform: All objects are replicated at the same number of nodes

$r_i = R/m$

(2) Proportional: The replication of an object is proportional to the query probability of the object

$r_i \propto q_i$

(3) Square-root: The replication of an object i is proportional to the square root of its query probability $q_i$

$r_i \propto \sqrt{q_i}$
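The three allocations can be compared numerically under the random-probing model. The sketch below is illustrative only: it allows fractional replica counts, and the parameters (m = 100 objects, n = 10,000 nodes, R = 20,000 copies, Zipf exponent 1.2) and function names are assumptions.

```python
import math


def allocate(q, R, strategy):
    """Split R copies among objects; r_i proportional to 1, q_i, or sqrt(q_i)."""
    if strategy == "uniform":
        w = [1.0] * len(q)
    elif strategy == "proportional":
        w = list(q)
    elif strategy == "square-root":
        w = [math.sqrt(x) for x in q]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    total = sum(w)
    return [R * x / total for x in w]


def avg_search_size(q, r, n):
    """A = n * sum_i q_i / r_i (random-probe model)."""
    return n * sum(qi / ri for qi, ri in zip(q, r))


def zipf_like(m, alpha):
    w = [1.0 / i ** alpha for i in range(1, m + 1)]
    s = sum(w)
    return [x / s for x in w]


if __name__ == "__main__":
    m, n, R = 100, 10_000, 20_000
    q = zipf_like(m, 1.2)
    for strategy in ("uniform", "proportional", "square-root"):
        print(strategy, avg_search_size(q, allocate(q, R, strategy), n))
```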

76p2p, Fall 05

What is the search size of a query ?

Soluble queries: number of probes until answer is found.

Insoluble queries: maximum search size

Query is soluble if there are sufficiently many copies of the item.

Query is insoluble if item is rare or non existent.

Assumption that there is at least one copy per object

77p2p, Fall 05

• SR is best for soluble queries

• Uniform minimizes cost of insoluble queries

OPT is a hybrid of Uniform and SR

Tuned to balance cost of soluble and insoluble queries.

What is the optimal strategy?

78p2p, Fall 05

[Figure: Uniform vs. SR for 10^4 items with Zipf-like query rates (w = 1.5); panels: All Soluble, 85% Soluble, All Insoluble]

79p2p, Fall 05

We now know what we need.

How do we get there?

80p2p, Fall 05

Replication Algorithms

Desired properties of algorithm:

• Fully distributed, where peers communicate through random probes; minimal bookkeeping; and no more communication than what is needed for search.

• Converge to/obtain SR allocation when query rates remain steady.

Uniform and Proportional are “easy”

– Uniform: When an item is created, replicate its key in a fixed number of hosts.

– Proportional: For each query, replicate the key in a fixed number of hosts (need to know or estimate the query rate)

81p2p, Fall 05

Replication - Implementation

Two strategies are popular

Owner Replication

When a search is successful, the object is stored at the requestor node only (used in Gnutella)

Path Replication

When a search succeeds, the object is stored at all nodes along the path from the requestor node to the provider node (used in Freenet)

Following the reverse path back to the requestor

82p2p, Fall 05

Achieving Square-Root Replication

How can we achieve square-root replication in practice?

Assume that each query keeps track of the search size

Each time a query is finished the object is copied to a number of sites proportional to the number of probes

On average, object i will be replicated at $c \cdot n / r_i$ sites each time a query for it is issued (for some constant c)

It can be shown that this gives square root
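A steady-state balance argument suggests why this rule yields square-root replication. The sketch below assumes that every replica is deleted at the same rate regardless of which object it belongs to; the constant $d$ is a symbol introduced here for illustration, and the total query rate is absorbed into the constants.

```latex
\begin{aligned}
\text{creation rate of copies of } i &\;\propto\; q_i \cdot c\,\frac{n}{r_i} \\
\text{deletion rate of copies of } i &\;\propto\; d\, r_i \\
\text{steady state: } q_i\, c\, \frac{n}{r_i} = d\, r_i
  &\;\Longrightarrow\; r_i^{2} = \frac{c\,n}{d}\, q_i
  \;\Longrightarrow\; r_i \propto \sqrt{q_i}
\end{aligned}
```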

83p2p, Fall 05

Replication - Conclusion

Thus, for Square-root replication

an object should be replicated at a number of nodes that is proportional to the number of probes that the search required

84p2p, Fall 05


If a p2p system uses k-walkers,

the number of nodes between the requestor and the provider node is 1/k of the total nodes visited (number of probes)

Then, path replication should result in square-root replication

Problem: Tends to place replicas at nodes that are topologically along the same path

85p2p, Fall 05


Random Replication

When a search succeeds, we count the number of nodes on the path between the requestor and the provider

Say p

Then, randomly pick p of the nodes that the k walkers visited to replicate the object

Harder to implement

86p2p, Fall 05

Achieving Square-Root Replication

What about replica deletion?

Steady state: the replica creation rate equals the deletion rate

The lifetime of replicas must be independent of object identity or query rate

FIFO or random deletion is OK

LRU or LFU is not

87p2p, Fall 05

Replication: Evaluation

Study the three replication strategies in the Random graph network topology

Simulation Details

• Place the m distinct objects randomly into the network

• Query generator generates queries according to a Poisson process at 5 queries/sec

• Zipf-distribution of queries among the m objects (with a = 1.2)

• For each query, the initiator is chosen randomly

• Then a 32-walker random walk with state keeping and checking every 4 steps

• Each site stores at most objAllow (40) objects

• Random Deletion

• Warm-up period of 10,000 secs

• Snapshots every 2,000 query chunks

88p2p, Fall 05

Replication: Evaluation

For each replication strategy

What kind of replication ratio distribution does the strategy generate?

What is the average number of messages per node in a system using the strategy

What is the distribution of number of hops in a system using the strategy

89p2p, Fall 05

Evaluation: Replication Ratio

Both path and random replication generate replication ratios quite close to the square root of query rates

90p2p, Fall 05

Evaluation: Messages

Path replication and random replication reduce the overall message traffic by a factor of 3 to 4

91p2p, Fall 05

Evaluation: Hops

Much of the traffic reduction comes from reducing the number of hops

Path and random, better than owner

For example, queries that finish within 4 hops: 71% owner, 86% path, 91% random

92p2p, Fall 05

Summary

• Random Search/replication Model: probes to “random” hosts

• Proportional allocation – current practice

• Uniform allocation – best for insoluble queries

• Soluble queries:

– Proportional and Uniform allocations are two extremes with the same average performance

– Square-Root allocation minimizes Average Search Size

• OPT (all queries) lies between SR and Uniform

• SR/OPT allocation can be realized by simple algorithms.

93p2p, Fall 05

Replication & Unstructured P2P

epidemic algorithms

94p2p, Fall 05

Replication Policy

How many copies

Where (owner, path, random path)

Update Policy

Synchronous vs Asynchronous

Master Copy

95p2p, Fall 05

Methods for spreading updates:

Push: originate from the site where the update appeared

To reach the sites that hold copies

Pull: the sites holding copies contact the master site

Expiration times

Epidemics for spreading updates

96p2p, Fall 05

Update at a single site

Randomized algorithms for distributing updates and driving replicas towards consistency

Ensure that the effect of every update is eventually reflected to all replicas:

Sites become fully consistent only when all updating activity has stopped and the system has become quiescent

Analogous to epidemics

A. Demers et al, Epidemic Algorithms for Replicated Database Maintenance, SOSP 87

97p2p, Fall 05


Direct mail: each new update is immediately mailed from its originating site to all other sites

Timely & reasonably efficient

Not all sites know all other sites

Mails may be lost

Anti-entropy: every site regularly chooses another site at random and by exchanging content resolves any differences between them

Extremely reliable but requires exchanging content and resolving updates

Propagates updates much more slowly than direct mail

98p2p, Fall 05


Rumor mongering:

Sites are initially “ignorant”; when a site receives a new update it becomes a “hot rumor”

While a site holds a hot rumor, it periodically chooses another site at random and ensures that the other site has seen the update

When a site has tried to share a hot rumor with too many sites that have already seen it, the site stops treating the rumor as hot and retains the update without propagating it further

Rumor cycles can be more frequent than anti-entropy cycles, because they require fewer resources at each site, but there is a chance that an update will not reach all sites

99p2p, Fall 05

Anti-entropy and rumor spreading are examples of epidemic algorithms

Three types of sites:

Infective: A site that holds an update and is willing to share it

Susceptible: A site that has not yet received an update

Removed: A site that has received an update but is no longer willing to share

Anti-entropy: simple epidemic where all sites are always either infective or susceptible

100p2p, Fall 05

A set S of n sites, each storing a copy of a database

The database copy at site $s \in S$ is a time-varying partial function

$s.\mathrm{ValueOf}: K \rightarrow (u: V \times t: T)$

where K is a set of keys, V a set of values, and T a set of timestamps (totally ordered by <)

V contains the element NIL

s.ValueOf[k] = (NIL, t): the item with key k has been deleted from the database

Assume just one item, so that

$s.\mathrm{ValueOf} \in (u: V \times t: T)$

thus, an ordered pair consisting of a value and a timestamp

The first component may be NIL, indicating that the item was deleted as of the time indicated by the second component

101p2p, Fall 05

The goal of the update distribution process is to drive the system towards

$\forall s, s' \in S: s.\mathrm{ValueOf} = s'.\mathrm{ValueOf}$

Operation invoked to update the database

$\mathrm{Update}[u: V] \equiv s.\mathrm{ValueOf} \leftarrow (u, \mathrm{Now}[\,])$

102p2p, Fall 05

Direct Mail

At the site s where an update occurs:

For each s’ ∈ S

PostMail[to: s’, msg: (“Update”, s.ValueOf)]

Each site s’ receiving the update message (“Update”, (u, t)):

If s’.ValueOf.t < t

s’.ValueOf ← (u, t)

The complete set S must be known to s (stateful server)

PostMail messages are queued so that the server is not delayed (asynchronous), but may fail when queues overflow or their destinations are inaccessible for a long time

n (number of sites) messages per update

traffic proportional to n and the average distance between sites

s originator of the update

s’ receiver of the update
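The direct-mail rules can be sketched in Python. This is a simplified, in-memory illustration: the `Site` class and `direct_mail` function are hypothetical names, mail delivery is modeled as a direct method call that never fails, and timestamps are plain numbers supplied by the caller.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """One replica holding a single item as a (value, timestamp) pair."""
    name: str
    value: object = None
    timestamp: float = float("-inf")

    def update(self, value, now):
        """Local update: s.ValueOf <- (u, Now[])."""
        self.value, self.timestamp = value, now

    def receive(self, value, t):
        """Apply a mailed update only if it is newer than the local copy."""
        if self.timestamp < t:
            self.value, self.timestamp = value, t


def direct_mail(origin, all_sites):
    """Mail origin's current value to every site in S; returns messages sent."""
    sent = 0
    for site in all_sites:            # the originator must know the whole set S
        site.receive(origin.value, origin.timestamp)
        sent += 1
    return sent
```

The message count equals the number of sites, matching the n messages per update noted on the slide; the copy mailed back to the originator is a no-op because its timestamp is not newer.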

103p2p, Fall 05

Anti-Entropy

At each site s periodically execute:

For some s’ ∈ S

ResolveDifference[s, s’]

Three ways to execute ResolveDifference:

Push (sender (server)-driven)

If s.ValueOf.t > s’.ValueOf.t

s’.ValueOf ← s.ValueOf

Pull (receiver (client)-driven)

If s.ValueOf.t < s’.ValueOf.t

s.ValueOf ← s’.ValueOf

Push-Pull

If s.ValueOf.t > s’.ValueOf.t: s’.ValueOf ← s.ValueOf

If s.ValueOf.t < s’.ValueOf.t: s.ValueOf ← s’.ValueOf

[Figure: sites s and s’. Push: s pushes its value to s’. Pull: s pulls from s’ and gets the value of s’.]
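The three variants of ResolveDifference can be sketched with each site's copy modeled as a small dict. The representation (`{"value": ..., "t": ...}`) and the function name `resolve_difference` are assumptions made for illustration.

```python
def resolve_difference(s, s_prime, mode="push-pull"):
    """Reconcile two single-item replicas, each a dict with keys "value" and "t".

    push      -- s overwrites s' if s is newer
    pull      -- s adopts the copy of s' if s' is newer
    push-pull -- whichever side is older adopts the newer copy
    """
    if mode not in ("push", "pull", "push-pull"):
        raise ValueError(f"unknown mode: {mode}")
    if mode != "pull" and s["t"] > s_prime["t"]:
        s_prime.update(s)      # push: s overwrites the older copy at s'
    elif mode != "push" and s["t"] < s_prime["t"]:
        s.update(s_prime)      # pull: s adopts the newer copy from s'
```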

104p2p, Fall 05

Anti-Entropy

Assume that

Site s’ is chosen uniformly at random from the set S

Each site executes the anti-entropy algorithm once per period

It can be proved that

An update will eventually infect the entire population

Starting from a single affected site, this can be achieved in time proportional to the log of the population size


Anti-Entropy

Let $p_i$ be the probability of a site remaining susceptible after the $i$-th cycle of anti-entropy

For pull,

A site remains susceptible after cycle $i+1$ if (a) it was susceptible after cycle $i$ and (b) it contacted a susceptible site in cycle $i+1$

$p_{i+1} = (p_i)^2$

For push,

A site remains susceptible after cycle $i+1$ if (a) it was susceptible after cycle $i$ and (b) no infectious site chose to contact it in cycle $i+1$

$p_{i+1} = p_i \left(1 - \frac{1}{n}\right)^{n(1 - p_i)}$

$1 - 1/n$: probability that the site is not contacted by a given node

$n(1 - p_i)$: number of infectious nodes at cycle $i$

Pull is preferable to push
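To see why pull is preferable once few susceptible sites remain, the two recurrences can be iterated numerically. The function below is only an illustration; its name, the stopping threshold `target`, and the safety bound `max_cycles` are assumptions for the example.

```python
def cycles_until(p0: float, target: float, n: int, mode: str,
                 max_cycles: int = 10_000) -> int:
    """Count cycles until the susceptible probability drops below target.

    pull: p_{i+1} = p_i ** 2
    push: p_{i+1} = p_i * (1 - 1/n) ** (n * (1 - p_i))
    """
    if mode not in ("pull", "push"):
        raise ValueError(f"unknown mode: {mode}")
    p, cycles = p0, 0
    while p >= target and cycles < max_cycles:
        if mode == "pull":
            p = p * p
        else:
            p = p * (1.0 - 1.0 / n) ** (n * (1.0 - p))
        cycles += 1
    return cycles


if __name__ == "__main__":
    for mode in ("pull", "push"):
        print(mode, cycles_until(p0=0.1, target=1e-6, n=1000, mode=mode))
```

Under pull the probability is squared every cycle, while under push it shrinks by a factor of roughly 1/e per cycle when p is small, so push needs noticeably more cycles to reach the same threshold.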


Anti-Entropy

Naive approach: compare the whole database instances, which must be sent over the network

Use checksums

But what about recent updates known only at a few sites?

+ A list of recent updates (now - timestamp < threshold τ)

Compare the recent updates first, update the checksums and then compare the checksums; the choice of τ matters

Maintain an inverted list of updates ordered by timestamp

Perform anti-entropy by exchanging updates in reverse timestamp order until the checksums agree

Sends only the updates; the question is when to stop
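The last variant (exchange updates newest first until the checksums agree) can be sketched as below. This is a simplified, assumption-laden model: both databases are local dictionaries, the checksum is a SHA-256 over the sorted entries, and the helper names are hypothetical.

```python
import hashlib
from typing import Dict, Tuple

Db = Dict[str, Tuple[str, int]]  # key -> (value, timestamp)


def checksum(db: Db) -> str:
    """Checksum over the whole database contents, independent of insert order."""
    h = hashlib.sha256()
    for key in sorted(db):
        value, ts = db[key]
        h.update(repr((key, value, ts)).encode())
    return h.hexdigest()


def exchange_newest_first(a: Db, b: Db) -> int:
    """Exchange entries in reverse timestamp order until the checksums agree.

    Returns the number of entries that had to be exchanged.
    """
    entries = sorted(
        {(k, v, t) for db in (a, b) for k, (v, t) in db.items()},
        key=lambda e: e[2],
        reverse=True,
    )
    exchanged = 0
    for key, value, ts in entries:
        if checksum(a) == checksum(b):
            break  # the copies already agree: older entries are never sent
        for db in (a, b):
            if key not in db or db[key][1] < ts:
                db[key] = (value, ts)
        exchanged += 1
    return exchanged
```

When the two sites differ only in a few recent updates, the loop stops after those few entries and the bulk of the (older, identical) database is never transmitted.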


Complex Epidemics: Rumor Spreading

Initial State: n individuals initially inactive (susceptible)

Rumor planting & spreading:

We plant a rumor with one person who becomes active (infective), phoning other people at random and sharing the rumor

Every person bearing the rumor also becomes active and likewise shares the rumor

When an active individual makes an unnecessary phone call (the recipient already knows the rumor), then with probability 1/k the active individual loses interest in sharing the rumor (becomes removed)

We would like to know:

How fast the system converges to an inactive state (no one is infective)

The percentage of people that know the rumor when the inactive state is reached


Complex Epidemics: Rumor Spreading

Let s, i, r be the fractions of individuals that are susceptible, infective and removed

$s + i + r = 1$

$ds/dt = -s\,i$

$di/dt = s\,i - \frac{1}{k}(1 - s)\,i$

$s = e^{-(k+1)(1-s)}$ (the value of s when i = 0)

The fraction that misses the rumor decreases exponentially with k

For k = 1, 20% miss the rumor

For k = 2, only 6% miss it
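The quoted percentages can be checked by solving the implicit equation s = e^{-(k+1)(1-s)} numerically. The sketch below uses plain fixed-point iteration started at s = 0; the function name and iteration count are assumptions for the example, and the trivial root s = 1 (nobody ever hears the rumor) is deliberately avoided by the starting point.

```python
import math


def residue(k: float, iterations: int = 200) -> float:
    """Smaller root of s = exp(-(k + 1) * (1 - s)), found by fixed-point iteration."""
    s = 0.0
    for _ in range(iterations):
        s = math.exp(-(k + 1.0) * (1.0 - s))
    return s


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, round(residue(k), 3))
```

For k = 1 the iteration settles at about 0.20 and for k = 2 at about 0.06, which matches the 20% and 6% figures above.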

Unnecessary phone calls


Residue

The value of s when i is zero: the remaining susceptible when the epidemic finishes

Traffic

m = Total update traffic / Number of sites

Delay

Average delay (tavg): difference between the time of the initial injection of an update and the arrival of the update at a given site averaged over all sites

The delay until (tlast) the reception by the last site that will receive the update during an epidemic

Criteria to characterize epidemics


Blind vs. Feedback

Feedback variation: a sender loses interest only if the recipient knows the rumor

Blind variation: a sender loses interest with probability 1/k regardless of the recipient

Counter vs. Coin

Instead of losing interest with probability 1/k, use a counter so that we lose interest only after k unnecessary contacts

$s = e^{-m}$

There are nm updates sent

The probability that a single site misses all these updates is $(1 - 1/n)^{nm}$

Simple variations of rumor spreading

m is the traffic

Counters and feedback improve the delay, with counters playing a more significant role
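The relation between residue and traffic follows from the miss probability given above. As a short worked step (assuming, as the slide does, that each of the nm updates goes to a site chosen uniformly at random, and letting n grow large):

```latex
\[
\left(1 - \frac{1}{n}\right)^{nm}
  = \exp\!\left(nm \,\ln\!\left(1 - \frac{1}{n}\right)\right)
  = \exp\!\left(-m - \frac{m}{2n} - \cdots\right)
  \;\longrightarrow\; e^{-m} \qquad (n \to \infty)
\]
```

So each additional unit of traffic per site reduces the expected residue by a factor of e.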


Push vs. Pull

Pull converges faster

If there are numerous independent updates, a pull request is likely to find a source with a non-empty rumor list

If the database is quiescent,

the push phase ceases to introduce traffic overhead,

while the pull continues to inject useless requests for updates

Simple variations of rumor spreading

Counter, feedback and pull work better


Minimization

Use push and pull together; if both sites already know the update, only the counter of the site with the smaller counter is incremented

Connection Limit

A site can be the recipient of more than one push in a cycle, while for pull, a site can service an unlimited number of requests

What if we set a limit:

Push gets better (traffic is reduced: since the spread grows exponentially, most traffic occurs at the end)

Pull gets worse


Hunting

If a connection is rejected, then the choosing site can “hunt” for alternate sites

Then push and pull behave similarly


Complex Epidemic and Anti-entropy

Anti-entropy can be run infrequently to back up a complex epidemic, so that every update eventually reaches (or is superseded at) every site

What happens when an update is discovered during anti-entropy: use rumor mongering (e.g., make it a hot rumor) or direct mail


Deletion and Death Certificates

Replace deleted items with death certificates which carry timestamps and spread like ordinary data

When old copies of deleted items meet death certificates, the old items are removed.

But when to delete death certificates?


Dormant Death Certificates

Define some threshold (but some items may be “resurrected”, i.e., re-appear)

If the death certificate is older than the expected time required to propagate it to all sites, then the existence of an obsolete copy of the corresponding data item is unlikely

Delete very old certificates at most sites, retaining “dormant” copies at only a few sites (like antibodies)

Use two thresholds, t1 and t2

+ a list of r retention site names kept with each death certificate (chosen at random when the death certificate is created)

Once t1 is reached, all servers but the servers in the retention list delete the death certificate

Dormant death certificates are deleted when t1 + t2 is reached
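The two-threshold rule can be expressed as a small state function. This is a sketch under stated assumptions: the `DeathCertificate` class, the state names, and the use of plain numeric timestamps are introduced for illustration only.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class DeathCertificate:
    key: str
    timestamp: float                 # time at which the item was deleted
    retention_sites: FrozenSet[str]  # r sites chosen at random at creation


def certificate_state(cert: DeathCertificate, site: str, now: float,
                      t1: float, t2: float) -> str:
    """State of the certificate at a given site.

    "active":    younger than t1, kept and propagated everywhere
    "dormant":   older than t1 but younger than t1 + t2, kept only at retention sites
    "discarded": otherwise
    """
    age = now - cert.timestamp
    if age < t1:
        return "active"
    if site in cert.retention_sites and age < t1 + t2:
        return "dormant"
    return "discarded"
```

Keeping dormant copies at only r sites bounds the storage cost while still leaving a chance that an obsolete copy of the deleted item meets its certificate.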


Anti-Entropy with Dormant Death Certificates

Whenever a dormant death certificate encounters an obsolete data item, it must be “activated”


How to choose partners

Consider spatial distributions in which the choice tends to favor nearby servers

Spatial Distribution


Spatial Distribution

The cost of sending an update to a nearby site is much lower than the cost of sending the update to a distant site

Favor nearby neighbors

Trade off between: Average traffic per link and Convergence times

Example: linear network, only nearest neighbor: O(1) and O(n) vs uniform random connections: O(n) and O(log n)

Determine the probability of connecting to a site at distance d

For spreading updates on a line, use a $d^{-2}$ distribution: the probability of connecting to a site at distance d is proportional to $d^{-2}$

In general, each site s independently chooses connections according to a distribution that is a function of Q_s(d), where Q_s(d) is the cumulative number of sites at distance d or less from s
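For the linear-network case, partner selection with probability proportional to d^{-a} can be sketched as follows. The function names, the integer site positions 0..n-1, and the default exponent a = 2 are assumptions for the example.

```python
import random
from typing import List


def partner_weights(position: int, n: int, a: float = 2.0) -> List[float]:
    """Unnormalised weights d ** -a for every site on a line of n sites.

    The site's own position gets weight 0 so it never selects itself.
    """
    return [0.0 if j == position else abs(j - position) ** -a
            for j in range(n)]


def choose_partner(position: int, n: int, rng: random.Random,
                   a: float = 2.0) -> int:
    """Pick an exchange partner with probability proportional to d ** -a."""
    weights = partner_weights(position, n, a)
    return rng.choices(range(n), weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    print([choose_partner(5, 11, rng) for _ in range(10)])
```

Nearby sites are chosen most of the time, which keeps per-link traffic low, while the occasional long-distance choice keeps convergence from degrading to the O(n) of nearest-neighbor-only exchange.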


Spatial Distribution and Anti-Entropy

Extensive simulation on the actual topology with a number of different spatial distributions

A different class of distributions, less sensitive to sudden increases of Q_s(d)

Let each site s build a list of the other sites sorted by their distances from s

Select anti-entropy exchange partners from the sorted list according to a function f(i), where i is its position on the list

(averaging the probabilities of selecting equidistant sites)

Non-uniform distributions induce less overload on critical links


Spatial Distribution and Rumors

Anti-entropy converges with probability 1 for a spatial distribution such that for every pair (s’, s) of sites there is a nonzero probability that s will choose to exchange data with s’

However, rumor mongering is less robust against changes in spatial distributions and network topology

As the spatial distribution is made less uniform, we can increase the value of k to compensate


Replication II:

A Push&Pull Algorithm

Updates in Highly Unreliable, Replicated Peer-to-Peer Systems [Datta, Hauswirth, Aberer, ICDCS03]


Replication in P2P systems

P-Grid

CAN

Unstructured P2P (sub-) network of replicas

How to update them?


Problems in real-world P2P systems

• All replicas need to be informed of updates.

• Peers have low online probabilities and a quorum cannot be assumed.

• Eventual consistency is sufficient.

• Updates are relatively infrequent compared to queries.

• Metrics: Communication overhead, latency and percentage of replicas getting the update

Updates in Highly Unreliable, Replicated Peer-to-Peer Systems [Datta, Hauswirth, Aberer, ICDCS03]


Problems in real-world P2P systems (continued)

• Replication factor is substantially higher than what is assumed for distributed databases.

• Connectivity among replicas is high.

• Connectivity graph is random.



Updates in replicated P2P systems

P2P system’s search algorithm will find a random online replica responsible for the key being searched.

The replicas need to be consistent (ideally)

Probabilistic guarantee: Best effort!

Assumption: each peer knows a subset of all replicas for an item

[Figure: network of replicas; legend: online, offline]



Update propagation combines a push phase and a pull phase

A push phase is initiated by the originator of the update that pushes the new update to a subset of responsible peers it knows, which in turn propagate it to responsible peers they know, etc (similar to flooding with TTL)

A pull phase is initiated by a peer that needs to update its copy. For example, because (a) it was offline (disconnected) or (b) has received a pull request but is not sure that it has the most up-to-date copy

Push and pull are consecutive, but may overlap in time


Algorithms

Push:

If replica p gets Push(U, V, Rf, t) for a new (U, V) pair

Define Rp = random subset (of size R * fr) of the replicas known to p

With probability PF(t): Push(U, V, Rf ∪ Rp, t+1) to Rp \ Rf

Rf: partial list of peers that have received the update; R: number of replicas; fr: fraction of the total replicas to which a peer initially decides to forward the update (fan-out)

Each message carries the list of peers to which the update has been sent

Parameters:

TTL counter t

PF(t) probability (locally determined at each peer) to send the update

|Rp|: size of the random subset (fan-out)

Item, version, counter (similar to counters, when TTL
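The push step can be sketched as a handler that a replica runs on receiving a Push message. This is an illustrative reading of the rule above, not the authors' code: the tuple layout of the message, the `seen` set used to detect a new (U, V) pair, and the rounding of R * fr are assumptions.

```python
import random
from typing import Callable, Dict, FrozenSet, List, Set, Tuple

Message = Tuple[str, int, FrozenSet[str], int]  # (item U, version V, Rf, round t)


def handle_push(known_replicas: List[str], seen: Set[Tuple[str, int]],
                msg: Message, R: int, f_r: float,
                PF: Callable[[int], float],
                rng: random.Random) -> Dict[str, Message]:
    """Return the messages (target -> message) this replica forwards."""
    U, V, R_f, t = msg
    if (U, V) in seen:
        return {}                      # not a new (U, V) pair: do nothing
    seen.add((U, V))
    # Rp: random subset of the replicas known locally, of size about R * fr.
    size = min(len(known_replicas), round(R * f_r))
    R_p = set(rng.sample(known_replicas, size))
    if rng.random() >= PF(t):
        return {}                      # forward only with probability PF(t)
    forwarded = R_f | R_p              # list carried along with the message
    return {q: (U, V, forwarded, t + 1) for q in R_p - R_f}
```

Sending only to Rp \ Rf is what suppresses the sequential duplicates, and the probability PF(t) together with the limited fan-out is what suppresses the parallel ones.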


Selective Push

[Figure: two push examples over rounds t and t+1, each marking the extra update message]

Avoid parallel redundant updates: messages are propagated only with probability PF < 1 and to a fraction of the neighbors

Avoid sequential redundant updates: partial lists of informed neighbors are transmitted with the message


Algorithms

Strategy: push the update to online peers as soon as possible, so that later all online peers have the update (possibly pulled) with high probability

Pull:

If p comes online, or has received no Push for time T

Contact online replicas

Pull updates based on version vectors


Scenario1: Dynamic topology

[Figure: example replica network with nodes 1 to 9]


Scenario2: Duplicate messages

[Figure: example replica network with nodes 1 to 9; legend: necessary messages, avoidable duplicates, unavoidable (?) duplicates]


Results: Impact of varying fanout

How many peers learn about the update

A limited fanout (fr) is sufficient to spread the update, since flooding is exponential. A large fanout will cause unnecessary duplicate messages


Results: Impact of probability of peer staying online in consecutive push rounds

Sigma (σ) probability of online peers staying online in consecutive push rounds:


Results: Impact of varying probability of pushing

Reduce the probability of forwarding updates with the increase in the number of push rounds


CUP: Controlled Update Propagation in Peer-to-Peer Networks [Roussopoulos, Baker 02]

PCX: Path Caching with Expiration

Cache index entries at intermediary nodes that lie on the path taken by a search query

Cached entries typically have expiration times

Not addressed: which items need to be updated as well as whether the interest in updating particular entries has died out

CUP: Controlled Update Propagation

Asynchronously builds caches of index entries while answering search queries

+ Propagates updates of index entries to maintain these caches (pushes updates)



Every node maintains two logical channels per neighbor:

a query channel: used to forward search queries

an update channel: used to forward query responses asynchronously to a neighbor and to update index entries that are cached at the neighbor (to proactively push updates)

Queries travel to the node holding the item

Updates travel along the reverse path taken by a query

Query coalescing: if a node receives two or more queries for an item, it pushes only one instance

Just one update channel (a node does not keep a separate open connection per request). All responses go through the update channel: a node uses interest bits to know to which neighbors to push the response



Each node decides individually:

When to receive updates

through registering its interest + an incentive-based policy to determine when to cut-off incoming updates

When to propagate updates



For each key K, node n stores

a flag that indicates whether the node is waiting to receive an update for K in response to a query

an interest vector: each bit corresponds to a neighbor and is set or clear depending on whether the neighbor is or is not interested in receiving updates for K

a popularity measure or request frequency of each non-local key K for which it receives queries

The measure is used to re-evaluate whether it is beneficial to continue caching and receiving updates for K
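The per-key bookkeeping can be sketched as a small data structure. This is only an illustration of the flag, interest vector, and popularity measure listed above: the class names, the dictionary used in place of a bit vector, and the simple query counter are assumptions, and the sketch covers only the case where the key is not yet cached.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class KeyState:
    pending: bool = False                                    # waiting for an update for K
    interest: Dict[str, bool] = field(default_factory=dict)  # one "bit" per neighbor
    popularity: int = 0                                      # number of queries seen for K


class CupNode:
    def __init__(self, neighbors: List[str]) -> None:
        self.neighbors = list(neighbors)
        self.keys: Dict[str, KeyState] = {}

    def on_query(self, key: str, from_neighbor: str) -> bool:
        """Record interest; return True only if the query must be pushed on."""
        state = self.keys.setdefault(key, KeyState())
        state.interest[from_neighbor] = True
        state.popularity += 1
        if state.pending:
            return False        # coalesced with a query already in flight
        state.pending = True
        return True

    def on_update(self, key: str) -> List[str]:
        """Clear the pending flag; return the neighbors to push the update to."""
        state = self.keys.get(key)
        if state is None:
            return []
        state.pending = False
        return [n for n in self.neighbors if state.interest.get(n, False)]
```

A node could compare `popularity` against the rate of incoming updates to decide when to clear its own interest bit at the upstream neighbor; that policy is left out of the sketch.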



For each key, the authority node that owns the key is the root of the CUP tree

Updates originate at the root of the tree and travel downstream to interested nodes

Types of updates: deletes, refresh, append

Example: A is the root for K3

Applicable to both structured and unstructured

In structured, the query path is well-defined with a bounded number of hops



Handling Queries for K:

1. Fresh entries for key K are cached

use them to push the response to the querying neighbors

2. Key K is not in cache

K is added and marked as pending (to coalesce potential bursts)

3. All cached entries for K have expired

push the query

Handling Updates for K:

An update of K is forwarded only to neighbors that have registered interest in K

Also, an adaptive control mechanism to regulate the rate of pushed updates



Adaptive control mechanism to regulate the rate of pushed updates

Each node N has a capacity U for pushing updates that varies with its workload, network bandwidth and/or network connectivity

N divides U among its outgoing update channels such that each channel gets a share that is proportional to the length of its queue

Entries in the queue may be re-ordered
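The proportional split of the push capacity can be sketched in a few lines. The function name and the dictionary of queue lengths are assumptions for the example; how U itself is measured is outside the sketch.

```python
from typing import Dict


def divide_capacity(U: float, queue_lengths: Dict[str, int]) -> Dict[str, float]:
    """Split capacity U among update channels in proportion to queue length."""
    total = sum(queue_lengths.values())
    if total == 0:
        # Nothing is queued, so no channel needs a share.
        return {channel: 0.0 for channel in queue_lengths}
    return {channel: U * length / total
            for channel, length in queue_lengths.items()}


if __name__ == "__main__":
    print(divide_capacity(10.0, {"a": 3, "b": 1, "c": 0}))
```

A channel with a longer queue gets a larger share, so backlogged neighbors drain faster, while idle channels consume none of the node's capacity.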
